Managing user expectations for completion in constrained resource environments (feedback systems!)

At the HL-LHC it is very likely that a single run of an analysis will face resource starvation or an occasional data outage. While everything we've built thus far brings analysis turnaround at the HL-LHC to ~1 week (much better than the back-of-the-envelope estimate), we do need to start thinking about how to give users more insight into how their job will complete.

Taking the concrete example of dask-awkward's necessary_columns optimization: we know ahead of execution which data are required, so from metadata alone we can calculate the total data volume the analysis needs (without predicate pushdown) and thus an expected run time (most analyses are I/O-bound by far, even with ML inference involved). It would be interesting to flesh this out all the way and see how accurate it is!
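As a rough sketch of that metadata-only estimate: sum the on-disk size of the needed columns and divide by a sustained read throughput. Everything here is an assumption for illustration: the function name `estimate_read_seconds`, the per-column byte counts, and the 1 GB/s throughput are made up, and the sketch does not call dask-awkward itself. In practice the column list would come from the necessary_columns optimization and the sizes from file metadata.

```python
from typing import Iterable, Mapping, Tuple


def estimate_read_seconds(
    column_bytes: Mapping[str, int],
    needed_columns: Iterable[str],
    throughput_bytes_per_s: float,
) -> Tuple[int, float]:
    """Return (total bytes to read, estimated seconds) for an I/O-bound job.

    Assumes run time is dominated by reading the needed columns at a
    constant sustained throughput (no predicate pushdown, no caching).
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    total = sum(column_bytes[name] for name in needed_columns)
    return total, total / throughput_bytes_per_s


# Hypothetical per-column sizes summed over the whole dataset (bytes).
column_bytes = {
    "Muon_pt": 40 * 10**9,
    "Muon_eta": 40 * 10**9,
    "Jet_pt": 120 * 10**9,
    "Jet_btag": 120 * 10**9,
}
# Hypothetical output of a necessary-columns style analysis of the task graph.
needed = ["Muon_pt", "Jet_pt"]

# Assumed sustained read throughput of 1 GB/s.
total_bytes, seconds = estimate_read_seconds(column_bytes, needed, 1e9)
print(f"{total_bytes / 1e9:.0f} GB to read, ~{seconds:.0f} s at 1 GB/s")
```

Comparing this number against the measured wall time of a real run would be the simplest check of how accurate the estimate is.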

Going a bit further, there's the idea of a probe job that would let us understand what happens when we do have predicate pushdown: run on a fraction of the data to learn the basic characteristics of the proposed workflow. This could tell us about resource availability locally, which we could then project to the needs of the full dataset. If we're careful about this, it would let us estimate turnaround times as a function of how much of a dataset is processed, and suggest those to the user as alternatives to complete execution.
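A minimal sketch of that projection, assuming wall time scales linearly with the fraction of the dataset processed (a strong assumption: it ignores startup costs, caching, and changing resource availability). The names `ProbeResult`, `project_full_runtime`, and `fraction_within_budget` are hypothetical, and the probe numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    fraction: float      # fraction of the dataset the probe processed, in (0, 1]
    wall_seconds: float  # measured wall time of the probe job


def project_full_runtime(probe: ProbeResult) -> float:
    """Extrapolate the probe's wall time to the full dataset (linear scaling)."""
    if not 0 < probe.fraction <= 1:
        raise ValueError("probe fraction must be in (0, 1]")
    return probe.wall_seconds / probe.fraction


def fraction_within_budget(probe: ProbeResult, budget_seconds: float) -> float:
    """Fraction of the dataset expected to finish within a time budget."""
    return min(1.0, budget_seconds / project_full_runtime(probe))


# Hypothetical probe: 1% of the dataset took one hour.
probe = ProbeResult(fraction=0.01, wall_seconds=3600.0)
full_seconds = project_full_runtime(probe)

# Offer the user a few turnaround options instead of only "run everything".
options = {
    hours: fraction_within_budget(probe, hours * 3600.0)
    for hours in (24, 72, 168)
}
for hours, frac in options.items():
    print(f"{hours:>3} h budget -> ~{frac:.0%} of the dataset")
```

The `options` table is the kind of thing that could be shown to a user as the tunable knob: pick a turnaround time and see how much of the dataset it buys.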

Long story short - partial execution of an analysis is probably going to be the common mode at the HL-LHC. We have the tools to understand the characteristics of an analysis (data needs, resource needs, etc.) and present them to the user as a tunable knob. We should talk about how to present this large amount of information in a reasonable way that lets users succeed in achieving their analysis goals and, more generally, in doing good science.

HSF / PyHEP.dev-workshops
