Open mike-lawrence opened 2 years ago
The progressr package is not aware of time. The task of estimating the ETA is passed on to the underlying progress handler, e.g. the progress package (as here). This was a deliberate choice to keep the progressr API as simple as possible, especially in its early stage of life.
One idea I had in the past related to this was to signal that some progress steps are skipped whereas others are not (the default), e.g.
pb(skip = TRUE)
or possibly
pb(skip = 0.9)
where skip is a fraction in [0,1]: skip = 1.0 is a full skip ("zero cost"), skip = 0.9 means 10% cost, etc. (Maybe cost is a better name, where cost = 1.0 is the default.)
Then the progress handler can do whatever it likes with that info, e.g. adjust its ETA estimates, or completely ignore it. See also https://github.com/r-lib/progress/issues/120 for my proposal to the progress package. This is an idea that needs to mature and be explored, e.g. is skip the most generic approach (where ETA is just a special case), is it sufficient, or do we need more?
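To make the idea concrete, here is a minimal sketch of how a handler might turn per-step costs into an ETA. It is only an illustration: `eta_by_cost()` is a hypothetical helper, not part of progressr or progress, and it assumes the relative cost of every step (cost = 1 - skip) is known up front, which the proposal above does not require.

```r
# Hypothetical helper; neither progressr nor progress provides this.
# `costs` holds one relative cost per step: 1 = normal step, 0 = fully skipped
# (that is, cost = 1 - skip in the notation above).
eta_by_cost <- function(elapsed, costs, n_done) {
  stopifnot(n_done >= 0, n_done <= length(costs))
  done_cost <- sum(costs[seq_len(n_done)])
  remaining_cost <- sum(costs) - done_cost
  if (remaining_cost == 0) return(0)     # nothing left to do
  if (done_cost == 0) return(NA_real_)   # no costed work observed yet
  # Seconds per unit of cost so far, times the cost still to come
  elapsed / done_cost * remaining_cost
}

# 10 steps; the first 5 are skipped (cost 0), the rest are normal (cost 1).
costs <- c(rep(0, 5), rep(1, 5))

# After 7 steps and 20 seconds: 2 units of work done, 3 remaining.
eta_by_cost(elapsed = 20, costs = costs, n_done = 7)  # 20 / 2 * 3 = 30
```

A purely count-based estimate would give 20 / 7 * 3 seconds here, because it treats the five skipped steps as if they had taken real time.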
progressr::handlers('cli') supports very accurate ETA estimates by default with no further code needed. I am especially surprised about how good the estimates are even with parallel processing with the {future} framework.
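For readers who want to try this, a minimal sketch of the setup might look as follows. It assumes the progressr and cli packages are installed; `slow_sum()` is a made-up function used only for illustration.

```r
library(progressr)

# Report progress via the cli handler
handlers("cli")

# Hypothetical slow function that signals one progress update per element
slow_sum <- function(xs) {
  p <- progressor(along = xs)
  total <- 0
  for (x in xs) {
    Sys.sleep(0.01)  # stand-in for real work
    total <- total + x
    p()              # signal that one step has completed
  }
  total
}

# Progress is only reported inside with_progress()
with_progress(y <- slow_sum(1:20))
```

The function itself only signals progress; how, and whether, an ETA is shown is left to the handler.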
Thanks for sharing your experience and observations.
FWIW, it makes sense that the ETA works equally well in parallel. cli cannot even tell whether we're running sequentially or in parallel; all it sees is an incoming stream of progress updates. The only difference, when running in parallel, is that the updates arrive more frequently. So, cli just concludes that things are running faster and adjusts its ETA accordingly.
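A toy illustration of that point, using a simple rate-based estimator. This is a sketch only: `eta_from_stream()` is hypothetical and is not how cli actually computes its ETA.

```r
# Hypothetical estimator: all it sees is when each progress update arrived
# (in seconds since the start) and how many steps there are in total.
eta_from_stream <- function(arrival_times, n_total) {
  n_done <- length(arrival_times)
  elapsed <- max(arrival_times)
  # Average seconds per update so far, times the number of updates left
  elapsed / n_done * (n_total - n_done)
}

# 100 steps of about 1 second each, inspected after 20 updates.
sequential <- seq(1, 20)          # one update per second
parallel4  <- rep(1:5, each = 4)  # four workers: four updates per second

eta_from_stream(sequential, n_total = 100)  # 20 / 20 * 80 = 80
eta_from_stream(parallel4,  n_total = 100)  # 5 / 20 * 80 = 20
```

The estimator has no notion of workers; the faster arrival rate alone shortens the estimate.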
I'm working on a pipeline-prototyping package that uses progressr to track progress as the elements of a list are processed, and where efficient re-processing is enabled by skipping over already processed elements. As I understand it, this makes the current ETA calculations of progressr inaccurate: early elements that have already been processed successfully return from the processing function quickly, while later elements can be expected to take much longer. I save history data for each element, including its processing duration, and I'm wondering whether it would be possible to pass that duration to progressr somehow so that it can produce more accurate ETA estimates. A minimal example of my case is: