futureverse / progressr

三 R package: An Inclusive, Unifying API for Progress Updates
https://progressr.futureverse.org
281 stars 12 forks source link

Any way to inform ETA estimate with prior info? #142

Open mike-lawrence opened 2 years ago

mike-lawrence commented 2 years ago

I'm working on a pipeline-prototyping package that uses progressr to track progress as many elements of a list are processed and where efficient re-processing is enabled by skipping over already processed elements. This results in a scenario where I gather the current ETA calculations of progressr become inaccurate as early elements that have already been successfully processed return from the processing function quickly while later elements can be expected to take much longer. I save history data associated with each element, including processing time duration, and I'm wondering if it'd be possible to provide that duration to progressr somehow to leave it with more accurate ETA estimates. A minimal example of my case is:

options(
    progressr.handlers = progressr::handler_progress(
        format   = "[:bar] :spin :current/:total :percent in :elapsed ETA: :eta (:message)"
        , clear = FALSE
    )
)
options(progressr.clear=FALSE)
f = function(.x,pb){
    x_hash = digest::digest(.x)
    if(file.exists(x_hash)){
        t = readRDS(x_hash)
    }else{
        t = runif(1,10,20)
        Sys.sleep(t)
        if(.x%%2){  # odd elements get saved, triggering skip next time they're processed
            saveRDS(t,x_hash)
        }
    }
    pb(duration = t) #imagined API for supplying the duration manually

}

#on first run, ETA is accurate
progressr::with_progress({
    .x = 1:10
    pb <- progressr::progressor(along = .x)
    y = furrr::future_map(.x=.x,.f=f,pb=pb,.progress=F)
})

#on second run, some elements are skipped internally by f()
progressr::with_progress({
    .x = 1:10
    pb <- progressr::progressor(along = .x)
    y = furrr::future_map(.x=.x,.f=f,pb=pb,.progress=F)
})
HenrikBengtsson commented 2 years ago

The progressr package is not aware of time. The task for estimating ETA is passed on to the underlying progress handler, e.g. the progress package (as here). This was a deliberate choice to keep the progressr API as simple as possible, especially in its early stage of life.

One idea I had in the past related to this was to signal that some progress steps are skipped whereas others are not (the default), e.g.

pb(skip = TRUE)

or possibly

pb(skip = 0.9)

where skip is a fraction in [0,1] where skip=1.0 is full skip ("zero cost"), and skip=0.9 (10% cost), etc. (Maybe, cost is a better name, where cost = 1.0 is the default)

Then the progress handler can do whatever they'd like with that info, e.g. adjust their ETA estimates, or completely ignore it. See also https://github.com/r-lib/progress/issues/120 for my proposal to the progress package. This is an idea that needs to maturity and be explored, e.g. is skip the most generic approach (where ETA is just a special case), is it sufficient, or do we need more.

tripartio commented 9 months ago

progressr::handlers('cli') supports very accurate ETA estimates by default with no further code needed. I am especially surprised about how good the estimates are even with parallel processing with the {future} framework.

HenrikBengtsson commented 9 months ago

progressr::handlers('cli') supports very accurate ETA estimates by default with no further code needed. I am especially surprised about how good the estimates are even with parallel processing with the {future} framework.

Thanks for sharing your experience and observations.

FWIW, it makes sense that the ETA works equally well in parallel. cli cannot even tell if we're running sequentially or parallelly; all it sees is an incoming stream of progress updates. The only difference, when running in parallel, is that they'll arrive more frequently. So, cli just concludes that things are running faster and adjusts its ETA accordingly.