ETA times go up instead of down

sharkdp commented 1 year ago

Thank you for providing and maintaining this crate.

We have the following use case: we run benchmarks where each iteration takes the same time (approximately), but the iteration time can be anything from a few milliseconds to several hours. Especially for longer-running benchmarks (> a few seconds), we want to show an ETA to the user.

Recently, we noticed a regression in the ETA computation after updating to indicatif 0.17.5 (but the behavior wasn't exactly great before either..). It can be demonstrated with the following program (which simply ticks the progress bar count times, every five seconds):

use std::thread;
use std::time::Duration;

use indicatif::{ProgressBar, ProgressStyle};

fn main() {
    let count = 2;

    let pb = ProgressBar::new(count);
    pb.set_style(ProgressStyle::with_template("{wide_bar} ETA {eta_precise}").unwrap());
    pb.enable_steady_tick(Duration::from_millis(10));

    for _ in 0..count {
        thread::sleep(Duration::from_millis(5000));
        pb.inc(1);
    }

    pb.finish();
}

The output looks like this (ignore spacing issues, the ETA is what I'm looking at):

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:00
██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:05
██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:06
██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:07
██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:08
██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:09
████████████████████████████████████████████████████████████ ETA 00:00:00

After five seconds, the ETA jumps to 00:00:05, which is expected (there is one more iteration, the first took 5 seconds). But then, instead of continuously counting down to zero, the time goes up to almost 10 seconds, shortly before the second iteration finishes.

The behavior becomes even worse when count is larger than 2 or if the iteration time is larger.

This looks very related to #556. I wasn't sure if I should comment there, so feel free to close this as a duplicate if it's the exact same thing.

See also: previous issues reported by me with the same use case, both resolved at the time: #122 #81.

afontenot commented 1 year ago

This seems like a clear case where enable_steady_tick is giving unwanted behavior, so I would suggest just disabling that.

The issue you're seeing here is that indicatif heavily prioritizes data from the most recent 15 seconds, under the assumption that for most use cases changes that happen on order of 15 seconds are significant. When you enable the steady tick, the automatic tick revises the estimate every time, so if you go 5 seconds without any new data that's a fairly significant event as far as the algorithm is concerned. Making improvements to this approach is https://github.com/console-rs/indicatif/issues/556 but I think you'd be fine without a steady tick here.

Incidentally, indicatif makes the assumption that steps (the thing you increment) are the basic unit of progress. So until a step occurs, indicatif assumes that progress is stalled, meaning that an increased estimate makes sense. In your case this isn't really true, progress is constantly happening but you only add a step when a benchmark is completed. Indicatif doesn't try to guess how much progress is "secretly" happening in the background in between steps, it just assumes no progress has happened. For e.g. network transfers of bytes, this is a safe assumption. For other use cases, maybe not.

sharkdp commented 1 year ago

This seems like a clear case where enable_steady_tick is giving unwanted behavior, so I would suggest just disabling that.

Then how do I show up-to-date ETA information to the user? How do I indicate progress by updating the spinner (not included in the minimal example above)?

The issue you're seeing here is that indicatif heavily prioritizes data from the most recent 15 seconds, under the assumption that for most use cases changes that happen on order of 15 seconds are significant.

To be honest, it feels wrong to me to have a hard-coded time-scale in a general-purpose progress bar library.

so if you go 5 seconds without any new data that's a fairly significant event as far as the algorithm is concerned

Why? Is indicatif a general-purpose library or is it (now?) only meant to be used for use cases where progress is being made much faster than 5 seconds?

Incidentally, indicatif makes the assumption that steps (the thing you increment) are the basic unit of progress. So until a step occurs, indicatif assumes that progress is stalled, meaning that an increased estimate makes sense.

This seems like a very strong assumption. Isn't "we will advance the progress bar at approximately equidistant time intervals" the most common use case? In that scenario, it would be more reasonable to assume that inc calls will happen with a regular frequency. And progress is being made in between those inc calls. It's just impossible for the application to report it in a more fine-granular way.

For e.g. network transfers of bytes, this is a safe assumption. For other use cases, maybe not.

Maybe that's just me, but "network transfer of bytes" with irregularly appearing stalls of progress seems like a much more exotic use case compared to: we increment the progress at a steady frequency. It would be sad if indicatif would not support that use case (anymore).

djc commented 1 year ago

I think this is effectively a duplicate of #556. I would like us to do better here but it's been hard to prioritize for me. I agree the steady ticker should make sense here and we should probably tweak our ETA calculation again. It's just hard to come up with good heuristics to cover all the different use cases -- any help would be appreciated.

sharkdp commented 1 year ago

It's just hard to come up with good heuristics to cover all the different use cases

I understand that you would like to avoid adding new parameters if possible. But I think it's not unreasonable to have a EtaHint/EtaPolicy/… parameter that would help with making ETA computation better. Maybe something like

/// Set this parameter to improve the ETA calculation by providing
/// prior knowledge about the expected timing of `.inc` calls.
enum EtaHint {
    /// Progress is expected to happen at a roughly constant frequency, e.g. every X milliseconds.
    /// If the (rough) duration between `.inc` calls is known, it can be specified to even get ETA
    /// estimates before the first `.inc` call.
    Regular(Option<Duration>),

    /// Progress can be erratic (e.g. transfer of bytes over a network connection). Assume that
    /// progress is currently stalled if no `.inc` has happened within the specified time duration.
    Erratic(Duration),
}

djc commented 1 year ago

I would like to see more proof that we can't come up with a better heuristic before we add configuration for this. It's easy for everyone to suggest configuration that serves their own use cases but overall worse for the ergonomics of using the crate.

sharkdp commented 1 year ago

I would like to see more proof that we can't come up with a better heuristic before we add configuration for this. It's easy for everyone to suggest configuration that serves their own use cases

Fair point. I'm not familiar with the math/algorithm behind the ETA estimation. But I would argue that anything that involves a concrete, hard-coded time scale (the 15 seconds mentioned above) will turn indicatif into a library that is only suitable for a subset of use-cases (for the current algorithm, it only works for use cases where updates come in much faster than the hard-coded time scale).

So if you want to find an general-purpose algorithm without any parameters that supports all possible use cases, it can not involve a hard-coded time scale. Instead, it can only look at the timing deltas between .inc calls and make its prediction from there.

Note: I'm not saying that an estimator with a hard-coded time scale is a bad idea. But it can't solve all use cases (without being parametrizable) because it always breaks a (scaling) symmetry in the time dimension.

but overall worse for the ergonomics of using the crate.

I don't know. An additional parameter is always a liability in terms of API stability (and increases code complexity), but I don't think it necessarily makes this crate more complex to use. There could still be a reasonable default setting.

nagos commented 1 year ago

for the current algorithm, it only works for use cases where updates come in much faster than the hard-coded time scale)

I came across this bug in latest version hyperfine (benchmarking tool). With long cycle tymes (60+ seconds) 10 minutes ETA goes up to several days. After each tick it jumps to reasonable value.

This example goes up to 30 days ETA.

use indicatif::{ProgressBar, ProgressStyle};

fn main() {
    let cnt = 10;
    let style = ProgressStyle::default_spinner()
        .template("{pos} ETA {eta_precise}")
        .expect("no template error");
    let bar = ProgressBar::new(cnt);
    bar.set_style(style);
    bar.enable_steady_tick(std::time::Duration::from_millis(80));

    for _ in 0..cnt {
        bar.inc(1);
        std::thread::sleep(std::time::Duration::from_secs(60));
    }
    bar.finish();
}

2 ETA 21d 00:48:09

This behavior was introdused in 011c998820a5e7f2a929c18b0aa882491b05d579

alexheretic commented 10 months ago

Here is another example, video encoding. The eta is 13h then increases up to 8d until ticking back to 13h and repeating.

long-encode.webm

With set_position logging, we can see the eta increases while the progress integer remains the same and ticks back when the integer increments.

long-encode-debug-logging.webm

djc commented 10 months ago

Yes, I don't think we need more evidence that there is an issue, just someone willing to invest effort to research making it better. Unfortunately I'm unlikely to have time to do that anytime soon.

I would like to see more proof that we can't come up with a better heuristic before we add configuration for this. It's easy for everyone to suggest configuration that serves their own use cases but overall worse for the ergonomics of using the crate.

nagos commented 10 months ago

Ok. How about we just revert it, untill better solution found.

PR #620

afontenot commented 3 months ago

I did some thinking about this issue today.

The primary thing I don't want to break is stall detection. The estimated time remaining should approach infinity exponentially if no updates are received indefinitely; that is a basic and correct behavior of the EWA. (If you unplug your modem, a data transfer will never complete.)

So I decided to try the following idea: keep an unbiased estimate of the maximum delay between updates, and set the weight to a multiple of that value when it exceeds the default weighting (15 seconds). I chose the multiple 6.6 because that means that if progress stalls for 2x the estimated maximum delay between updates, the ETA will double. This strikes a good balance between fluctuating too much and requiring too long to report a stall, IMO. *

Obviously this is still not going to be ideal for everyone. Something that's very important to emphasize is that if you don't want live updates on your ETA, which necessarily increase when progress isn't made, you should not be using enable_steady_tick. You should be sending your own ticks when you have updated data instead.

The one downside of this approach I can think of is that it seriously complicates the estimate from a mathematical perspective. If we kept every single tick stored internally, we could correctly reweight all the old data whenever the tuning parameter gets changed, but we don't (and can't) store all that data. So with this change indicatif will effectively be using old data with the "wrong" weights in its new estimates, so I expect estimates to only asymptotically approach correct behavior. It's possible that pathological behavior could occur when progress occurs in unexpected patterns - test cases are definitely needed here.

Please check out my branch and run it with your use cases and let me know if you can find behavior that seems completely unreasonable, especially if the same test case works fine with the current indicatif release.

https://github.com/afontenot/indicatif/tree/dynamic-estimator-reweighting

* There's an unfortunate corner case. If your Internet goes down for 12 hours, then comes back and the transfer resumes, a new stall won't be adequately reported in the estimate unless progress stops for a day or more. I don't know that this is "wrong" exactly, but it's potentially frustrating.

djc commented 3 months ago

@afontenot sounds promising! Would be great to get more feedback from people who are suffering from this issue.

nagos commented 3 months ago

The primary thing I don't want to break is stall detection. The estimated time remaining should approach infinity

Maybe this problem could be resolved with a head-on approach. Implement stall detection, and replace ETA with "--:--" when triggered. Possible conditions:

some multiple of average update, maybe 5x
timeout

console-rs / indicatif

ETA times go up instead of down #580