console-rs / indicatif

A command line progress reporting library for Rust
MIT License

Estimates become extremely large if progress updates are infrequent #556

Open teor2345 opened 1 year ago

teor2345 commented 1 year ago

We're using indicatif via howudoin, to display events that update every few minutes. Sometimes there can be delays of up to 10 minutes.

We're seeing extremely large estimates when there aren't any events for a few minutes.

This is the underlying cause of the panics in #554 in our application. There aren't any updates for a few minutes, so the estimate becomes billions of years. Eventually, it is outside the range of Duration, which panics.

Is it possible to make EXPONENTIAL_WEIGHTING_SECONDS configurable, or use an algorithm that doesn't have this exponentially increasing behaviour when there aren't any updates? (I have read the discussion in #394 and related tickets.)

Here's an example of the beginning of an exponential increase:

[Screenshot 2023-06-29 at 10:12:32: the estimate beginning to increase exponentially]
djc commented 1 year ago

@afontenot would be great if you have any ideas how to avoid this.

afontenot commented 12 months ago

Sure, this was something that came up in the development of the new algorithm. I had initially planned to make the behavior around this configurable in two ways (which I'll describe below), but we ended up deciding to leave it out in favor of having good defaults.

The issue here is that, given the assumptions made by the algorithm, a very large ETA is entirely reasonable if no progress has occurred in e.g. 2 minutes. The weighting of the exponential function is such that the most recent 15 seconds provide most (but not all) of the data in the average. The reason for this is that it's designed to be reactive on time scales that matter to a person continually watching progress - for example, a file transfer. It's not tuned for generating good estimates for long, intermittent activities.

On a technical level, this is the result of two decisions:

  1. The specific weighting of the algorithm (15 seconds provides 90% of the weight). This was originally going to be configurable.
  2. The "live update" behavior, meaning that the estimate updates whenever a tick occurs regardless of whether any new progress occurred during the tick. This works out okay for progress bar consumers who want a manual tick, and works out great for the "file transfer" type use cases I mentioned, because in those cases you want a revised estimate in the event of a stall. (If the network cable got unplugged, the transfer is never going to complete.) Unfortunately, it's much less helpful in the case of a steady tick combined with intermittent progress. I believe I mentioned during development that some would find this behavior annoying and that there should probably be a setting to disable it. When you have predictable intermittent stalls, it's less annoying to just wait for progress to continue rather than having the progress rate estimate exponentially approach zero.

Of these two, I'd say the first is most directly implicated here. Even if you implemented the second feature, you'd see annoying jumps in the estimate with a progress stall of 10 minutes. The exponential smoothing that the algorithm is designed to provide would have basically no effect because the time scale is much too small.
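To make the interaction of these two decisions concrete, here is a minimal sketch (not indicatif's actual estimator; the struct, method names, and the tau-from-window formula are all illustrative assumptions) of an exponentially weighted rate with the live-update behavior from point 2, showing how a two-minute stall drives the ETA toward astronomical values:

```rust
/// Weighting constant chosen so the most recent 15 s carry ~90% of the
/// weight: solve exp(-15 / tau) = 0.10  =>  tau = 15 / ln(10).
/// (Illustrative; not how indicatif stores its constant.)
const TAU_SECS: f64 = 15.0 / std::f64::consts::LN_10;

struct RateEstimator {
    rate: f64, // smoothed progress rate, in steps per second
}

impl RateEstimator {
    fn new() -> Self {
        Self { rate: 0.0 }
    }

    /// Fold in `steps` units of progress observed over the last `dt` seconds.
    /// Called on every tick, even when `steps == 0` (the "live update").
    fn tick(&mut self, steps: f64, dt: f64) {
        let keep = (-dt / TAU_SECS).exp(); // weight retained by old data
        let instant_rate = steps / dt;
        self.rate = keep * self.rate + (1.0 - keep) * instant_rate;
    }

    /// ETA diverges as the smoothed rate decays toward zero during a stall.
    fn eta_secs(&self, remaining: f64) -> f64 {
        remaining / self.rate
    }
}

fn main() {
    let mut est = RateEstimator::new();
    // Steady progress: 10 steps/s, one tick per second, for 30 s.
    for _ in 0..30 {
        est.tick(10.0, 1.0);
    }
    println!("ETA before stall: {:.0} s", est.eta_secs(1000.0));
    // Stall: two minutes of one-second ticks with zero progress.
    for _ in 0..120 {
        est.tick(0.0, 1.0);
    }
    // The rate has decayed by exp(-120/tau), so the ETA is now enormous.
    println!("ETA after 2 min stall: {:.2e} s", est.eta_secs(1000.0));
}
```

With these numbers the stall multiplies the ETA by roughly exp(120 / tau), about eight orders of magnitude, which is exactly the "billions of years" failure mode from the original report.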

I think it would not be unreasonable to try to make this configurable. Everything should just work if you set the value to 20 minutes or even higher. (With very high settings, there's not much down-weighting of older data, so you get behavior approximating a linear average since the beginning of progress, which is often appropriate for these "predictable intermittent stall" cases.)
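A rough calculation of why a 20-minute setting would behave well here (the window-to-decay-constant mapping below is my assumption about how such a knob could be defined, not indicatif's internals): compare how much a 10-minute stall decays the smoothed rate under the default 15 s window versus a 20-minute one.

```rust
/// Map a "this window carries ~90% of the weight" setting to a decay
/// constant: exp(-window / tau) = 0.10  =>  tau = window / ln(10).
/// (Hypothetical configuration knob, not an indicatif API.)
fn tau_for_window(window_secs: f64) -> f64 {
    window_secs / std::f64::consts::LN_10
}

/// With zero progress, live updates just decay the smoothed rate.
fn rate_after_stall(initial_rate: f64, stall_secs: f64, tau: f64) -> f64 {
    initial_rate * (-stall_secs / tau).exp()
}

fn main() {
    let stall = 600.0; // a 10-minute stall
    for window in [15.0, 1200.0] {
        let tau = tau_for_window(window);
        let r = rate_after_stall(10.0, stall, tau);
        println!("window {:>5.0} s: rate 10.0 -> {:.3e} steps/s", window, r);
    }
}
```

With the 15 s default the rate collapses by ~40 orders of magnitude; with a 20-minute window it only drops to about a third of its value, so the ETA roughly triples instead of exploding.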

teor2345 commented 12 months ago

> I think it would not be unreasonable to try to make this configurable. Everything should just work if you set the value to 20 minutes or even higher.

Thanks, that would be helpful for us.

We expect progress every 75 seconds for one of our progress bars, and every 10 seconds to 3 minutes for the other.

djc commented 12 months ago

Requiring configuration for this kind of thing seems like an anti-pattern to me: requiring users to give us information that they then have to benchmark and keep up to date, when it feels like there is some algorithm we could use to avoid the current edge case behavior.

Can we, for example, define some boundary where we switch to different tuning parameters?

teor2345 commented 11 months ago

> Requiring configuration for this kind of thing seems like an anti-pattern to me: requiring users to give us information that they then have to benchmark and keep up to date, when it feels like there is some algorithm we could use to avoid the current edge case behavior.

I agree.

> Can we, for example, define some boundary where we switch to different tuning parameters?

Can we dynamically change the weighting based on the average/median time between the most recent N progress updates? If needed, we could exclude the last 1-2 updates, because they might represent a disconnection or other instability. (A median would do this automatically.)
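A rough sketch of that idea (all names hypothetical, not indicatif API): derive the weighting window from the median gap between recent real progress updates, so that a single outlier gap, such as a disconnection, does not dominate the choice.

```rust
/// Median of a set of inter-update gaps, in seconds.
fn median_gap(mut gaps: Vec<f64>) -> f64 {
    gaps.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = gaps.len();
    if n % 2 == 1 {
        gaps[n / 2]
    } else {
        (gaps[n / 2 - 1] + gaps[n / 2]) / 2.0
    }
}

/// Pick a "90% of the weight" window a few times the typical update
/// interval, clamped to no less than the current 15 s default.
/// (The 4x multiplier is an arbitrary illustrative choice.)
fn adaptive_window(recent_gaps: &[f64]) -> f64 {
    (4.0 * median_gap(recent_gaps.to_vec())).max(15.0)
}

fn main() {
    // Gaps include one 600 s outlier (e.g. a disconnection);
    // the median ignores it automatically.
    let gaps = [75.0, 70.0, 80.0, 600.0, 72.0];
    println!("window: {:.0} s", adaptive_window(&gaps)); // prints "window: 300 s"
}
```

For fast, frequent updates the clamp keeps today's behavior, while the 75-second case from earlier in this thread would get a window of a few minutes and stop producing runaway estimates.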

This would work for us, because each of our progress bars has two different modes:

djc commented 11 months ago

@afontenot would you be able to spend more time on this? If not, that's fine too, I can dig into it more.

SolidTux commented 8 months ago

> Requiring configuration for this kind of thing seems like an anti-pattern to me: ...

I would highly appreciate an option to turn off the exponential weighting entirely. I guess if the decay rate were configurable, one could set it to a very high value as you mentioned, but I fear I would have to set it so high that feeding that many seconds into an exponential could cause numerical problems.

I have programs that run for up to a few days, with steps sometimes taking hours. The steps are very consistent in length, so the exponential weighting provides no benefit at all. Also, there is no way to further subdivide the steps, since most of the time is spent in a single call to LAPACK.

Without a steady tick, the elapsed time does not get updated often enough; for example, there is no way to see how long the program has been running before the first step completes.