bcpierce00 / unison

Unison file synchronizer
GNU General Public License v3.0

Improve sync progress status display (mainly TUI) #899

Closed tleedjarv closed 1 year ago

tleedjarv commented 1 year ago

Motivation

There are more details in commit messages and code comments, but the basic premise is that the current ETA calculation is just not good enough, for various underlying reasons. To improve calculation of ETA, we need to improve calculation of the progress rate.

The nature of actions taken during a sync makes the rate of progress highly volatile, and this is not something that can be easily changed (though it may be improved upon in the future). This PR focuses on trying to get reasonably usable results out of the volatile data. In other words, it's mostly about algorithms.

Description

Two new progress rates are calculated: a less stable one which is suitable for displaying to users and a more stable one which is suitable for ETA calculation. The less stable rate is calculated as an exponential moving average (aka exponential smoothing) of the actual sync rate (which is simply calculated as delta synced bytes over delta time). The more stable rate is calculated as an exponential moving average of the former rate (double smoothing) with an additional Gaussian filter to exclude tiny changes. I've done countless tests to end up with these algorithms and these constant values. It's impractical to go any further without an automated setup... and I'm not going to do that (just yet).
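To make the idea concrete, here is a minimal sketch of this kind of double smoothing. This is not the actual PR code: the smoothing constants, the record fields, and the dead-band threshold are illustrative assumptions, and the Gaussian filter for excluding tiny changes is approximated here by a simple relative dead band.

```ocaml
(* Illustrative sketch only, not the PR's implementation. *)
let alpha_display = 0.3   (* responsive: for the rate shown to the user *)
let alpha_eta = 0.05      (* sluggish: for the rate used to compute ETA *)

type state = {
  mutable prev_bytes : float;
  mutable prev_time : float;
  mutable rate_display : float;  (* singly smoothed rate *)
  mutable rate_eta : float;      (* doubly smoothed rate *)
}

(* Exponential moving average step. *)
let ema alpha prev sample = prev +. alpha *. (sample -. prev)

(* Called periodically with the total bytes synced so far and the current time. *)
let update st total_bytes now =
  let dt = now -. st.prev_time in
  if dt > 0. then begin
    (* Raw rate: delta of synced bytes over delta of time. *)
    let raw = (total_bytes -. st.prev_bytes) /. dt in
    st.prev_bytes <- total_bytes;
    st.prev_time <- now;
    (* First smoothing: the display rate follows the raw rate fairly closely. *)
    st.rate_display <- ema alpha_display st.rate_display raw;
    (* Second smoothing: the ETA rate follows the display rate; tiny changes
       are filtered out (dead band standing in for the Gaussian filter). *)
    let candidate = ema alpha_eta st.rate_eta st.rate_display in
    if abs_float (candidate -. st.rate_eta) > 0.01 *. (st.rate_eta +. 1.) then
      st.rate_eta <- candidate
  end
```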

The ETA itself is no longer calculated to a resolution of 1 second (except for ETAs under 2 minutes) because, even with a very stable calculated rate, the ETA for longer syncs can change by several (or even tens of) seconds for just a tiny rate change.
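As a sketch of this coarsening (the thresholds below are illustrative assumptions, not necessarily the PR's exact values):

```ocaml
(* Illustrative only: round the ETA to a coarser resolution the longer it is,
   so tiny changes in the smoothed rate don't make the display jump around. *)
let round_eta seconds =
  if seconds < 120. then Float.round seconds                         (* 1 s steps *)
  else if seconds < 3600. then 10. *. Float.round (seconds /. 10.)   (* 10 s steps *)
  else 60. *. Float.round (seconds /. 60.)                           (* 1 min steps *)
```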

Further details are included in collapsed sections.

Detailed motivation with illustrations

_But is the current state really that bad?_ Yes, it is. This is a graph of a real sync, showing the actual rate and the rate used for ETA calculation. Here and onwards, all graphs represent real syncs, not simulations; the x-axis is always in seconds, the y-axis is in bytes/s, and "Current rate" indicates the actual "raw" sync rate at any moment in time.

![image](https://user-images.githubusercontent.com/69477666/231439139-e5b2f98b-43f1-4a9f-8c25-848bab99bc53.png)

_But this isn't so bad..._ Sure it is, let's just zoom in.

![image](https://user-images.githubusercontent.com/69477666/231439896-fa0405f5-92ac-467f-860c-ed9eb1079807.png)

The ETA is calculated at 2-3 times the actual rate. A 3-minute sync is displayed with a 1-minute ETA.

_Does it really matter?_ Yes, it does. The examples here are of syncs lasting a few minutes only. It gets much, much worse for syncs lasting hours.

_Are the new algorithms any better?_ Yes, this is the same graph with the ETA rate calculated by the new algorithm.

![image](https://user-images.githubusercontent.com/69477666/231440989-a13d56ce-b6f4-4802-ab17-3c45049cbc48.png)

And zoomed in...

![image](https://user-images.githubusercontent.com/69477666/231441211-50c9068d-0bbc-465f-83d9-5159193b0ff7.png)

_Ok, so that does look better. But is the code complexity worth it?_ I believe it is (but simpler code achieving the same or better results is even better!). This is a real-world example of raw input data:

![image](https://user-images.githubusercontent.com/69477666/231443424-cc60fb75-141c-445b-976c-f8b038e174fd.png)

Looks like the sync rate is zero most of the time? But it really isn't; look at the scale of the y-axis! Zooming in, we see:

![image](https://user-images.githubusercontent.com/69477666/231443695-85c2eda9-91bb-48d2-8e5e-cd668862abeb.png)

Simpler algorithms are simply unable to smooth out these fluctuations while at the same time remaining capable of responding to the huge changes (from ~1 MB/s to tens and hundreds of MB/s). This is what a simple moving average would look like when zoomed in (the plot lines are thinner to better show detail). This is not good enough for ETA calculation.

![image](https://user-images.githubusercontent.com/69477666/231445230-f644b0e1-8974-4f2a-a645-f50d0bee1ca8.png)

The most obvious downside of smoothing the rate for ETA calculation is that actual average rate changes (not just fluctuations) are also smoothed out. This can be seen by looking at the huge changes in the current rate on the graphs above. The actual rate change looks like a right angle; just look at how long it takes for the average rate to catch up to that change. The real trick is finding the balance between stability and responsiveness. ETA calculation requires _very_ stable input.
More illustrations

uigtk already had a kind of exponential rate smoothing algorithm implemented. But it was not used for calculating the ETA because it was still much too volatile. Here are some comparisons. Zoomed out:

![image](https://user-images.githubusercontent.com/69477666/231447816-c02fce3b-7ae6-4731-968f-4cb8459f52ad.png)

And zoomed in:

![image](https://user-images.githubusercontent.com/69477666/231447943-7e07be88-b704-4e32-8c8d-4cd0b7f0bce3.png)

The zoomed-in section also illustrates well why there is a need for double smoothing.

It's not all fun and games, though. Here, the algorithm just can't produce a stable rate because the volatility in the input data is too high. Note that in this case the old ETA algorithm would possibly produce a better result, right up until the average rate changes around the 225-second mark.

![image](https://user-images.githubusercontent.com/69477666/231449869-a494119c-ad62-4695-b39c-5cfdd4c16942.png)

Here, the old ETA calculation seems to be more stable:

![image](https://user-images.githubusercontent.com/69477666/231452807-fa76c605-9791-4d70-9655-e0f21d13fe9b.png)

Another example of the extreme rate volatility. Note the scale on the y-axis.

![image](https://user-images.githubusercontent.com/69477666/231451170-d79c16db-4098-42fb-845f-d2939a3acce4.png)
![image](https://user-images.githubusercontent.com/69477666/231453533-507be593-5313-4737-a325-ab1920df2b7f.png)

A section of the previous graph zoomed in, comparing the simple moving average (avg1) to the double-smoothed rate (note that the legend has changed):

![image](https://user-images.githubusercontent.com/69477666/231454474-2d566f7e-4098-4f32-8800-e514cf262aaa.png)

There's more

This PR also improves #816 in a very minor way (I didn't see a reason to split it out into a separate PR -- but it can be easily done). The status display in TUI now shows the number of updates in addition to bytes.

Overall, the status display in TUI has become rather busy with this. For short syncs this could be distracting. For longer syncs, I consider the additional info valuable. It now looks like this:

 20%  0/2  (11.6 MiB of 57.1 MiB)   6.1 MiB/s    00:00:05 ETA

("0/2" indicates number of completed and total items). Just recently the status display looked like this:

 20%  00:05 ETA
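For illustration only, here is a hypothetical sketch of how a status line of the new shape could be assembled. This is not the actual TUI code; the function names, field widths, and byte formatting are assumptions.

```ocaml
(* Hypothetical sketch, not the actual TUI code. *)
let fmt_bytes b =
  if b >= 1048576. then Printf.sprintf "%.1f MiB" (b /. 1048576.)
  else Printf.sprintf "%.1f KiB" (b /. 1024.)

(* [eta] is in whole seconds. *)
let status_line ~percent ~done_items ~total_items ~bytes_done ~bytes_total ~rate ~eta =
  Printf.sprintf " %2d%%  %d/%d  (%s of %s)   %s/s    %02d:%02d:%02d ETA"
    percent done_items total_items
    (fmt_bytes bytes_done) (fmt_bytes bytes_total) (fmt_bytes rate)
    (eta / 3600) (eta mod 3600 / 60) (eta mod 60)
```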
acolomb commented 1 year ago

Very nice work indeed, and a big thumbs up for tackling this seemingly unimportant detail, which can however be a real PITA when waiting for a sync to finish before leaving work to catch a train (based on actual experience). And you did a great job measuring, understanding and verifying the situation and your improvements, kudos!

Just a thought I have been pondering for many years, which might serve well here as an inspiration for even better algorithms.

The idea is to analyze three different intervals:

1. From the start up to the current time.
2. The last minute.
3. The last ten seconds.

For each interval, analyze the regularly sampled progress rate, taking the average (mean) value as well as the standard deviation within the interval. Then pick the interval with the lowest standard deviation and use its mean value for ETA calculation.

Whether the standard deviation is the right factor to decide on and how the intervals should be tweaked is of course still to be examined, but this is the general idea. The goal is to switch between the different estimations and rely on short-term rates only when they are currently rather steady, and to fall back to longer periods in case of fluctuations. Actual, longer-lasting changes in progress rate will lead to the short-term recent data being preferred again.
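A rough sketch of that idea follows. This is not existing Unison code: the fixed one-sample-per-second assumption, the window lengths, and the helper names are all made up for illustration.

```ocaml
(* Sketch of the suggested approach: keep recent rate samples, compute mean
   and standard deviation over several window lengths, and use the mean of
   the steadiest window for the ETA. *)

let mean xs =
  List.fold_left (+.) 0. xs /. float_of_int (List.length xs)

let stddev xs =
  let m = mean xs in
  sqrt (mean (List.map (fun x -> (x -. m) ** 2.) xs))

(* [samples] is the history of rate samples, newest first, assumed to be
   taken once per second. *)
let eta_rate samples =
  let take n l =
    let rec go n = function
      | x :: rest when n > 0 -> x :: go (n - 1) rest
      | _ -> []
    in go n l
  in
  let windows =
    [ samples;            (* 1. everything since the start of the sync *)
      take 60 samples;    (* 2. the last minute *)
      take 10 samples ]   (* 3. the last ten seconds *)
  in
  match List.filter (fun w -> w <> []) windows with
  | [] -> 0.
  | w :: ws ->
    (* Pick the window whose rate fluctuates the least and use its mean. *)
    let best =
      List.fold_left (fun acc w -> if stddev w < stddev acc then w else acc) w ws
    in
    mean best
```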

Hope it gives you some fresh ideas to try on the data. Happy to answer any questions if I didn't word it clearly enough.

tleedjarv commented 1 year ago

Funny you should mention it because this is very similar to the concept I used years ago for my very first patch. It was simpler and very hacky, which is why I never even considered upstreaming it. But it sort of worked for me at that time and was "good enough". I never spent the time to make it universal or "correct", as at that time it clearly was not worth the effort.

I don't think I'm going to elaborate on this concept at the moment, but I'm certainly open to working with someone who'd show interest in it. For anyone discouraged by the OCaml language: don't be. I'd be happy to take a proposed algorithm, pseudocode or even (pseudo)code in some other language and (re)write it in OCaml.

What's currently stopping me from experimenting any further is that, so far, I've taken the hard route and actually tested all ideas and changes on real syncs. I wouldn't be surprised if (in fact, I'm pretty certain) I've spent more time and effort doing that than I would have spent coding up a simulation and testing rig. But that's the power of inertia for you.

acolomb commented 1 year ago

Well, if you still have the raw data of your plots available, different algorithms can be tested quite easily, independently of an OCaml implementation. At least for your tested scenarios. Not that I'd have time for experiments right now, but do hold on to the data for some later time, maybe.

tleedjarv commented 1 year ago

Yes, experimenting would be completely independent and I encourage anyone interested in this to give it a shot in any language/environment. I do have the raw data and it's very easy for me to provide a debug build for anyone to collect and save their own raw data.