Add progress counter to prometheus output

telackey commented 1 year ago

This adds (estimated) progress counters for the top-level tracked iterators by checking the current it.Path() position vs the total distance of startPath to endPath.

dboreham commented 1 year ago

I think this is about "how good/bad is the estimate, really?". It should be possible to get some pudding proof on this, since @telackey has run the code on mainnet at block height 14M. What did the progress graph look like in that case? (big discontinuities would indicate the estimate is not good).

telackey commented 1 year ago

I think this is about "how good/bad is the estimate, really?"

Yes, and I would be full open to renaming the method "estimatePathDistance" if that would help convey the intended use. Or even "pathBreadth" as the distance is "across" rather than down.

So the goal was to provide some sort of indicator between "started" and "done", especially on a process which takes well over 24 hours. We do have some counters on the number of different node types processed, but without goalposts, those don't communicate very much info other than that the process if still doing stuff.

With the tracked iterators, we get bounded start/end paths that usually look something like:

:04
0400:08
0800:0c
0c00:

The idea at hand is that if the tree is reasonably balanced, we should be able to gauge the current iterator position, and judge how far along it is. In the simplest case of a single iterator, the range is 0x0:0xf, so if it is currently at path 0x8, you can estimate that it is 50% done. Is the necessarily true? No, because the tree may not be balanced, but it it is a reasonable guess.

Extrapolating it to the slightly more complicated example with 4 iterators above, I should be able to say that for an iterator processing the range 0400:08 if the current it.Path() is 06, that it is is also about 50% done.

The wrinkle that makes the latter comparison tricky is that the paths as we observe them don't all have the same depth (eg, 0400 vs 06), so the comparison cannot always be direct. But we can figure out the maximum number of iterations at the top-level that it would take to get from our start path (0400) to our end path (08[00]), which is 4x16 = 64, and also the maximum number if iterations it would have taken to get to our current position 0400 to 06[00] (2x16 = 32), and gauge accordingly.

dboreham commented 1 year ago

@telackey if you plot the graph of "indicated %done" vs "% of eventual elapsed time" for your 14M mainnet snapshot, is it a straight line through the origin? If not how much does it deviate from that ideal line?

telackey commented 1 year ago

I refactored the code a bit it to call it estimateSteps, as I thought that might better convey what it was doing.

I also found a bug while double checking some output to answer @dboreham 's question (basically under certain circumstances it could switch comparison "depths" for the last worker, eg from 0c00:nil to 0c:nil, and so give an apparent, but illusory, sudden leap in progress). The simplest fix for that was to make the desired comparison "depth" explicit, which also has the benefit of making the logic clearer in general.

For the question of progress, unfortunately I didn't have the elapsed time output on the last 14.5M run I did, and it would take a couple of days to run that one again, but I did some runs at block heights of 1M, 2M, and 5M. At 1M the progress was a bit jumpy (eg, more variation between the workers, and occasional sudden leaps), but at 5M it was fairly steady. At what was the actual half-way point (135 minutes) it averaged ~67% overall but a few of the workers were far back (~20%). Worker 8 showed the most "jumpy" behavior, quickly moving from 25% to 94% over the course of an hour, and worker 2 the most sluggish. It actually took the longest, taking another 135 minutes to move from 62% to 100%.

Overall though I would say it accomplishes the "reasonable estimate" goal. After it had been running for half an hour, I looked at the progress and guessed it would take it 5h. It took 4:29:15.

telackey commented 1 year ago

By the way, I made those changes before the last two comments by Ian and David, I just wanted to update the conversation.

cerc-io / ipld-eth-state-snapshot

Add progress counter to prometheus output #76