[test request] tipset assembly in real-world network conditions (monte carlo simulation)

filecoin-project / oni

👹 (DEPRECATED; see README) Project Oni | Network Validation

https://docs.google.com/document/d/16jYL--EWYpJhxT9bakYq7ZBGLQ9SB940Wd1lTDOAbNE


[test request] tipset assembly in real-world network conditions (monte carlo simulation) #14

Open raulk opened 4 years ago

raulk commented 4 years ago

What would you like us to test?

Tipset assembly under real-world network conditions such as delays, packet loss, packet duplication, jitter, and latency between miners.

Technical implementation details.

Possibly long-running tests using some Monte Carlo simulation model. Need to define what metrics to gather from the system as the tests run. I envision this as a long-running loop that deploys, for example, 50 lotus instances.

For each iteration:

decide how many instances participate.
produce the genesis block.
mine, mine, mine.
stochastically vary network conditions continuously.
mine, mine, mine.
capture consensus metrics from all instances.
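The per-iteration steps above could be planned up front with a seeded random generator, so that every overnight run is reproducible. This is only a sketch: the `NetworkConditions` fields, the value ranges, and the function names are assumptions made for illustration, and the code does not deploy or drive any lotus instances.

```python
import random
from dataclasses import dataclass


@dataclass
class NetworkConditions:
    # Hypothetical knobs; the ranges used below are placeholders, not tuned values.
    latency_ms: float
    jitter_ms: float
    loss_pct: float
    duplicate_pct: float


def sample_conditions(rng: random.Random) -> NetworkConditions:
    """Draw one random set of link conditions."""
    return NetworkConditions(
        latency_ms=rng.uniform(0, 500),
        jitter_ms=rng.uniform(0, 100),
        loss_pct=rng.uniform(0, 10),
        duplicate_pct=rng.uniform(0, 5),
    )


def plan_iteration(rng: random.Random, max_instances: int = 50, phases: int = 3) -> dict:
    """Plan one iteration: how many instances participate, and the
    sequence of network conditions to apply between mining phases."""
    return {
        "instances": rng.randint(2, max_instances),
        "phases": [sample_conditions(rng) for _ in range(phases)],
    }


def plan_batch(seed: int, iterations: int) -> list:
    """Plan a whole batch; the same seed always yields the same plan."""
    rng = random.Random(seed)
    return [plan_iteration(rng) for _ in range(iterations)]
```

Recording the seed next to the captured metrics would let a surprising run be replayed with the same parameters.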

What should we measure?

TBD.

Which components are involved?

TBD.

On a scale from 0-10, what's the proposed _discomfort factor_? In other words, how uncomfortable would you be if we went live without having tested this? Explain why.

TBD.

Additional remarks.

TBD.

Requestor: @magik6k.

raulk commented 4 years ago

@yusefnapora and I just discussed what a good charting/graph approach would be. The batch runner allows us to run lots of simulations, one after another, with random parameters. If we run these overnight, we'll wake up to hundreds of network simulations that we need to make sense of quickly by paging through the results.

We came up with this:

simple line chart, with one series per miner.
x axis is test time (natural time); y axis is chain height.
subscribe to head changes on each miner, and record every single event that happens (can also do with the in-memory journal I'm wrapping up in https://github.com/filecoin-project/lotus/pull/2455).
reverts and applies are points on the graph (t, height).
a perfectly behaved scenario (ever increasing applies) would display a steadily increasing line, possibly at 45 degrees (depending on how we scale the axes).
reverts would be charted as (t, height) points. a revert pulls back the head, so the y value would decrease at the same time t.
visually, the pattern we'll see is downward blips. these are indicative of corrective chain reorgs.
another one worth charting somehow is chain weight.
we expect that lagging nodes due to latency, jitter, etc. will show as lagging in the chart as well.
as a subtitle at the top of the chart, indicate if the assertion that all heads are equal at the end of the test is met (green text = yes, red text = no).
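A minimal sketch of how recorded head changes could become the (t, height) points and the end-of-test assertion described above. The `HeadChange` record is a hypothetical shape, not the lotus API type, and the code assumes events arrive in order and that a reverted tipset's parent sits exactly one height below (null rounds would need the parent's real height).

```python
from typing import NamedTuple, Optional


class HeadChange(NamedTuple):
    # Hypothetical record of one head-change event on one miner.
    t: float      # seconds since test start
    kind: str     # "apply" or "revert"
    height: int   # height of the tipset applied or reverted
    tipset: str   # any stable identifier for the tipset


def to_series(events):
    """Turn one miner's events into (t, height) chart points.
    An apply moves the head to `height`; a revert pulls it back one step."""
    points = []
    for ev in events:
        if ev.kind == "apply":
            points.append((ev.t, ev.height))
        else:
            points.append((ev.t, ev.height - 1))
    return points


def final_head(events) -> Optional[str]:
    """Replay applies and reverts as a stack; the top is the final head."""
    stack = []
    for ev in events:
        if ev.kind == "apply":
            stack.append(ev.tipset)
        elif stack:
            stack.pop()
    return stack[-1] if stack else None


def heads_agree(events_by_miner) -> bool:
    """True if every miner ends the test on the same head (green subtitle)."""
    return len({final_head(evs) for evs in events_by_miner.values()}) == 1
```

Each miner's `to_series` output would be one line series on the chart, and `heads_agree` decides the colour of the subtitle.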

yusefnapora commented 4 years ago

@raulk here's what I've got so far:

[Chart: chain height with reverts]

The "effective height" takes revert operations into account, so the little downward blips are reverts followed by applies. I was interested to learn that there are several revert / apply operations for normal fast-forwards, but it makes sense once you see them. If we have a tipset with a single block and get another valid block to include, we revert the single-block tipset and apply the new one. So there's a "revert blip" for every tipset with multiple blocks.
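To make the "revert blip" concrete, here is a small sketch that reproduces the event sequence described above for a single height as blocks arrive one by one. The function name and the tuple-based event shape are assumptions for illustration; this models the behaviour described in this comment and is not taken from lotus code.

```python
def head_changes_for_height(height, block_cids):
    """Simulate the head changes seen while a tipset at `height` grows.

    Each new block after the first reverts the current tipset and applies
    a larger one, so k blocks produce k applies and k - 1 reverts."""
    events = []
    tipset = ()
    for cid in block_cids:
        if tipset:
            # A tipset is already the head: revert it before applying the bigger one.
            events.append(("revert", height, tipset))
        tipset = tipset + (cid,)
        events.append(("apply", height, tipset))
    return events


def count_reverts(events):
    """Number of revert operations in an event list."""
    return sum(1 for kind, _, _ in events if kind == "revert")
```

For example, `head_changes_for_height(7, ["B1", "B2", "B3"])` yields five events, two of which are reverts, while a single-block tipset produces no revert at all.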

Next step is to combine this graph across all the test participants; so far I've just been working with a single miner for simplicity, but it's collecting data from everyone.

After that, we can see how it looks with weird network conditions :)

raulk commented 4 years ago

Nice, this is a great start! 😍 In fact, the 46s mark is showing a slightly different pattern than the rest (the valley is a bit wider). Do the downward blips represent the number of revert operations? I wonder if we can find a way to draw all three: the number of revert operations, the heights that were reverted, and the number of unique block CIDs seen at that height?

Rationale: if we are reverting a tipset with block B1 at height N to replace it with a tipset with B1,B2 at height N, to then replace it with a tipset with B1,B2,B3 at height N => this would be ordinary behaviour. And what we want to capture there is the time it took before we advanced (the width of the valley, as you are showing here).
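One possible way to compute those three quantities per height, plus the width of the valley, from a miner's event log. This is a sketch under assumptions: events are hypothetical `(t, kind, height, block_cids)` tuples in arrival order, and "valley width" is taken to mean the time from the first apply at a height until the first apply at any greater height.

```python
from collections import defaultdict


def summarize_by_height(events):
    """Per height: revert count, unique block CIDs seen, and dwell time
    (first apply at that height -> first apply at a greater height)."""
    stats = defaultdict(
        lambda: {"reverts": 0, "blocks": set(), "first_seen": None, "advanced_at": None}
    )
    for t, kind, height, block_cids in events:
        s = stats[height]
        s["blocks"].update(block_cids)
        if kind == "revert":
            s["reverts"] += 1
            continue
        if s["first_seen"] is None:
            s["first_seen"] = t
        # This apply closes every lower height that has not advanced yet.
        for h, other in stats.items():
            if h < height and other["first_seen"] is not None and other["advanced_at"] is None:
                other["advanced_at"] = t
    summary = {}
    for h, s in stats.items():
        dwell = None
        if s["first_seen"] is not None and s["advanced_at"] is not None:
            dwell = s["advanced_at"] - s["first_seen"]
        summary[h] = {
            "reverts": s["reverts"],
            "unique_blocks": len(s["blocks"]),
            "dwell": dwell,
        }
    return summary
```

Under this model, ordinary tipset growth shows up as reverts at a single height with a rising unique-block count, whereas a fork would presumably show reverts spanning several heights.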

And a fork should look completely different, I guess.

raulk commented 4 years ago

We can definitely get rolling with this, though. Let's start running the batch jobs and collecting the raw data! In parallel, we can fine-tune the visualisations while those jobs are running.

raulk commented 4 years ago

Here's a pretty poor sketch with some further ideas.

line looks like a staircase => helps us grok the length of an epoch at plain sight.
vertical hairline strokes at each height represent a unique block seen at that height => this helps us grasp the delay between blocks received at that height.
if the chain regresses (there's a revert to a tipset at a height lower than our current height), we go back as many steps, chart any new blocks we see at that height, and then continue charting the applies (this would probably get SUPER MESSY, as they happen very quickly and would be super compressed in time...)
such an event happens in the example I pasted above. We advance from block height 5 to 6, then briefly regress to 6 and receive an extra block.
the length of the vertical hairline strokes is the accumulated number of unique blocks seen at that height (you expect it to get longer within an epoch).

^^ take all of this as creative input, not as instructions ;-)
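As one interpretation of the sketch ideas above, the staircase line and the hairline lengths could be derived from plain data before choosing a chart library. Both helpers and their input shapes are assumptions for illustration: `points` is a list of `(t, height)` head positions and `arrivals` is a list of `(t, height, cid)` block sightings, both in time order.

```python
def staircase(points):
    """Expand (t, height) points into a step series: the line holds the
    previous height until time t, then jumps vertically to the new one."""
    steps = []
    for t, height in points:
        if steps and steps[-1][1] != height:
            steps.append((t, steps[-1][1]))  # horizontal run up to t
        steps.append((t, height))            # vertical jump at t
    return steps


def hairlines(arrivals):
    """For each first sighting of a block, emit (t, height, n) where n is
    the accumulated number of unique blocks seen at that height so far."""
    seen = {}
    strokes = []
    for t, height, cid in arrivals:
        blocks = seen.setdefault(height, set())
        if cid not in blocks:  # duplicates do not lengthen the stroke
            blocks.add(cid)
            strokes.append((t, height, len(blocks)))
    return strokes
```

The horizontal runs of `staircase` make the epoch length visible, and the gaps between consecutive strokes at one height show the delay between blocks received there.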

yusefnapora commented 4 years ago

@raulk this is great stuff, thanks :)

Good eye on the different pattern at 46s - that was a tipset with three blocks, so there were two reversions. You can see it if you zoom in a bit:

[Chart: chain height, multiple reverts]

I think something like your stair-step graph is possible with the chart libs I'm using, but I'll need to dig in a bit more.