IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0
3.06k stars 720 forks source link

analysis failure: `LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069` #3986

Open deepfire opened 2 years ago

deepfire commented 2 years ago
processing ../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/logs-node-9.flt.json
{ "tip": "a1fad4dfb8fc82410026f8edd7008af0c1e32b50837689e2dd378c3a14d0f65d" }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain-raw.txt" }
{ "filtered-chain-slot-domain": {"ddFilteredCount":39953,"ddFilteredFirst":16026,"ddFilteredLast":55979,"ddRawCount":65619,"ddRawFirst":865,"ddRawLast":66484} }
{ "filtered-chain-block-domain": {"ddFilteredCount":1815,"ddFilteredFirst":739,"ddFilteredLast":2554,"ddRawCount":3074,"ddRawFirst":0,"ddRawLast":3073} }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain.txt" }

locli: LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069
CallStack (from HasCallStack):
  error, called at src/Cardano/Analysis/MachPerf.hs:343:10 in locli-1.29-inplace:Cardano.Analysis.MachPerf
real    9m8.705s
deepfire commented 2 years ago

run 2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs on AWS

deepfire commented 2 years ago
  1. point of error: https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L324
deepfire commented 2 years ago

~The logic of error here is that we're seeing a LOTraceLeadershipDecided before LOTraceStartLeadershipCheck -- which indeed is weird, if true and if locli didn't trip over itself in some way.~

I've misread the basic message, see down below.

deepfire commented 2 years ago

Whether it's true or not, can be determined from the source of truth -- the logs from the particular node -- however the error message is unhepfully not mentioning the particular node -- so we'll need to fix the error reporting here.

deepfire commented 2 years ago

Ok, so we've ran the new locli on an old run, which we're 100% sure has all log messages proper -- and we've got the same failure.

So it is a regression in legacy trace handling.

deepfire commented 2 years ago

Ok, giving --dump-logobjects to wb a will give us insight on how locli views the stream of logobjects:

wb --cls a --filters size-small --dump-logobjects std 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null

deepfire commented 2 years ago

Ok, so we know it's the performance-analysis portion of the analysis that fails (aka MachPerf in haskell speak).

Next thing -- iterating over all the gigabytes of logs is slow, and we know the host, whose logs trigger this, so we now have a new subcommand -- performance-single-host (or perf-single for short):

wb --cls a --filters size-small perf-single node-0 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null

deepfire commented 2 years ago

What does the error message say?

LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069

What does it mean, on the most basic level?

It means, that we've encountered a LOTraceLeadershipDecided with a slot number that is behind the analysis tip (cur)).

Hmm, so the handler for LOTraceLeadershipDecided expects to be operating on tip always. Is that a reasonable assumption?

No, it's not -- there can be things that legitimately progress analysis -- and break this assumption:

  1. https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L247
  2. https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L257

Of this, the second one is not the culprit -- LOTraceStartLeadershipCheck always happens before LOTraceLeadershipDecided for any given slot.

Hence, it must be the first one, for LOResources -- and those are indeed asynchronous -- there is no ordering wrt any other messages.

deepfire commented 2 years ago

Ok, this fiasco with node-51 had another consequence -- crazy stats:


$ jq -C '.sResourceCDFs | .[] | select(objects) | select(.cdfAverage > 100000) | { cdfAverage: .cdfAverage, loc: input_filename }' logs-node-*.flt.perf-stats.json
{
  "cdfAverage": 261244935.45651916,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145198.9159274,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145168.1841264,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145153.52692,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429010.34,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429033.2,
  "loc": "logs-node-51.flt.perf-stats.json"
}
``
deepfire commented 2 years ago

The easy solution is to just exclude node-51 from machine performance analysis, only keeping its BlockProp contribution -- without which we can't reconstruct the blockchain.

This would need some analyse.sh extension -- since the same data files are fed into the same processing anaconda..