deepfire opened 2 years ago
run 2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs
on AWS
~~The logic of the error here is that we're seeing a LOTraceLeadershipDecided before LOTraceStartLeadershipCheck -- which would indeed be weird, if true and if locli didn't trip over itself in some way.~~
I misread the basic message, see down below.
Whether it's true or not can be determined from the source of truth -- the logs from the particular node. However, the error message unhelpfully doesn't mention which node, so we'll need to fix the error reporting here.
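Until the reporting is fixed, a quick way to locate the culprit is to search the per-node logs for the offending message kind. A hedged sketch, with synthetic files standing in for the real log layout (paths and message text here are assumptions, not locli's actual schema):

```shell
# Assumed layout: one JSON log file per node. Create a synthetic pair
# of logs so the grep below is demonstrable end-to-end.
mkdir -p /tmp/run/logs-node-0 /tmp/run/logs-node-51
echo '{"msg":"TraceStartLeadershipCheck","slot":4068}' > /tmp/run/logs-node-0/node.json
echo '{"msg":"TraceLeadershipDecided","slot":4068}'   > /tmp/run/logs-node-51/node.json

# -l prints only the names of files containing a match, i.e. which
# node actually emitted the offending message.
grep -l 'TraceLeadershipDecided' /tmp/run/logs-node-*/node.json
```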
Ok, so we've run the new locli on an old run, which we're 100% sure has all log messages proper -- and we got the same failure.
So it is a regression in legacy trace handling.
Ok, passing --dump-logobjects to wb a will give us insight into how locli views the stream of logobjects:
```
wb --cls a --filters size-small --dump-logobjects std 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null
```
Ok, so we know it's the performance-analysis portion of the analysis that fails (aka MachPerf in Haskell speak).
Next thing -- iterating over all the gigabytes of logs is slow, and we know which host's logs trigger this, so we now have a new subcommand -- performance-single-host (or perf-single for short):
```
wb --cls a --filters size-small perf-single node-0 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null
```
What does the error message say?
```
LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069
```
What does it mean, on the most basic level?
It means that we've encountered a LOTraceLeadershipDecided with a slot number that is behind the analysis tip (cur).
Hmm, so the handler for LOTraceLeadershipDecided
expects to be operating on tip always.
Is that a reasonable assumption?
No, it's not -- there are things that can legitimately progress the analysis tip and thereby break this assumption:

1. LOResources messages, and
2. LOTraceStartLeadershipCheck messages.
Of these, the second one is not the culprit -- LOTraceStartLeadershipCheck always happens before LOTraceLeadershipDecided for any given slot.
Hence, it must be the first one -- LOResources -- and those are indeed asynchronous: there is no ordering with respect to any other messages.
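The broken invariant can be reproduced on a toy stream. A sketch under assumed message shapes (the field names kind/slot are mine, not locli's schema): an asynchronous resources-style message moves the tip past a slot whose leadership decision hasn't arrived yet, so the decision then looks "noncurrent":

```shell
# Synthetic log-object stream (shape assumed): an async Resources
# message lands between slot 4068's StartLeadershipCheck and its
# LeadershipDecided, carrying information from slot 4069.
cat > /tmp/stream.json <<'EOF'
{"kind":"TraceStartLeadershipCheck","slot":4068}
{"kind":"Resources","slot":4069}
{"kind":"TraceLeadershipDecided","slot":4068}
EOF

# Replay the stream, tracking the tip as the max slot seen; flag any
# LeadershipDecided whose slot is behind the tip -- exactly the
# "LeadDecided for noncurrent slot" failure.
jq -rs 'reduce .[] as $m ({tip:0, bad:[]};
          (if $m.slot > .tip then .tip = $m.slot else . end)
          | (if $m.kind == "TraceLeadershipDecided" and $m.slot < .tip
             then .bad += [$m.slot] else . end))
        | .bad[0]' /tmp/stream.json
```

Reordering just the Resources message to after the decision makes the flag disappear, which is why a tip-only assumption holds for StartLeadershipCheck but not for asynchronous messages.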
Ok, this fiasco with node-51
had another consequence -- crazy stats:
```
$ jq -C '.sResourceCDFs | .[] | select(objects) | select(.cdfAverage > 100000) | { cdfAverage: .cdfAverage, loc: input_filename }' logs-node-*.flt.perf-stats.json
{
  "cdfAverage": 261244935.45651916,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145198.9159274,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145168.1841264,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145153.52692,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429010.34,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429033.2,
  "loc": "logs-node-51.flt.perf-stats.json"
}
```
The easy solution is to just exclude node-51 from machine-performance analysis, keeping only its BlockProp contribution -- without which we can't reconstruct the blockchain.
This would need some analyse.sh extension -- since the same data files are fed into the same processing anaconda...
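As a sketch only -- analyse.sh has no such knob yet, and the variable names here are hypothetical -- the exclusion could amount to filtering the host list fed to the MachPerf step while leaving the BlockProp host list untouched:

```shell
# Hypothetical sketch: drop a known-bad host from the per-machine
# performance step, while still feeding all hosts to block propagation.
all_hosts="node-0 node-1 node-51"
exclude="node-51"              # host whose resource stats are garbage

perf_hosts=""
for h in $all_hosts; do
  [ "$h" = "$exclude" ] || perf_hosts="$perf_hosts $h"
done

echo "MachPerf hosts:$perf_hosts"   # node-51 excluded
echo "BlockProp hosts: $all_hosts"  # node-51 kept for chain reconstruction
```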