IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

[BUG] - missing resource isolation to prefer critical tasks over query processing #5984

Open andrejpodzimek opened 2 months ago

andrejpodzimek commented 2 months ago

Internal/External External

Area Other

Summary Leader log queries impede critical validator processing and cause extreme numbers of missed slot leader checks.

Steps to reproduce

  1. Watch the frequency of missed slot leader checks over time.
  2. Run a demanding cardano-cli query in a loop against the validator (example below).
  3. Watch the disaster unfold: in my case, 7% of slot leader checks were missed due to a repeated query.
  4. Repeat the test with a regular relay node. Tip differences will run sky high (>100) while queries are being processed.
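
To put a number on steps 1 and 3, the share of missed checks can be computed from two counters. This is a minimal sketch, assuming the number of missed slot leader checks and the total number of checks have already been read from whatever monitoring is in place. The helper name `missed_pct` is hypothetical and is not part of cardano-node or cardano-cli.

```shell
# Hypothetical helper: print the percentage of missed slot leader checks.
# $1 = number of missed checks, $2 = total number of checks.
# Prints "n/a" when the total is zero or negative, to avoid dividing by zero.
missed_pct() {
  awk -v m="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * m / t
  }'
}
```

Comparing the value before and during the query loop shows how much the queries contribute to the missed checks.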

Expected behavior Proper resource isolation.

System info (please complete the following information):

Screenshots and attachments An example query to expose resource isolation problems:

cardano-cli query leadership-schedule \
  --socket-path /run/cardano-validator/socket \
  --genesis config/mainnet-shelley-genesis.json \
  --mainnet \
  --vrf-signing-key-file keys/mainnet/vrf.skey \
  --stake-pool-id ... \
  --next
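
Step 2 asks for the query to be run in a loop. A minimal sketch of such a loop, assuming a POSIX shell, is given here. The helper name `run_in_loop` and its count and pause parameters are illustrative and not part of cardano-cli.

```shell
# Hypothetical helper: run a command a fixed number of times with a pause
# (in seconds) after each run.
# $1 = number of runs, $2 = pause in seconds, remaining arguments = command.
run_in_loop() {
  count=$1
  pause=$2
  shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    # Keep looping even if a single run fails, but report the failure.
    "$@" || echo "run $i failed" >&2
    i=$((i + 1))
    sleep "$pause"
  done
}
```

For example, `run_in_loop 20 10 cardano-cli query leadership-schedule ...`, with the same arguments as the query above, would issue the query 20 times with a 10 second pause after each run.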

RTS options:

... +RTS -N -A64m -H -Iw59 --nonmoving-gc -RTS ...

Additional context This case could be dismissed with “use a workaround”, i.e. “have a separate relay node used only for slot leader queries” and not for routing to a validator. However, that workaround is suboptimal: it increases the amount of resources a pool operator must set aside by up to 50% compared to the simplest relay + validator setup.

The lack of proper resource isolation may have been a contributing factor to my problem of never successfully validating a block, described in this post and above.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

karknu commented 3 days ago

@andrejpodzimek I've been working on something that may alleviate your problem. It was done for relays serving hundreds of clients but perhaps it could work here too.

https://github.com/IntersectMBO/cardano-node/tree/karknu/thread_isolation is based on 10.1.2, so it will require a chain replay if you're still on 9.2.1. It is experimental, so it is best to test it on your backup BP or on a testnet.