IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

[BUG] - cardano-testnet sometimes hangs indefinitely #5762

Open carbolymer opened 7 months ago

carbolymer commented 7 months ago

Internal/External Internal (IOHK staff member).

Area Other (any other topic: Delegation, Ranking, ...).

Summary The cardano-testnet test suite sometimes hangs indefinitely. It looks as if the nodes are taking longer than expected to produce blocks. It may be related to what @james-iohk described here: https://github.com/IntersectMBO/cardano-node/pull/5679#issuecomment-1959419678

The issue is more visible on slower machines, such as the macOS runner in GHA or Darwin cross-compilation in Hydra.

Steps to reproduce:

  1. Rerun the cardano-testnet test suite multiple times. Some of the tests should either get stuck or fail on a condition check in byDeadlineM.

The issue appears to occur more frequently when running testnet test suites in parallel.

[!NOTE] Testnet tests can be executed in parallel by setting the PARALLEL_TESTNETS=1 environment variable or by passing --test-options '--num-threads 8' to cabal test cardano-testnet (after that PR gets merged).
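A rough way to follow the reproduction step is to rerun the suite under a wall-clock deadline, so that a hang is recorded instead of blocking the loop. The sketch below is only an illustration: the function names `classify_run` and `rerun`, the run count, and the deadline value are hypothetical, and it assumes GNU coreutils `timeout` is on PATH (it is not installed by default on macOS).

```shell
#!/bin/sh
# Sketch: rerun a command several times, each run under a wall-clock deadline,
# and classify every run as pass, fail, or hang.
# Assumption: GNU coreutils `timeout` is available.

# classify_run DEADLINE_SECONDS CMD [ARGS...]
# Prints "pass", "fail", or "hang" for a single run.
classify_run() {
  local deadline="$1"; shift
  timeout "$deadline" "$@" >/dev/null 2>&1
  local status=$?
  if [ "$status" -eq 0 ]; then
    echo pass
  elif [ "$status" -eq 124 ]; then
    # GNU timeout exits with 124 when the deadline expired.
    echo hang
  else
    echo fail
  fi
}

# rerun RUNS DEADLINE_SECONDS CMD [ARGS...]
# Prints one "run N: <result>" line per run.
rerun() {
  local runs="$1" deadline="$2"; shift 2
  local i=1
  while [ "$i" -le "$runs" ]; do
    echo "run $i: $(classify_run "$deadline" "$@")"
    i=$((i + 1))
  done
}

# Example (not executed here). The run count and the 3600 s deadline are
# arbitrary; the cabal invocation is the one quoted in the note:
#   export PARALLEL_TESTNETS=1
#   rerun 10 3600 cabal test cardano-testnet
```

A run reported as `hang` only means the deadline expired; it does not distinguish a frozen testnet from one that is merely slow, so the deadline should be generous on slow runners.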

Sample log of a failure: babbagetransaction.txt (taken from: https://github.com/IntersectMBO/cardano-node/pull/5695/checks?check_run_id=22357754517)

Expected behavior cardano-testnet does not hang; it either retries or reports the failure with a message explaining what happened.
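The expected behavior can be illustrated with a deadline-bounded retry loop that gives up with an explanatory message instead of waiting forever. This is a generic shell sketch, not the byDeadlineM implementation or anything in cardano-testnet; the function name `retry_until_deadline` and its parameters are hypothetical.

```shell
#!/bin/sh
# Sketch: retry a check until it succeeds or a deadline passes, then report
# what was being waited for. Not taken from cardano-testnet.

# retry_until_deadline TIMEOUT_SECONDS INTERVAL_SECONDS DESCRIPTION CMD [ARGS...]
# Returns 0 as soon as CMD succeeds; returns 1 and prints a message to stderr
# once TIMEOUT_SECONDS have elapsed without success.
retry_until_deadline() {
  local timeout_s="$1" interval_s="$2" description="$3"; shift 3
  local start now attempts=0
  start=$(date +%s)
  while :; do
    attempts=$((attempts + 1))
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout_s" ]; then
      echo "deadline of ${timeout_s}s exceeded after ${attempts} attempts: ${description}" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Caveat: the deadline is only checked between attempts. If CMD itself blocks,
# this loop blocks too, so a blocking check needs its own time limit.
```

The deadline is checked only after an attempt returns, so the loop bounds the number of retries but cannot interrupt a check that never returns.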

carbolymer commented 7 months ago

Initially, the use of byDeadlineM was considered the issue here, and it was partially removed in https://github.com/IntersectMBO/cardano-node/pull/5707. However, instead of test failures we started getting cardano-testnet freezes. The suspicion is that the test network is not advancing, i.e. new blocks are not being produced.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

carbolymer commented 2 months ago

Some stability window discussion (internal link): https://docs.google.com/document/d/1B8BNMx8jVWRjYiUBOaI3jfZ7dQNvNTSDODvT5iOuYCU/edit#heading=h.qh2zcajmu6hm

Consensus docs: https://ouroboros-consensus.cardano.intersectmbo.org/docs/for-developers/Glossary#epoch-structure