IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0
3.05k stars 721 forks source link

[BUG] Mainnet nodes with incoming connections unexpected shutdown with Failure in Data.Map.balanceR (error, called at src/Data/Map/Internal.hs:4157:30 in containers-0.6.5.1-EiES0HFUZ8PBGNrpVjoYRF:Data.Map.Internal) #4826

Closed SmaugPool closed 1 year ago

SmaugPool commented 1 year ago

External External

Area cardano-node exception leading to shutdown

Summary All my public nodes with incoming connections (active block producer and relays) shut down unexpectedly on 2022 January 22 at 00:09:01 UTC following the following fatal exception:

janv. 22 01:09:01 de cardano-node-1.35.4[24590]: {"app":[],"at":"2023-01-22T00:09:01.24Z","data":{"immtip":{"headerHash":"fae2d44113d1fb8b99171cc64983ff5dfb92f1d360f2460e6>
janv. 22 01:09:01 de cardano-node-1.35.4[24590]: cardano-node-1.35.4: ExceptionInLinkedThread "ThreadId 165" Failure in Data.Map.balanceR
janv. 22 01:09:01 de cardano-node-1.35.4[24590]: CallStack (from HasCallStack):
janv. 22 01:09:01 de cardano-node-1.35.4[24590]:   error, called at src/Data/Map/Internal.hs:4157:30 in containers-0.6.5.1-EiES0HFUZ8PBGNrpVjoYRF:Data.Map.Internal

(time in UTC+1 above)

Some other operators confirmed this also happened to some of their nodes.

Steps to reproduce No idea yet.

Expected behavior No exception is expected in cardano-node.

System info (please complete the following information):

Screenshots and attachments Previous logs (UTC+1):

janv. 22 01:09:01 de cardano-node-1.35.4[24590]: {"app":[],"at":"2023-01-22T00:09:01.22Z","data":{"block":"77e81e9305859032fea6a8e79f5936e96723a1620608d2bd0100c2421f5ef667@82779850","kind":"TraceAddBlockEvent.AddBlockValidation.ValidCandidate"},"env":"1.35.4:ebc7b","host":"de","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"24590","sev":"Info","thread":"157"}
janv. 22 01:09:01 de cardano-node-1.35.4[24590]: {"app":[],"at":"2023-01-22T00:09:01.22Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"77e81e9305859032fea6a8e79f5936e96723a1620608d2bd0100c2421f5ef667@82779850"},"env":"1.35.4:ebc7b","host":"de","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"24590","sev":"Notice","thread":"157"}
janv. 22 01:09:01 de cardano-node-1.35.4[24590]: Shutting down..

Additional context All my nodes without incoming connections (just outgoing ones) were not affected.

treeowl commented 1 year ago

You used containers internals all over the place and you're complaining upstream that you got an error message? You should be grateful you didn't just silently get wrong answers. Unless proven otherwise, upstream says "You broke it and you get to keep both pieces." Upstream suggests you improve coverage in your property tests for the functions you wrote. Consider using the sort of machinery containers uses to produce sets with a wide variety of tree shapes—I don't think the Arbitrary instance you get from QuickCheck goes to the trouble, because it's assuming you're not monkeying with internals.

simonmichael commented 1 year ago

I'm not the cardano team and not complaining. :) Just interested in improving Haskell libs and trying to ask useful questions.

treeowl commented 1 year ago

@simonmichael Glad to hear. It's most certainly possible that this is caused by a containers bug, but I was a bit upset to see what seemed to be the assumption that it is under the circumstances.

sakakibaraakio commented 1 year ago

Is the new node version 1.35.5 meant to fix this bug?

Straightpool commented 1 year ago

Is the new node version 1.35.5 meant to fix this bug?

Yes, go update your node to 1.35.5.

daehan-koreapool commented 1 year ago

Could anyone please share which Github merge request fixed this issue? I'd like to understand how it's been fixed so that I can explain it to our Korean community members.

rdlrt commented 1 year ago

I'd like to understand how it's been fixed so that I can explain it to our Korean community members.

Please wait for the blog post that will be released in ~6-8 weeks with details

treeowl commented 1 year ago

@daehan-koreapool

Could anyone please share which Github merge request fixed this issue? I'd like to understand how it's been fixed so that I can explain it to our Korean community members.

Yes, as you (?) pointed out in a now-deleted comment, the fix is in this PR.

The code was for a function called canonicalInsert, whatever that means. It's mostly copied from the standard insert code, but around line 91, it said

      case compare kx ky of
        LT -> balanceL ky y (go f kx x l) r
        GT -> balanceR ky y l (go f kx x r)
        EQ -> if new == zeroC then link2 l r else Bin sy kx new l r

The trouble is in the new == zeroC = True case. Here the element that used to be in the set is deleted. That's problematic because balanceL and balanceR are each designed to balance in one direction (for insert, assuming the only possible balance change was an element inserted in the subtree in question), but this function can either insert or delete. containers used to have an internal balance function that would be appropriate here, but it was removed because it's no longer used there. The Cardano folks could rescue it from the git history to use if desired.

ilap commented 1 year ago

@simonmichael Glad to hear. It's most certainly possible that this is caused by a containers bug, but I was a bit upset to see what seemed to be the assumption that it is under the circumstances.

The cause of the issue was known for some days and fixed, though any ppl from anywhere who have git account and have wrong assumption can raise a non valid issue in any repo.

But reacting like you did (assuming that the mentioned Cardano folks blame you), imo is not really distinguishing you from those who have strong opinion without validation. It's just my 2 cents. In the future do some fact checks first, before you make any firm statements.

erikd commented 1 year ago

@simonmichael This was due to the Cardano devs using a function in containers incorrectly (for performance reasons).

Until a larger portion of the network is using the fixed version (currently its at 21% according to Pooltool) we would like to keep public knowledge of this issue as vague as possible. Once the network is safe we will be as transparent as possible about what happened.

nompelis commented 1 year ago

Until a larger portion of the network is using the fixed version (currently its at 21% according to Pooltool) we would like to keep public knowledge of this issue as vague as possible. Once the network is safe we will be as transparent as possible about what happened.

Why? This is a double-edge sword to attempt to conceal problems, and it will damage the image of the community and the platform. If it is a potential "attack vector", you keep developers "up at night" to fix it, but obfuscating facts will always come back and byte you.

I do not want this to become a controversial topic, and also to fill people's emails with garbage. Just raising a point on trnsparency and principles. Peace out.

Quantumplation commented 1 year ago

I don't think this is an attempt to "conceal" problems; it's IOG exercising their responsible disclosure policy. The plan is to release a detailed post-mortem, but doing so before a significant number of SPOs have upgraded is just adding unnecessary risk that someone malicious will be able to reproduce it.

IOG actually went above and beyond to include members of the community in this effort for the first time, for which they should be applauded. Waiting a few weeks to release a post-mortem is standard security practice, and is going to make very little difference for any benign parties who could benefit from a deep understanding of the attack, but makes a huge difference for any potential malicious actors out there.

woodkm commented 1 year ago

Agreed with Pi.

Just wanted to add my own thought (though it has been covered already). The process by which IOG is following is standard practice in Security. It is a. If you reveal too much too soon, you are making it easier for bad actors to gain the information needed to use to their advantage, prior to the resolution being widely implemented.

Transparency does not mean haphazardly prematurely releasing information that would be a benefit to bad actors, prior to the risk being mitigated to a manageable/reasonable degree.

1) Identify bug/issue/problem. 2) Create and release fix (short term, long term, or both). Depending on the nature, difficulty, and more. 3) Allow the entities involved an appropriate amount of time to implement the resolution, usually based on a percentage coverage and/or high impact system coverage. That sorts of metrics. 4) Release Post Mortem - This can contain why and what happened, how to prevent it immediately, how to prevent long term, any other changes to prevent similar future issues, things like that.

Additionally, if curious, here are the general stage flows of an Incident Response (from my experience in Ransomware Forensic Cases). They vary, in general these are the steps. Preventative (standalone phase) Detection and Analysis Containment Eradication Recovery Post-Mortem

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

erikd commented 1 year ago

This issue has been fixed but will remain open until the problem, its cause and its solution has been documented and made public.

treeowl commented 1 year ago

I've opened https://github.com/nick8325/quickcheck/issues/354 to help address the root cause.

JaredCorduan commented 1 year ago

My sincere thanks goes to everyone who helped resolve this issue. @lehins and @TimSheard were true heroes in resolving it fast. @Quantumplation and @AndrewWestberg were also heroes, it's always a real joy to work with y'all.

The issue was resolved by https://github.com/input-output-hk/cardano-ledger/pull/3343.

@kevinhammond published a brief incident report: https://input-output-hk.github.io/cardano-updates/2023-04-17-ledger

@Quantumplation published a well written and expository postmortem: https://www.314pool.com/post/cardano-post-mortem-1