ZcashFoundation / zebra

Zcash - Financial Privacy in Rust 🦓
https://zfnd.org/zebra/
Apache License 2.0

Panic: called fail_with on already-failed connection state, from msg=GetData #1599

Closed by dconnolly 3 years ago

dconnolly commented 3 years ago

Duplicate

This panic will be fixed by the fail_with refactor - so this ticket is a duplicate of #1610.

Motivation

These panics are related to https://github.com/ZcashFoundation/zebra/issues/1510 ~which should have been fixed with~ which was partly fixed by https://github.com/ZcashFoundation/zebra/pull/1531.

Edit: we knew #1531 fixed some potential causes of #1510, but we weren't sure if we'd found them all

Issue report 1 is from commit 8a7b023. Issue report 2 is from commit 1edb379e (#1620 + #1622). The backtraces don't contain any useful information.

Analysis 2

  1. A peer calls GetHeaders then GetData (get blocks) on the same connection.
  2. GetHeaders (or some intervening request) fails with "Connection reset by peer".
  3. Zebra panics because processing the GetData request fails with a "Broken pipe" error.

When the GetHeaders request fails, Zebra should stop processing any further requests from that peer. Instead, it continues to try to process inbound messages from that peer.
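The intended behavior can be sketched as follows. This is a hypothetical, simplified stand-in (the `ConnState` enum and `handle_inbound` function are illustrations, not Zebra's actual connection code): once one request fails the connection, later inbound requests on that connection are skipped, so a second error never reaches `fail_with`.

```rust
// Hypothetical sketch, not Zebra's actual types: once the connection has
// failed, remaining inbound requests must be skipped instead of processed.

#[derive(Debug, PartialEq)]
enum ConnState {
    Ready,
    Failed,
}

// Process a sequence of inbound requests, stopping at the first failure.
// "GetHeaders" simulates the request that hits "Connection reset by peer".
fn handle_inbound(requests: &[&str]) -> Vec<String> {
    let mut state = ConnState::Ready;
    let mut handled = Vec::new();
    for req in requests {
        // Check the connection state before touching each request.
        if state == ConnState::Failed {
            break;
        }
        if *req == "GetHeaders" {
            state = ConnState::Failed; // simulated connection reset
        } else {
            handled.push(req.to_string());
        }
    }
    handled
}

fn main() {
    // GetData arrives after GetHeaders has failed, so it is never processed.
    let handled = handle_inbound(&["Ping", "GetHeaders", "GetData"]);
    assert_eq!(handled, vec!["Ping"]);
    println!("requests handled before failure: {}", handled.len());
}
```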

Analysis 1

Possibly resolved in #1600.

This panic can happen in two different ways:

  1. The inbound service calls drive_peer_request, which doesn't check for a failed connection state before performing tasks that can potentially call fail_with.
  2. The connection code calls async functions after checking for a failed connection state, so it's possible for the connection to fail between the check and a subsequent call to fail_with.

So I suggest that we add the missing failed connection check, downgrade the panic to a warning, and add additional debugging information.
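As a rough illustration of the "downgrade the panic to a warning" part of that suggestion, here is a minimal sketch. The `State` enum, `Connection` struct, and `fail_with` signature are simplified stand-ins, not Zebra's actual implementation: a second failure logs both errors for debugging instead of panicking, and the original error is kept.

```rust
// Hypothetical sketch: a connection state that tolerates repeated failures.

#[derive(Debug, PartialEq)]
enum State {
    Ready,
    Failed(String),
}

struct Connection {
    state: State,
}

impl Connection {
    /// Record a failure, but only warn (rather than panic) if the
    /// connection has already failed.
    fn fail_with(&mut self, error: &str) {
        match &self.state {
            State::Failed(original) => {
                // Previously a panic; warning here keeps the extra
                // debugging information without crashing the process.
                eprintln!(
                    "warning: fail_with on already-failed connection; \
                     original: {original}, new: {error}"
                );
            }
            State::Ready => self.state = State::Failed(error.to_string()),
        }
    }
}

fn main() {
    let mut conn = Connection { state: State::Ready };
    conn.fail_with("Connection reset by peer");
    conn.fail_with("Broken pipe"); // warns instead of panicking
    // The first error is retained as the connection's failure cause.
    assert_eq!(conn.state, State::Failed("Connection reset by peer".into()));
}
```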

Error 1

called fail_with on already-failed connection state

Metadata 1

| key | value |
| --- | --- |
| version | 1.0.0-alpha.0 |
| git commit | 8a7b023 |
| Zcash network | Mainnet |
| location | /zebra/zebra-network/src/peer/connection.rs:515:13 |

SpanTrace 1

```
SpanTrace:
 0: zebra_network::peer::connection::msg_as_req with msg=getdata
      at zebra-network/src/peer/connection.rs:728
 1: zebra_network::peer::handshake::peer with addr=218.8.47.71:8233
      at zebra-network/src/peer/handshake.rs:198
 2: zebrad::application:: with zebrad="8a7b023" net="Main"
      at zebrad/src/application.rs:254
```

Error 2

```
calling fail_with on already-failed connection state:
failed connections must stop processing pending requests and responses, then close the connection.
state: Failed
original error: SharedPeerError(Serialization(Io(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })))
new error: SharedPeerError(Serialization(Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })))
client receiver: ClientRequestReceiver { inner: Receiver { inner: Some(BoundedInner { buffer: 0, state: 0, message_queue: Queue { head: 0x7fdcd4830060, tail: UnsafeCell }, parked_queue: Queue { head: 0x7fdcb41f21f0, tail: UnsafeCell }, num_senders: 2, recv_task: AtomicWaker }) } }
```

Metadata 2

| key | value |
| --- | --- |
| version | 1.0.0-alpha.1 |
| git commit | 1edb379e |
| Zcash network | Mainnet |
| location | /home/dev/zebra/zebra-network/src/peer/connection.rs:535:13 |

SpanTrace 2

```
SpanTrace:
 0: zebra_network::peer::connection::msg_as_req with msg=getdata
      at zebra-network/src/peer/connection.rs:762
 1: zebra_network::peer::handshake::peer with addr=163.172.222.30:8233
      at zebra-network/src/peer/handshake.rs:198
 2: zebrad::application:: with zebrad="1edb379e" net="Main"
      at zebrad/src/application.rs:254
```

Logs 2

```
Jan 30 07:09:34.078 INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips: zebra_state::service: pruned utxo requests old_len=4 new_len=0 prune_count=4 tip=Some((Height(1129116), block::Hash("0000000002121f23ad2a4fa88aa1c700e0d0535354d0e8d477f92e454d5fdf81")))
Jan 30 07:09:34.078 INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1129116) min_locator_height=1129017 locators=[Height(1129116), Height(1129115), Height(1129114), Height(1129112), Height(1129108), Height(1129100), Height(1129084), Height(1129052), Height(1129017)]
Jan 30 07:09:34.330 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: exhausted prospective tip set
Jan 30 07:09:34.330 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Jan 30 07:10:35.331 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: starting sync, obtaining new tips
Jan 30 07:10:35.332 INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1129116) min_locator_height=1129017 locators=[Height(1129116), Height(1129115), Height(1129114), Height(1129112), Height(1129108), Height(1129100), Height(1129084), Height(1129052), Height(1129017)]
Jan 30 07:10:41.332 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: exhausted prospective tip set
Jan 30 07:10:41.332 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Jan 30 07:11:42.333 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: starting sync, obtaining new tips
Jan 30 07:11:42.334 INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1129116) min_locator_height=1129017 locators=[Height(1129116), Height(1129115), Height(1129114), Height(1129112), Height(1129108), Height(1129100), Height(1129084), Height(1129052), Height(1129017)]
Jan 30 07:11:48.335 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=3 lookahead_limit=2000
Jan 30 07:11:54.337 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: exhausted prospective tip set
Jan 30 07:11:54.337 INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Jan 30 07:12:19.657 INFO {zebrad="1edb379e" net="Main"}:peer{addr=163.172.222.30:8233}:msg_as_req{msg=getheaders}: zebra_state::service: pruned utxo requests old_len=7 new_len=5 prune_count=2 tip=Some((Height(1129116), block::Hash("0000000002121f23ad2a4fa88aa1c700e0d0535354d0e8d477f92e454d5fdf81")))
Jan 30 07:12:19.658 INFO {zebrad="1edb379e" net="Main"}:peer{addr=163.172.222.30:8233}:msg_as_req{msg=getheaders}:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(1129116) response_len=20 chain_tip_height=Height(1129116) stop_height=None intersection_height=Some(Height(1129096))
The application panicked (crashed).
Message: calling fail_with on already-failed connection state
...
```
teor2345 commented 3 years ago

@dconnolly do you have the logs leading up to this panic?

teor2345 commented 3 years ago

It's hard to tell without a backtrace, but it looks like we're failing on this fail_with call:

self.fail_with(PeerError::WrongMessage("getdata with mixed item types"));
dconnolly commented 3 years ago

> @dconnolly do you have the logs leading up to this panic?

Not on hand, this was reported via Sentry, I can go digging in the logger

teor2345 commented 3 years ago

> @dconnolly do you have the logs leading up to this panic?
>
> Not on hand, this was reported via Sentry, I can go digging in the logger

That might be helpful, but I think I've found at least two potential causes already.

teor2345 commented 3 years ago

This issue wasn't completely fixed by #1600 or #1531; I'm still seeing it on commit 1edb379e (#1620 + #1622).

teor2345 commented 3 years ago

Duplicate of #1610

teor2345 commented 3 years ago

Marking this as high, because we're actually seeing the bug happen reasonably frequently.

mpguerra commented 3 years ago

Shall we close this duplicate? I keep getting distracted by it before realizing we're doing the work in #1610...

teor2345 commented 3 years ago

@mpguerra would it help to rename the issue?

I'd really like to keep it in the sprint, because it's the underlying bug that we have to fix. We just think the refactor is the best way to fix it, because we've tried little patches before, and they haven't solved all the panics.

As an alternative, we could move the bug tags over to #1610, and mark this issue as a duplicate using GitHub syntax. Which is a bit fussy.

teor2345 commented 3 years ago

Duplicate of #1610

mpguerra commented 3 years ago

> @mpguerra would it help to rename the issue?
>
> I'd really like to keep it in the sprint, because it's the underlying bug that we have to fix. We just think the refactor is the best way to fix it, because we've tried little patches before, and they haven't solved all the panics.
>
> As an alternative, we could move the bug tags over to #1610, and mark this issue as a duplicate using GitHub syntax. Which is a bit fussy.

It's ok, let's keep it then, I think I'm used to seeing it and connecting it to the other one by now!

teor2345 commented 3 years ago

Just to be clear, it's the GitHub "Duplicate of" syntax that's a bit fussy, not anyone in the team!

Unlike Closes or Fixes, it needs to be in a comment by itself.

teor2345 commented 3 years ago

We think this issue was caused by continuing to send block or transaction messages after failure:

            Response::Blocks(blocks) => {
                // Generate one block message per block.
                for block in blocks.into_iter() {
                    if let Err(e) = self.peer_tx.send(Message::Block(block)).await {
                        self.fail_with(e);
                    }
                }
            }

https://github.com/ZcashFoundation/zebra/pull/1721/files/5a64d988e03f799e7b3495cbc00409799016c85f#r580708590

This code was rewritten to exit early by #1721.
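The early-exit idea can be sketched like this. It is a hypothetical stand-in, not the actual #1721 change: `send_blocks` and `send_block` replace `self.peer_tx.send` and `fail_with`, and the first send error propagates out of the loop, so the failure path is reached at most once per connection.

```rust
// Hypothetical sketch of the early-exit rewrite: stop sending as soon as
// one send fails, instead of reporting a failure once per remaining block.

// Stand-in for `self.peer_tx.send(Message::Block(..))`: fails on one block.
fn send_block(block: u32, fail_at: u32) -> Result<(), String> {
    if block == fail_at {
        Err(format!("Broken pipe while sending block {block}"))
    } else {
        Ok(())
    }
}

// Send blocks in order, returning on the first error rather than
// continuing the loop. Returns how many blocks were sent on success.
fn send_blocks(blocks: Vec<u32>, fail_at: u32) -> Result<usize, String> {
    let mut sent = 0;
    for block in blocks {
        send_block(block, fail_at)?; // early exit on failure
        sent += 1;
    }
    Ok(sent)
}

fn main() {
    // The third block fails; blocks after it are never attempted.
    let result = send_blocks(vec![1, 2, 3, 4], 3);
    assert_eq!(result, Err("Broken pipe while sending block 3".to_string()));
    println!("stopped at the first send failure");
}
```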

teor2345 commented 3 years ago

#1721 was reverted by #1803, and we did a much smaller fix in #1818.