Closed dconnolly closed 3 years ago
@dconnolly do you have the logs leading up to this panic?
It's hard to tell without a backtrace, but it looks like we're failing on this fail_with call:
    self.fail_with(PeerError::WrongMessage("getdata with mixed item types"));
Not on hand, this was reported via Sentry, I can go digging in the logger
That might be helpful, but I think I've found at least two potential causes already.
This issue wasn't completely fixed by #1600 or #1531: I'm still seeing it on commit 1edb379e (#1620 + #1622).
Duplicate of #1610
Marking this as high, because we're actually seeing the bug happen reasonably frequently.
Shall we close this duplicate? I keep getting distracted by it before realizing we're doing the work in #1610...
@mpguerra would it help to rename the issue?
I'd really like to keep it in the sprint, because it's the underlying bug that we have to fix. We just think the refactor is the best way to fix it, because we've tried little patches before, and they haven't solved all the panics.
As an alternative, we could move the bug tags over to #1610, and mark this issue as a duplicate using GitHub syntax. Which is a bit fussy.
Duplicate of #1610
It's ok, let's keep it then, I think I'm used to seeing it and connecting it to the other one by now!
Just to be clear, it's the GitHub "Duplicate of" syntax that's a bit fussy, not anyone in the team!
Unlike Closes or Fixes, it needs to be in a comment by itself.
We think this issue was caused by continuing to send block or transaction messages after failure:
    Response::Blocks(blocks) => {
        // Generate one block message per block.
        for block in blocks.into_iter() {
            if let Err(e) = self.peer_tx.send(Message::Block(block)).await {
                self.fail_with(e);
            }
        }
    }
This code was rewritten to exit early by #1721.
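The early-exit pattern can be sketched as follows. This is a minimal synchronous sketch, not the code from #1721: `Connection`, `send`, `send_blocks`, the `String` error type, and the simulated failure threshold `fail_after` are all hypothetical stand-ins for Zebra's async peer connection. It also assumes that a second `fail_with` call on a failed connection panics.

```rust
/// Hypothetical stand-in for a peer connection; not Zebra's real type.
#[derive(Debug, Default)]
struct Connection {
    /// Set once by `fail_with`; `Some` means the connection has failed.
    error: Option<String>,
    /// Number of sends attempted, recorded so the behaviour can be checked.
    send_attempts: usize,
    /// Sends start failing once this many have succeeded (simulated peer reset).
    fail_after: usize,
}

impl Connection {
    /// Simulated send: fails once `fail_after` sends have succeeded.
    fn send(&mut self, _block: u32) -> Result<(), String> {
        self.send_attempts += 1;
        if self.send_attempts > self.fail_after {
            Err("Broken pipe".to_string())
        } else {
            Ok(())
        }
    }

    /// Records the first error. Panics on a second call (an assumption
    /// about how the reported panic arises).
    fn fail_with(&mut self, e: String) {
        assert!(self.error.is_none(), "fail_with called on a failed connection");
        self.error = Some(e);
    }

    /// Sends one message per block, stopping at the first failure.
    fn send_blocks(&mut self, blocks: Vec<u32>) {
        for block in blocks {
            if let Err(e) = self.send(block) {
                self.fail_with(e);
                // Exit early: without this, the next failed send would
                // call `fail_with` a second time.
                break;
            }
        }
    }
}

fn main() {
    let mut conn = Connection { fail_after: 1, ..Default::default() };
    conn.send_blocks(vec![10, 20, 30]);
    println!("attempts: {}, error: {:?}", conn.send_attempts, conn.error);
}
```

In this sketch, removing the `break` makes the third send fail as well, so `fail_with` runs twice and the assertion fires. That is the failure mode described above.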
Duplicate
This panic will be fixed by the fail_with refactor - so this ticket is a duplicate of #1610.
Motivation
These panics are related to https://github.com/ZcashFoundation/zebra/issues/1510, which was partly fixed by https://github.com/ZcashFoundation/zebra/pull/1531.
Edit: we knew #1531 fixed some potential causes of #1510, but we weren't sure if we'd found them all
Issue report 1 is from commit 8a7b023. Issue report 2 is from commit 1edb379e (#1620 + #1622). The backtraces don't contain any useful information.
Analysis 2
1. GetHeaders then GetData (get blocks) on the same connection.
2. GetHeaders (or some intervening request) fails with "Connection reset by peer".
3. GetData request fails with a "Broken pipe" error.
When the GetHeaders request fails, Zebra should stop processing any further requests from that peer. Instead, it continues to try to process inbound messages from that peer.
Analysis 1
Possibly resolved in #1600.
This panic can happen in two different ways:
1. drive_peer_request, which doesn't check for a failed connection state before performing tasks that can potentially call fail_with.
2. fail_with
So I suggest that we add the missing failed connection check, downgrade the panic to a warning, and add additional debugging information.
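A rough sketch of that suggestion, under stated assumptions: the `State` enum, the `Connection` fields, and the simulated failure on `GetHeaders` are hypothetical and only illustrate the shape of the change. They are not Zebra's actual types or the eventual fix.

```rust
/// Hypothetical connection state; the real state machine has more variants.
#[derive(Debug, PartialEq)]
enum State {
    AwaitingRequest,
    Failed,
}

/// Hypothetical stand-in for the peer connection.
struct Connection {
    state: State,
    /// The first error reported; later errors are logged and dropped.
    first_error: Option<String>,
    /// Count of ignored repeat failures, kept as extra debugging information.
    ignored_failures: usize,
}

impl Connection {
    fn new() -> Self {
        Connection {
            state: State::AwaitingRequest,
            first_error: None,
            ignored_failures: 0,
        }
    }

    /// Marks the connection as failed. A repeat call warns instead of panicking.
    fn fail_with(&mut self, error: &str) {
        if let Some(original) = &self.first_error {
            self.ignored_failures += 1;
            eprintln!(
                "warning: fail_with on failed connection: new error {:?}, original error {:?}",
                error, original
            );
            return;
        }
        self.first_error = Some(error.to_string());
        self.state = State::Failed;
    }

    /// Handles one inbound request; returns false if the request was skipped.
    fn drive_peer_request(&mut self, request: &str) -> bool {
        // The missing check: do no work once the connection has failed.
        if self.state == State::Failed {
            return false;
        }
        // Simulated I/O failure, for illustration only.
        if request == "GetHeaders" {
            self.fail_with("Connection reset by peer");
        }
        true
    }
}

fn main() {
    let mut conn = Connection::new();
    conn.drive_peer_request("GetHeaders"); // fails the connection
    let handled = conn.drive_peer_request("GetData"); // skipped
    conn.fail_with("Broken pipe"); // warns instead of panicking
    println!("handled: {}, first error: {:?}", handled, conn.first_error);
}
```

Keeping the first error and logging the later one preserves the root cause ("Connection reset by peer") instead of the follow-on "Broken pipe".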
Error 1
Metadata 1
SpanTrace 1
Error 2
Metadata 2
SpanTrace 2
Logs 2