Unexpected error messages after upgrading from 0.17.0 to 0.21.0

bajtos commented 1 year ago

While upgrading rusty-lassie from Lassie 0.17.0 to 0.21.0 (see https://github.com/filecoin-station/rusty-lassie/pull/68), I noticed new error messages in the test output.

These messages seem to be printed only when the retrieval is aborted via "max blocks" or "global timeout" feature.

Is this a known issue? How can I get rid of these messages?

The messages are repeated many times in my test output; I am posting one example for each. I also changed the formatting to make things easier to read.

2023-10-24T11:04:22.574+0200    ERROR   bitswap
providerquerymanager/providerquerymanager.go:344    
Received provider (12D3KooWNy2SmgY384B9EHVHdbSkvXEADwxjzd4ejXQqicxWpxxT) for cid (Qme3KPUshfASLi2kn78dr69hsNsW6DnpnFHnbBHyECiTjp) not requested

2023-10-24T11:04:23.191+0200    ERROR   lassie/retriever    
retriever/parallelpeerretriever.go:277  
Error recording retrieval failure for protocol transport-ipfs-gateway-http: internal error, no such active retrieval 80e3c2b4-0797-4bfc-bc72-80ee3df5fb90

cc @hannahhoward @rvagg

rvagg commented 1 year ago

Thanks for the log output @bajtos; this is likely related to some context handling done across all the parallel workloads where cancellation needs to interrupt it all and there's been a few changes in that version range that adjust the timing subtly such that these errors may pop up. I haven't seen these particular ones but I have been bothered by some odd "context cancelled" log messages that shouldn't be seen!

What I can say is that neither of these errors is fatal, they're more of the type of "I probably shouldn't get in this state". They're annoying and they point to something that should be addressed, but I'm pretty confident they're just related to the retrieval bail state - the timing of "shut it all down" not hitting all the right things in the right order, but that doesn't indicate clean-up problems.

The first error is bitswap - we got a peer from the indexer and fed it in to bitswap but at the same time we also told bitswap to stop bothering with this CID.

The second error is from the SP scoring mechanism—keeping track of how SP's perform so we can make better choices about prioritising. "you told me a candidate failed a retrieval, but you already told me that retrieval had ended".

I'll look in to it and see if I can replicate it these.

bajtos commented 1 year ago

Thank you, @rvagg, for the detailed explanation! ❤️

I also assumed these errors were not fatal, so I am not worried. I agree with you these logs are annoying; that's why I opened this issue.

I can reliably reproduce this issue on my machine when I run cargo test in https://github.com/filecoin-station/rusty-lassie/commit/5b0f37c8ae88ed4df2d0c64f5a68e96e52f4f20a

❯ cargo version
cargo 1.72.0 (103a7ff2e 2023-08-15)
❯ rustc --version
rustc 1.72.0 (5680fa18f 2023-08-23)
❯ go version
go version go1.20.4 darwin/arm64

rvagg commented 1 year ago

(a note for myself that I started tinkering with test cases to try and capture this in my local branch rvagg/max-block-limit-probs-tmp, didn't make enough progress to share anything yet though, just don't want to forget what that branch is for!)

filecoin-project / lassie

Unexpected error messages after upgrading from 0.17.0 to 0.21.0 #454