application-research / autoretrieve

A server to make GraphSync data accessible on IPFS
22 stars 7 forks source link

Documenting "Response Rejected" error history #120

Closed hannahhoward closed 1 year ago

hannahhoward commented 2 years ago

The most common error we get from providers on retrievals reads: "data transfer failed: datatransfer error: response rejected"

This error is generated from go-data-transfer, and returned here: https://github.com/application-research/filclient/blob/master/filclient.go#L1606

The code in go-data-transfer that generates this error is here: https://github.com/filecoin-project/go-data-transfer/blob/6109651443233cd35e05420078db84cd3130044b/impl/events.go#L225

As you can see from the surrounding code, we are processing a data transfer response that may contain a voucher result -- https://github.com/filecoin-project/go-data-transfer/blob/6109651443233cd35e05420078db84cd3130044b/impl/events.go#L212-L242 . If a voucher result is present in the data transfer response, as you can see, a "NewVoucherResult" event is generated before the "Error" event in go-data-transfer (https://github.com/filecoin-project/go-data-transfer/blob/6109651443233cd35e05420078db84cd3130044b/impl/events.go#L213-L222). If this event occurs, a different error should be generated in fil client from the "NewVoucherResult" event -- https://github.com/application-research/filclient/blob/master/filclient.go#L1666-L1677 -- instead of the error we get. In fact, we see some of these other errors in logs.

This suggests that the error occurs when there is a data transfer message sent back with a "NOT" accepted header, indicating the deal was rejected, but no voucher result was included in the message.

There are only a couple code paths in go-fil-markets I am aware of that might cause this error to occur. The go-fil-markets code that decides to accept or reject go-data-transfer requests is here: https://github.com/filecoin-project/go-fil-markets/blob/master/retrievalmarket/impl/requestvalidation/requestvalidation.go

The code paths I can see that generate no voucher result are here-- https://github.com/filecoin-project/go-fil-markets/blob/master/retrievalmarket/impl/requestvalidation/requestvalidation.go#L99-L115 and here -- https://github.com/filecoin-project/go-fil-markets/blob/master/retrievalmarket/impl/requestvalidation/requestvalidation.go#L148-L151, also possibly here -- https://github.com/filecoin-project/go-fil-markets/blob/master/retrievalmarket/impl/requestvalidation/requestvalidation.go#L69

Unfortunately we have little logging on the server side of what happens. My suggestion is to at minimum, start logging when there we're about to error a new graphsync request: https://github.com/filecoin-project/go-data-transfer/blob/master/transport/graphsync/graphsync.go#L643. Probably INFO level log is sufficient, since it's worth knowing but not really a warning or error case. I would log the error and the message you're about to send if it's not nil. Then, I'd see if we can get a provider to run it.

rvagg commented 2 years ago

So there's https://github.com/filecoin-project/go-data-transfer/pull/345 for more information on the SP side, but still requires someone to run it, us to be able to trigger it with them, and them to be able to collect and send us the error.

In the meantime, the most interesting thing I've discovered about this is that I cannot trigger the error myself, with my own peer ID. When I use bedrock-dev's autoretrieve peer ID I can trigger it with some frequency between 20% and 50% of the time using the CID/SP pairs from the autoretrieve event log database where this error occurs. If that's not just some fluke then I suppose this could be:

  1. Something in the networking stack—although I can't yet see any valid path for that to bubble up to a response rejected.
  2. Something to do with the ask processing, in ProviderRequestValidator#acceptDeal—although this should result in a voucher response because there doesn't seem to be any path for the return there to not do that; but perhaps there's a path here that we're missing.
  3. Something weird about the peer ID itself that's making the validation bork, such as an odd sequence of bytes in the ID that makes a state check/save/whatever hiccup. Alternatively maybe there's some state change closer to the networking stack based on peer ID that has an impact higher up the stack when it gets to request validation. This is a pretty wild theory but crazier things have happened and we do mess around with bytes a whole lot.
rvagg commented 2 years ago

https://github.com/application-research/filclient/pull/102 seems to fix this