Improve `reqwests` error display

crepererum commented 1 month ago

Originally reported in https://github.com/apache/arrow-rs/issues/5882#issuecomment-2340037342 :

For #5882 we only report Generic S3 error: error decoding response body. We should try to improve the error message. It seems that the Display impl. is not super helpful and we might wanna use Debug instead, see https://github.com/seanmonstar/reqwest/issues/2373

tustvold commented 1 month ago

One thing to be aware of is exposing sensitive headers, or URL path segments.

We could also provide information on common causes of this error - e.g. network/runtime instability

itsjunetime commented 1 month ago

take

erratic-pattern commented 1 month ago

take

erratic-pattern commented 1 month ago

One thing to be aware of is exposing sensitive headers, or URL path segments. We could also provide information on common causes of this error - e.g. network/runtime instability

The documentation for reqwest::Error only mentions URLs as a concern. Hopefully this means that there are no headers in the output, but I do not know for certain.

There is reqwest::Error::without_url which can be used to scrub the URL from the reqwest error. Either the user could explicitly call this when they have sensitive information in URLs, or we could provide a helper method to accomplish this.

Another option is to have for each object storage API provide its own scrubbing method to eliminate specific sensitive headers from AWS, Azure, etc. I do not know enough about these object storage APIs to make this change.

For now I am just going to change to using the Debug information to see if it provides enough information.

crepererum commented 3 weeks ago

Regarding the argument that @tustvold brought up here: https://github.com/apache/arrow-rs/pull/6518#issuecomment-2400700542 and that basically boils down to "people should walk the error chain":

I think the issue here is that object_store is somewhat inconsistent: for some errors, Display prints the entire chain e1: e2: e3 but for reqwest errors it does not. That also make the proposal to "walk the error chain" somewhat impractical, because if you walk the chain and use Display to print the chain parts, you gonna repeat some errors. So our standpoint is to really tell people that the chain should be walked, then stuff like this

https://github.com/apache/arrow-rs/blob/4adbeee14c40c080f8fca3bd42c0c3856bbb151c/object_store/src/aws/client.rs#L69-L70

is wrong and

#[snafu(display("Error bla bla: {}", source))]

should be

#[snafu(display("Error bla bla"))]

tustvold commented 3 weeks ago

I personally agree that walking the error chain should be the approach, it's the pattern used by the vast majority of other languages, but unfortunately support for this in the ecosystem is limited. If Rust had supported this from day one, perhaps we wouldn't be in this mess, but alas it did not. As it stands I'm honestly not even sure what the consensus is...

Moving to this particular issue, I think we need to surface enough information within the default display implementation to diagnose issues, however, this does not mean we should just expose everything, including what may be sensitive or internal details. My point w.r.t walking the error chain is that in the absence of a clearly articulated error scenario the only option is to either expose everything, which I don't agree with, or instruct people to walk the error chain to get the more detailed information they require.

Perhaps we could reframe this issue as a concrete, what would I like to be in the output that currently is not

crepererum commented 3 weeks ago

Perhaps we could reframe this issue as a concrete, what would I like to be in the output that currently is not

Fine with me. I think #5882 provides a good example. The reqwest error that is stringified only says error decoding response body which is somewhat misleading w/o looking at the error chain. It sounds like the body data is malformed or the data got scrambled, but if you walk the error chain you'll often see it's simply a timeout that occurred while fetching the body bytes. So I wonder if we should wrap reqwest::Error into another error type that tries to extract SOME useful information (it could itself walk the chain and find the underlying IO error for example).

tustvold commented 3 weeks ago

This sounds reasonable to me, it certainly would help users to make more clear that error means that the request was interrupted mid-stream, even if there isn't really anything more that can be said about why.

erratic-pattern commented 3 weeks ago

I feel that cramming more error sources into the Display is a step in the wrong direction. Despite the half-baked edges of Rust error handling making this a tricky decision for library maintainers, following standard practices should lead to a better outcome for both application developers and library developers.

As @crepererum pointed out, right now for a application developer, there is a confusing incompatibility in error messaging between Arrow ecosystem crates and (most?) other third-party crates. Using Display seems to "just work" for getting detailed information for Arrow crates, until the error goes into a third-party crate and then it doesn't.

"So just use an error reporting library, "as the reqwest maintainers suggest. Well, no. Even if you follow "best practice" you're still hindered as an application developer because using any of the available error reporting crates with arrow-rs will produce very redundant and ugly error reports.

To highlight this visually I've made an example repo that creates a simple "application" using object_store with a popular error reporting crate called color_eyre. It attempts to connect to an S3 bucket that doesn't exist and then displays an error.

Here is the output:

Error:
   0: Generic S3 error: Error after 10 retries in 5.723990834s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/latest/api/token)
   1: Error after 10 retries in 5.723990834s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/latest/api/token)
   2: error sending request for url (http://169.254.169.254/latest/api/token)
   3: client error (Connect)
   4: tcp connect error: Host is down (os error 64)
   5: Host is down (os error 64)

Location:
   examples/object_store.rs:11

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Maybe this is tolerable, but it's definitely ugly. You are at least getting as much information about the error chain as possible. Additionally, by cramming more information into the top-level error message, you're helping users in the default case to discover useful errors beyond just "Generic S3 error". So there is an argument to be made that the current state is acceptable, even if not ideal..

I think this discussion is beyond the scope of just "improve reqwests error display" and we should create an issue to discuss a more broad improvement to error messaging with both the default rust panic handler as well as the color_eyre (or similar crate) panic handler. If that sounds reasonable, I can write one up with points made from this discussion.

I would also like to create a similar example repo for DataFusion to bring up the discussion there, since they also use the same approach to error messaging, but I think the discussion needs to happen here primarily since DataFusion is largely required to follow whatever convention arrow-rs takes.

erratic-pattern commented 3 weeks ago

From our (InfluxData's) side, we are most likely going to unroll the cause chain for object_store errors manually.

apache / arrow-rs

Improve `reqwests` error display #6377