We should analyze the errors we are seeing in production and update our list of retriable errors to cover all expected transient error scenarios.
We may need to improve the "retriable error" detection so that it considers more than just the error message, for example the response headers or the status code.
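A minimal sketch of what broader detection could look like, assuming the caller can supply the message, the status code, and the response headers. The function name `is_retriable`, the status code set, and the message fragments are hypothetical placeholders; the real values should come from the production error analysis.

```python
from typing import Mapping, Optional

# Hypothetical values for illustration only; replace with the results
# of the production error analysis.
RETRIABLE_STATUS_CODES = frozenset({408, 500, 502, 503, 504})
RETRIABLE_MESSAGE_FRAGMENTS = (
    "timeout",
    "connection reset",
    "temporarily unavailable",
)


def is_retriable(
    message: str,
    status_code: Optional[int] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> bool:
    """Classify an error as retriable using status code, headers, then message."""
    # 1. Status code: the most reliable signal when it is available.
    if status_code in RETRIABLE_STATUS_CODES:
        return True

    # 2. Headers: a Retry-After header is treated here as a hint that the
    #    server expects the client to try again (an assumption of this sketch).
    if headers:
        header_names = {name.lower() for name in headers}
        if "retry-after" in header_names:
            return True

    # 3. Message: fall back to the existing style of substring matching.
    lowered = message.lower()
    return any(fragment in lowered for fragment in RETRIABLE_MESSAGE_FRAGMENTS)
```

Checking the status code first keeps the message matching as a fallback for errors that never produce an HTTP response, such as dropped connections.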
Additional thoughts: we currently handle 418 as a rate-limit error. Perhaps we should also handle 429 (Too Many Requests, the standard HTTP rate-limit status) as a rate-limit error, and/or treat both 418 and 429 as retriable errors.
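One way to treat 418 and 429 uniformly is sketched below: both map to a wait time before the retry. The helper name `rate_limit_delay` and the one-second default are assumptions, and only the integer-seconds form of `Retry-After` is parsed; the HTTP-date form falls back to the default.

```python
from typing import Mapping, Optional

# 418 is what we handle today; 429 is the proposed addition.
RATE_LIMIT_STATUS_CODES = frozenset({418, 429})

# Hypothetical default wait when the server gives no usable hint.
DEFAULT_RATE_LIMIT_DELAY_SECONDS = 1.0


def rate_limit_delay(
    status_code: int,
    headers: Optional[Mapping[str, str]] = None,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if not rate limited."""
    if status_code not in RATE_LIMIT_STATUS_CODES:
        return None

    for name, value in (headers or {}).items():
        if name.lower() == "retry-after":
            try:
                # Retry-After as a whole number of seconds.
                return max(0.0, float(int(value.strip())))
            except ValueError:
                # Retry-After as an HTTP-date is not handled in this sketch.
                break

    return DEFAULT_RATE_LIMIT_DELAY_SECONDS
```

Returning `None` for other status codes lets the caller fall through to the general retriable-error check.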