SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Retry S3 GET on some error codes #336

Open jpswinski opened 1 year ago

jpswinski commented 1 year ago

Every once in a while an S3 GET request will fail with an error code, and subsequent requests to the same object will succeed.

Currently the code handles timeouts and partial responses and will retry the request, but if S3 returns an HTTP error code, it will fail outright and not try again.

Consider branching on the error code rather than treating every HTTP error the same way. For instance, failing outright on a 404 would be fine, but a 500 probably merits a retry.
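A minimal sketch of that branching, assuming a small set of retryable status codes. The names `isRetryableStatus` and `getWithRetry`, the list of codes, and the attempt limit are all hypothetical and are not taken from S3CurlIODriver.cpp:

```cpp
#include <functional>

// Hypothetical helper: decide whether an HTTP status is worth retrying.
// The choice of codes is an assumption, not the driver's current behavior.
inline bool isRetryableStatus(long http_code)
{
    switch (http_code) {
        case 500: // Internal Server Error
        case 502: // Bad Gateway
        case 503: // Service Unavailable
        case 504: // Gateway Timeout
            return true;
        default:
            return false; // e.g. 403 or 404: a retry is unlikely to help
    }
}

// Hypothetical retry loop. `attempt` performs one GET and returns the HTTP
// status it received; the loop returns the last status seen.
inline long getWithRetry(const std::function<long()>& attempt, int max_attempts)
{
    long status = 0;
    for (int i = 0; i < max_attempts; i++) {
        status = attempt();
        if (status >= 200 && status < 300) break; // success, stop
        if (!isRetryableStatus(status)) break;    // permanent failure, stop
    }
    return status;
}
```

With this shape, a 500 followed by a 200 costs two attempts and succeeds, while a 404 fails after a single attempt. A real implementation would probably also want a delay between attempts.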

https://github.com/ICESat2-SlideRule/sliderule/blob/38764aaf2c948eccd14b40094eadad520d430961/packages/aws/S3CurlIODriver.cpp#L444-L464

The info.index will likely need to be set back to 0 on a failure. I'm not sure whether the headers would also need to be reset, or what the effect of reinitializing the curl structure would be.
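To illustrate why the index matters, here is a sketch under stated assumptions. `FixedBufferInfo`, `writeChunk`, and `resetForRetry` are hypothetical stand-ins; only the field name `index` comes from the text above. If a failed response wrote any bytes (an error body, say), a retry that does not reset the index would append the good data after them:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the driver's per-request write state.
struct FixedBufferInfo {
    char*       buffer; // destination supplied by the caller
    std::size_t size;   // capacity of buffer in bytes
    std::size_t index;  // bytes written so far
};

// Same parameter shape as a libcurl CURLOPT_WRITEFUNCTION callback:
// copies the received chunk into the buffer at info->index.
inline std::size_t writeChunk(char* ptr, std::size_t size, std::size_t nmemb, void* userdata)
{
    auto* info = static_cast<FixedBufferInfo*>(userdata);
    std::size_t n = size * nmemb;
    std::size_t room = info->size - info->index;
    if (n > room) n = room; // never write past the end of the buffer
    std::memcpy(info->buffer + info->index, ptr, n);
    info->index += n;
    return n;
}

// Call before each retry so the next response starts at the beginning.
inline void resetForRetry(FixedBufferInfo& info)
{
    info.index = 0;
}
```

This does not answer the open questions about the headers or the curl structure; it only covers the write offset.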

jpswinski commented 1 year ago

Here is an error from the logs showing a GET that returned a 500; a subsequent request for the same object succeeded:

2023-09-27 11:49:35 | ip=10.0.172.45 level=critical caller=S3CurlIODriver.cpp:459 msg="S3 get returned http error <500>: data/GEDI/GEDI01_B_2019109210809_O01988_03_T02056_02_005_01_V002.h5"