As it happens, I was testing large data loads last week and saw the same errors. I think fixes should be in place now; at least I've been able to process hundreds of files and terabytes of data without errors so far.
The primary problem seems to be that h5pyd doesn't retry POST requests on failure. This is apparently by design in the urllib3 Retry class: POST is not idempotent, so it is excluded from the default set of retryable methods. I made a change to explicitly enable POST retries here: https://github.com/HDFGroup/h5pyd/commit/a9e45f107b8fe881ffd6a9be7af5aee69ffd4fe8.
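For context, here's a minimal sketch of the general pattern for opting POST into retries with urllib3 and requests. This is not the exact h5pyd change; note that the parameter is named `allowed_methods` in urllib3 1.26+ (older releases called it `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# urllib3's Retry only retries idempotent methods by default
# (GET, HEAD, PUT, DELETE, OPTIONS, TRACE); POST is excluded.
# Listing POST in allowed_methods explicitly opts it back in.
retry = Retry(
    total=5,
    backoff_factor=1,                       # exponential backoff between tries
    status_forcelist=(500, 502, 503, 504),  # retry on these server errors
    allowed_methods=frozenset(
        ["GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE", "POST"]
    ),
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
# Requests made through this session now retry failed POSTs as well.
```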
The `UnboundLocalError` was caused by the `dest` variable not being initialized when an exception was raised. The fix is here: https://github.com/HDFGroup/h5pyd/commit/e1b12c8231994311b352d1285ccb8e2162600ea5.
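For anyone hitting the same traceback elsewhere, this is a hypothetical reduction of the bug pattern (the function names are illustrative, not the actual h5pyd code):

```python
# If the call inside the try block raises before `dest` is assigned,
# any later reference to `dest` raises UnboundLocalError, masking the
# original error.

def open_destination(name):
    raise IOError("simulated connection failure")  # stand-in for a failing call

def copy_object(name):
    try:
        dest = open_destination(name)  # raises before `dest` is bound
    except IOError as err:
        # `dest` was never assigned, so this line itself raises
        # UnboundLocalError instead of reporting the real error.
        print(f"copy to {dest} failed: {err}")

def copy_object_fixed(name):
    dest = None  # initialize up front so the handler can always reference it
    try:
        dest = open_destination(name)
    except IOError as err:
        print(f"copy to {dest} failed: {err}")  # prints "copy to None failed: ..."
```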
I'll bump the h5pyd release and tag it on GitHub. Then if you can rerun your overnight load and confirm there are no problems, that would be great.
@jpswinski - is this still an issue, or can I close this now?
I have not seen a recurrence of this issue since updating HSDS to include your fix, so from my perspective it can be closed.
Closing!
During an overnight load of large (750 MB to 6 GB) HDF5 files into HSDS using the `hsload --link` command, 8 out of 63 files were later found to be corrupted. Attempts to access those files resulted in the following error:
Subsequent reloads of the corrupted files with the same `hsload` command fixed the problem. During the load process, the following error (output from `hsload`) was observed multiple times, each occurrence requiring a manual reload of the file. We assume this is the cause of the corruption, but we have not been able to demonstrate that conclusively.
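For reference, a sanity check like the following can surface corrupted domains right after a load. This sketch assumes h5pyd's h5py-compatible `File` and `visititems` API; the domain path is a placeholder:

```python
import h5pyd

# Walk every object in the domain and touch its metadata; a corrupted
# domain should raise an access error here without having to read the
# full datasets.
def check_domain(domain):
    with h5pyd.File(domain, "r") as f:
        def visit(name, obj):
            # Datasets expose .shape; groups don't, so report None.
            print(name, getattr(obj, "shape", None))
        f.visititems(visit)

check_domain("/home/myuser/granule.h5")  # placeholder domain path
```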