HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

hsload corrupted files during linked load #71

Closed jpswinski closed 3 years ago

jpswinski commented 3 years ago

During an overnight load of large (750MB to 6GB) h5 files to HSDS using the hsload --link command, 8 out of 63 files were afterward found to be corrupted. Attempts to access those files resulted in the following error:

HDF5 REST VOL-DIAG: Error detected in HDF5 REST VOL (1.0.0) thread 139748614063872:
  #000: vol-rest/src/rest_vol_dataset.c line 321 in RV_dataset_open(): can't locate dataset by path
    major: Dataset
    minor: Problem with path to object
  #001: vol-rest/src/rest_vol.c line 1834 in RV_find_object_by_path(): can't locate parent group for object of unknown type
    major: Symbol table
    minor: Problem with path to object
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 139748614063872:
  #000: ../../src/H5D.c line 296 in H5Dopen2(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #001: ../../src/H5VLcallback.c line 1974 in H5VL_dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object
  #002: ../../src/H5VLcallback.c line 1941 in H5VL__dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object

Subsequent reloads of the corrupted files using the same hsload command fixed the problem. During the load process, the following error (output from hsload) was observed multiple times, each time requiring a manual reload of the file. We assume this is the cause of the problem, but we have not been able to demonstrate it conclusively.

ERROR 2020-11-02 14:22:00,575 utillib.py:455 ERROR: failed to create dataset: Gateway Timeout
Traceback (most recent call last):
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5py-2.10.0-py3.8-linux-x86_64.egg/h5py/_hl/group.py", line 600, in proxy
    return func(name, self[name])
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/utillib.py", line 658, in object_create_helper
    create_dataset(obj, ctx)
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/utillib.py", line 457, in create_dataset
    return dset
UnboundLocalError: local variable 'dset' referenced before assignment

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./.pyenv/versions/3.8.3/bin/hsload", line 11, in <module>
    load_entry_point('h5pyd==0.8.0', 'console_scripts', 'hsload')()
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/hsload.py", line 309, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, compression=compression, compression_opts=compression_opts)
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/utillib.py", line 698, in load_file
    fin.visititems(object_create_helper)
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5py-2.10.0-py3.8-linux-x86_64.egg/h5py/_hl/group.py", line 601, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
SystemError: <built-in function visit> returned a result with an error set

jreadey commented 3 years ago

As it happens, I was testing large data loads last week and saw the same errors. I think fixes should be in place now (at least I've been able to process hundreds of files and terabytes of data without errors so far).

The primary problem seems to be that h5pyd does not retry POST requests on failure; this is apparently by design in the urllib3 Retry class. I made a change to explicitly enable POST retries here: https://github.com/HDFGroup/h5pyd/commit/a9e45f107b8fe881ffd6a9be7af5aee69ffd4fe8.
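
For anyone hitting the same thing in their own client code, here is a minimal sketch of what enabling POST retries can look like with urllib3 directly. It is not the h5pyd change itself (see the commit for that). It assumes urllib3 1.26 or later, where the parameter is named allowed_methods (older releases called it method_whitelist); the retry count, backoff, and status codes are illustrative values, and make_retry is a hypothetical helper name.

```python
import urllib3
from urllib3.util.retry import Retry


def make_retry(include_post=True):
    """Build a Retry policy; optionally add POST to the retryable methods.

    The default allowed methods in urllib3 cover only idempotent verbs,
    so POST is skipped unless it is added explicitly.
    """
    methods = set(Retry.DEFAULT_ALLOWED_METHODS)
    if include_post:
        methods.add("POST")
    return Retry(
        total=5,                           # illustrative: up to 5 retries
        backoff_factor=0.5,                # illustrative: growing sleep between tries
        status_forcelist=[502, 503, 504],  # 504 is the Gateway Timeout seen above
        allowed_methods=methods,
    )


retry = make_retry()
# Requests issued through this pool use the policy above.
http = urllib3.PoolManager(retries=retry)
```

One caveat: POST is not idempotent in general, so a retry after a timeout can repeat an operation that actually succeeded on the server. Whether that is acceptable depends on the endpoint.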

The UnboundLocalError was caused by the dset variable not being initialized when an exception was triggered. Fix is here: https://github.com/HDFGroup/h5pyd/commit/e1b12c8231994311b352d1285ccb8e2162600ea5.
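
The pattern behind that traceback can be reproduced in a few lines. This is a simplified sketch, not the actual utillib.py code: failing_create, create_dataset_buggy, and create_dataset_fixed are hypothetical names, and the IOError stands in for the Gateway Timeout failure.

```python
import logging


def failing_create(name):
    """Hypothetical stand-in for a server call that times out."""
    raise IOError("Gateway Timeout")


def create_dataset_buggy(name, create=failing_create):
    try:
        dset = create(name)
    except IOError as e:
        # The error is logged, but dset was never assigned.
        logging.error("failed to create dataset: %s", e)
    return dset  # raises UnboundLocalError when create() failed


def create_dataset_fixed(name, create=failing_create):
    dset = None  # bound up front, so the return below is always valid
    try:
        dset = create(name)
    except IOError as e:
        logging.error("failed to create dataset: %s", e)
    return dset
```

With the initialization in place the caller gets None back on failure instead of a secondary exception, so it can decide whether to retry or abort.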

I'll bump the h5pyd release and tag it in github. Then if you can rerun your overnight load and confirm there are no problems, that would be great.

jreadey commented 3 years ago

@jpswinski - is this still an issue, or can I close this now?

jpswinski commented 3 years ago

I have not seen a recurrence of this issue since updating hsds to include your fix, so from my perspective it can be closed.

jreadey commented 3 years ago

Closing!