MannLabs / alphabase

Infrastructure of AlphaX ecosystem
https://alphabase.readthedocs.io
Apache License 2.0

Race condition when reading speclib from two instances #180

Open mschwoer opened 3 weeks ago

mschwoer commented 3 weeks ago

Describe the bug When two alphaDIA instances access the same speclib file (for reading) at the same time, alphabase throws an error (see below).

Expected behavior No error, since the file is only opened for reading in this case. I think the problem is that files are opened in "a" (append) mode here (hdf.py):

class HDF_File(HDF_Group):
    def __init__(self, file_name, delete_existing=False):  # other parameters omitted here
        ...
        mode = "w" if delete_existing else "a"
        with h5py.File(file_name, mode):  # , swmr=True):
            pass
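
For illustration, here is a minimal standalone sketch (not alphabase code; the file name and helper functions are made up, and it assumes HDF5 file locking is enabled, which is the default) that reproduces the conflict with plain h5py: two concurrent "a" opens collide because read/write access takes an exclusive HDF5 file lock, whereas two "r" opens coexist on a shared lock.

# Minimal standalone repro sketch (not alphabase code; PATH is a made-up scratch file).
import multiprocessing as mp
import time

import h5py

PATH = "speclib_demo.hdf"


def hold_open(mode):
    # Child process: keep the file handle (and its HDF5 lock) alive for a moment.
    with h5py.File(PATH, mode):
        time.sleep(2)


def second_open(mode):
    p = mp.Process(target=hold_open, args=(mode,))
    p.start()
    time.sleep(0.5)  # let the child acquire its lock first
    try:
        with h5py.File(PATH, mode):
            print(f"mode {mode!r}: second open succeeded")
    except OSError as e:  # surfaces as BlockingIOError in the log below
        print(f"mode {mode!r}: second open failed: {e}")
    p.join()


if __name__ == "__main__":
    with h5py.File(PATH, "w"):  # create an empty file to open concurrently
        pass
    second_open("a")  # expected: lock error, as in the traceback below
    second_open("r")  # expected: succeeds, both processes only read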

Logs

[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO - 0:00:00.022445 INFO: Running DynamicLoader
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO - 0:00:00.027390 INFO: Loading .hdf library from /fs/hela_hybrid.small.hdf
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO - 0:00:00.031234 INFO: Traceback (most recent call last):
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/xx/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphadia/cli.py", line 333, in run
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     plan = Plan(
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -            ^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphadia/planning.py", line 126, in __init__
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     self.load_library()
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphadia/planning.py", line 205, in load_library
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     spectral_library = dynamic_loader(self.library_path)
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphadia/libtransform.py", line 40, in __call__
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     return self.forward(*args)
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -            ^^^^^^^^^^^^^^^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphadia/libtransform.py", line 121, in forward
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     library.load_hdf(input_path, load_mod_seq=True)
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphabase/spectral_library/base.py", line 681, in load_hdf
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     _hdf = HDF_File(
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -            ^^^^^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/alphabase/io/hdf.py", line 533, in __init__
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     with h5py.File(file_name, mode):#, swmr=True):
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -          ^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/h5py/_hl/files.py", line 562, in __init__
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "/fs/home/kraken/conda-envs/alphadia-1.6.2/lib/python3.11/site-packages/h5py/_hl/files.py", line 247, in make_fid
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO -   File "h5py/h5f.pyx", line 102, in h5py.h5f.open
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO - BlockingIOError: [Errno 11] Unable to synchronously open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
[2024-06-12, 15:19:11 UTC] {ssh.py:526} INFO - 
jalew188 commented 3 weeks ago

Try export HDF5_USE_FILE_LOCKING='FALSE' before running multiple tasks
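
A sketch of how that could look from Python (assumption on ordering: the variable should be in the environment before HDF5 initializes, so set it before importing h5py, or export it in the shell before launching the tasks):

# Sketch: disable HDF5 file locking for this process.
# Equivalent to `export HDF5_USE_FILE_LOCKING=FALSE` in the shell.
import os

os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py  # noqa: E402 -- imported after setting the variable on purpose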

jalew188 commented 3 weeks ago

To be honest, I also don't know why it is "a" instead of "r" here...

jalew188 commented 1 week ago

Try export HDF5_USE_FILE_LOCKING='FALSE' before running multiple tasks

@mschwoer Is the issue solved by this command?

mschwoer commented 1 week ago

Didn't check yet... but generally, I feel that file locking has its benefits, e.g. preventing corruption from simultaneous writes, so disabling it would make things less robust. Could we not just change the "a" into an "r" in the piece of code mentioned above?

jalew188 commented 1 week ago

I think Sander used "a" instead of "r" on purpose ... I think we should add a readonly kwarg to the HDF reader

mschwoer commented 4 days ago

There is already a read_only parameter... can't we leverage it, like this:

        if delete_existing:
            mode = "w"  # create/overwrite the file
        elif read_only:
            mode = "r"  # read-only: concurrent readers only need a shared lock
        else:
            mode = "a"  # read/write, create if missing (takes an exclusive lock)
        with h5py.File(file_name, mode):  # , swmr=True):
            pass
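
On the caller side, the read path could then request read-only access explicitly, something like this (a sketch, assuming the existing read_only kwarg is wired through to the mode selection above):

# Hypothetical caller-side sketch: open the speclib read-only so concurrent
# readers only take shared HDF5 locks and no longer collide.
from alphabase.io.hdf import HDF_File

hdf = HDF_File("/fs/hela_hybrid.small.hdf", read_only=True)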