dbbs-lab / bsb-core

The Brain Scaffold Builder
https://bsb.readthedocs.io
GNU General Public License v3.0

Loading the same file multiple times with MPI #893

Open drodarie opened 1 month ago

drodarie commented 1 month ago

Running the following Python script:

from bsb import from_storage

forward_model = from_storage('mouse_cereb_dcn_io_nest.hdf5')
inverse_model = from_storage('mouse_cereb_dcn_io_nest2.hdf5')

with MPI (the more cores, the more likely it is to happen) produces the following stack trace:

Traceback (most recent call last):
  File "/home/toromis/workspace/dbbs/test_load_net.py", line 15, in <module>
    inverse_model = from_storage('mouse_cerebellum.hdf5')
  File "/home/toromis/workspace/dbbs/bsb/bsb-core/bsb/profiling.py", line 159, in decorated
    return f(*args, **kwargs)
  File "/home/toromis/workspace/dbbs/bsb/bsb-core/bsb/core.py", line 50, in from_storage
    return open_storage(root).load()
  File "/home/toromis/workspace/dbbs/bsb/bsb-core/bsb/storage/__init__.py", line 379, in open_storage
    raise IOError(
OSError: Storage `mouse_cerebellum.hdf5` not recognized as any installed format: 'hdf5', 'fs'

and then the run gets stuck. Here, the exception was raised on only one core, but it can also happen on several cores.

drodarie commented 3 weeks ago

OK, after debugging a bit more, it seems to be an h5py lock issue again:

[Errno 11] Unable to synchronously open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

Still, I have no clue why bsb does not detect the raised exception and stop the loading.

Helveg commented 2 weeks ago

Hmm, AFAICT loading multiple files shouldn't lead to any issues since they should both create their own MPILock instances, no?

drodarie commented 2 weeks ago

Sorry, I was not clear enough. I strongly believe the underlying issue is linked to hdf5 not properly locking the file, not to bsb. However, an exception is clearly raised on one of the MPI cores without being detected by the others, and that leads to a situation where the whole bsb process gets stuck. Probably, by the time one core has failed and raised the exception, the other cores have reached a point where they have to wait for the failing one? Should we add a fail-safe that kills all MPI processes when an exception is raised? With MPI_Abort, for instance?
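A minimal sketch of what such a fail-safe could look like, not an existing bsb feature. The helper name `install_abort_hook` is hypothetical, and `comm` is assumed to be an mpi4py communicator such as `MPI.COMM_WORLD`, whose `Abort` method wraps MPI_Abort:

```python
import sys
import traceback


def install_abort_hook(comm, errorcode=1):
    """Abort every rank of `comm` when an uncaught exception reaches the top level.

    `comm` is assumed to behave like an mpi4py communicator
    (for example `mpi4py.MPI.COMM_WORLD`); only `comm.Abort` is used.
    """

    def hook(exc_type, exc_value, exc_tb):
        # Print the traceback first, otherwise the abort would hide the cause.
        traceback.print_exception(exc_type, exc_value, exc_tb)
        sys.stderr.flush()
        # Terminate all ranks instead of leaving the others waiting forever.
        comm.Abort(errorcode)

    sys.excepthook = hook
    return hook
```

With mpi4py installed, this would be called once at startup, for example `install_abort_hook(MPI.COMM_WORLD)`. It only turns a hang into a hard stop; it does not address the locking problem itself.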

Helveg commented 2 weeks ago

The issue here is that bsb-hdf5 fails on only 1 participating MPI rank. Since the HDF5 engine fails to open the file on that rank, the BSB tries to open it with the next available engines, which all fail to open the file too, so on that rank the file is deemed to be

not recognized as any installed format: 'hdf5', 'fs'

All storage engines should be MPI aware, so it is bsb-hdf5's responsibility during open_storage, which calls engine.peek_exists and engine.recognizes, to orchestrate both of these functions correctly: either by executing them on only 1 rank and communicating the results over the MPI comm, or by executing them on each rank and then still checking whether every other rank succeeded as well.
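The second option (run on every rank, then check that all ranks succeeded) could be sketched roughly as follows. This is an illustration under assumptions, not bsb-hdf5 code: `run_on_all_ranks` is a hypothetical helper, and `comm` is assumed to be an mpi4py communicator, of which only the pickle-based `allgather` is used:

```python
def run_on_all_ranks(comm, func, *args):
    """Run `func` on every rank and fail on all ranks if any rank failed.

    `comm` is assumed to behave like an mpi4py communicator.
    """
    try:
        result, error = func(*args), None
    except Exception as exc:
        result, error = None, exc
    # Every rank takes part in the collective, including the ranks that failed,
    # so no rank is left waiting for a peer that already raised.
    errors = [e for e in comm.allgather(error) if e is not None]
    if errors:
        raise RuntimeError(f"{len(errors)} rank(s) failed") from errors[0]
    return result
```

Because the exception is caught before the collective call, all ranks reach `allgather` and then raise together, which avoids the one-rank failure described above.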

The locking issue might arise because the MPILock is not used for peek_exists or recognizes. The MPILock is created when the Storage is created, and I believe both of these operations are static methods.

All of this together points to the following solution:

Any static operation on BSB-HDF5 will lack an MPILock to synchronize concurrent access, so we should only execute these operations on 1 rank and broadcast the result.
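A rough sketch of that pattern, assuming an mpi4py-style communicator. The helper name `on_root_then_broadcast` is hypothetical and not part of bsb or bsb-hdf5; only `Get_rank` and the pickle-based `bcast` are used:

```python
def on_root_then_broadcast(comm, func, *args, root=0):
    """Run `func` on the root rank only and share its outcome with every rank.

    `comm` is assumed to behave like an mpi4py communicator.
    """
    outcome = None
    if comm.Get_rank() == root:
        try:
            outcome = (True, func(*args))
        except Exception as exc:
            # Ship the exception itself so that all ranks fail the same way.
            outcome = (False, exc)
    ok, value = comm.bcast(outcome, root=root)
    if not ok:
        raise value
    return value
```

Only the root rank touches the file, so there is no concurrent access to guard with a lock, and a failure on the root is re-raised on every rank rather than on a single one.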

This will fix the whole issue, but maybe we can do more to prevent cryptic crashes like this in the future? Since Python 3.11 we can raise ExceptionGroups. This might be a good place to raise all of the errors that each storage engine encountered, and to conclude with the final message "not recognized as any installed format: 'hdf5', 'fs'", so that none of the errors are swallowed and the user can clearly see whether the engine rejected the file intentionally or an error like this one occurred. We should then probably check what the most common failures look like for each engine, and raise descriptive errors for each, like "This is not an HDF5 file" or "This file does not exist".