drodarie opened 1 month ago
Ok after debugging a bit more, it seems to be an h5py lock issue again:
[Errno 11] Unable to synchronously open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
Still, I have no clue why bsb does not detect the raised exception and stop the loading.
Hmm, AFAICT loading multiple files shouldn't lead to any issues since they should both create their own `MPILock` instances, no?
Sorry, I was not clear enough.
I strongly believe the underlying issue is linked to hdf5 not properly locking the file, not bsb.
However, there is clearly an issue raised in one of the MPI cores that is not detected by the others, and that leads to this situation where the whole bsb process gets stuck.
Probably, by the time one core failed and raised the issue, the other cores had reached a point where they have to wait for the failing one?
Should we add a fail safe that would kill all MPI processes in case an exception is raised? With `MPI_Abort`, for instance?
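A minimal sketch of such a fail safe, assuming mpi4py and a process-wide exception hook (the hook itself is illustrative, not existing bsb code):

```python
import sys

from mpi4py import MPI


def _abort_on_uncaught(exc_type, exc_value, exc_traceback):
    # Print the original traceback first, so the real error stays visible.
    sys.__excepthook__(exc_type, exc_value, exc_traceback)
    # Then tear down every rank, so the others cannot wait forever on this one.
    MPI.COMM_WORLD.Abort(1)


# Any uncaught exception on one rank now aborts the whole MPI job
# instead of leaving the remaining ranks deadlocked.
sys.excepthook = _abort_on_uncaught
```

The trade-off is that `MPI_Abort` is a hard kill, so cleanup code (closing HDF5 files, flushing output) would not get a chance to run.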
The issue here is that `bsb-hdf5` fails on only 1 participating MPI rank. Since the HDF5 engine fails to open the file on that rank, the BSB tries to open it with the next available engines, which all fail to open the file too, so on that rank the file is deemed to be `not recognized as any installed format: 'hdf5', 'fs'`.
All storage engines should be MPI aware, and so it's `bsb-hdf5`'s responsibility during `open_storage`, which calls `engine.peek_exists` and `engine.recognizes`, to orchestrate both these functions correctly: either by executing them on only 1 rank and communicating the results over the MPI comm, or by executing them on each rank and then still checking whether every other rank succeeded as well.
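As a rough sketch of the second option, assuming mpi4py (the wrapper and the `engine.recognizes(path)` signature are illustrative, not the actual bsb-hdf5 API):

```python
from mpi4py import MPI


def recognizes_on_all_ranks(engine, path, comm=MPI.COMM_WORLD):
    """Run the check on every rank, then agree on a single, shared outcome."""
    try:
        local_result = bool(engine.recognizes(path))  # illustrative call
        local_error = None
    except Exception as exc:
        local_result = None
        local_error = exc
    # Every rank sees how every other rank fared, so a failure on one rank
    # cannot silently diverge from the rest and leave the others waiting.
    gathered = comm.allgather((local_result, local_error))
    errors = [err for _, err in gathered if err is not None]
    if errors:
        # Raise on every rank, not only on the rank that originally failed.
        raise errors[0]
    return all(result for result, _ in gathered)
```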
The cause of the locking issue might be that the `MPILock` is not used for `peek_exists` or `recognizes`: the `MPILock` is created when the `Storage` is created, and I believe both of these operations are static methods.
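Purely as an illustration of the shape of the problem (not bsb-hdf5's actual code, and assuming the `mpilock` package's `sync()`/`read()` interface): instance methods can funnel file access through the lock, while a static method has no lock to use.

```python
import h5py
from mpilock import sync


class IllustrativeHDF5Engine:
    def __init__(self, path):
        self._path = path
        # The lock only exists once an engine/Storage instance exists.
        self._lock = sync()

    def read_attrs(self):
        # Instance methods can serialize concurrent HDF5 access ...
        with self._lock.read():
            with h5py.File(self._path, "r") as f:
                return dict(f.attrs)

    @staticmethod
    def recognizes(path):
        # ... but a static method has no `self._lock`, so several ranks can
        # open the file at once and trip HDF5's own locking (errno 11).
        try:
            with h5py.File(path, "r"):
                return True
        except OSError:
            return False
```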
All of this together points to the following solution:
Any static operation on BSB-HDF5 will lack an MPILock to synchronize concurrent access, so we should only execute these operations on 1 rank and broadcast the result.
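A minimal sketch of that solution, assuming mpi4py (the wrapper name and the `engine.peek_exists(path)` signature are illustrative):

```python
from mpi4py import MPI


def peek_exists_on_master(engine, path, comm=MPI.COMM_WORLD):
    """Only rank 0 touches the file; the outcome is shared with every rank."""
    result = error = None
    if comm.Get_rank() == 0:
        try:
            result = engine.peek_exists(path)  # illustrative call
        except Exception as exc:
            error = exc
    # Broadcast both the result and any error, so all ranks either get the
    # same answer or raise the same exception instead of deadlocking.
    result, error = comm.bcast((result, error), root=0)
    if error is not None:
        raise error
    return result
```

Since only one rank ever opens the file for these static checks, nothing can race on HDF5's file lock there.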
This will fix the whole issue, but maybe we can do more to prevent cryptic crashes like this in the future? Since Python 3.11 we can raise `ExceptionGroup`s. This might be a good place to raise all of the errors that each storage engine encountered, and to conclude with the final message `not recognized as any installed format: 'hdf5', 'fs'`, so that none of the errors are swallowed and the user can clearly see whether the engine failed intentionally or an error like this occurred. We should then probably check what the most common failures look like for each engine, and raise descriptive errors for each, like "This is not an HDF5 file" or "This file does not exist".
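A rough sketch of what that could look like (Python ≥ 3.11; the loop, names and error types are illustrative, not bsb's actual `open_storage`):

```python
def open_storage_with_diagnostics(path, engines):
    """Try each engine in turn; if none accepts the file, surface every error."""
    failures = []
    for engine in engines:
        try:
            if engine.recognizes(path):  # illustrative calls
                return engine.open(path)
        except Exception as exc:
            failures.append(exc)
    message = f"{path} is not recognized as any installed format: 'hdf5', 'fs'"
    if failures:
        # Keep the per-engine errors visible (e.g. an errno 11 lock failure)
        # instead of swallowing them behind the generic message.
        raise ExceptionGroup(message, failures)
    raise RuntimeError(message)
```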
Running the following Python script:
with MPI (the more cores, the more likely it is to happen) produces the following stack trace:
and gets stuck. Here, the issue was raised on only one core, but it can also happen on several cores.