ADicksonLab / wepy

Weighted Ensemble simulation framework in Python
https://adicksonlab.github.io/wepy/index.html
MIT License

Runtime error in WepyHDF5 #112

Open SamikBose opened 8 months ago

SamikBose commented 8 months ago

Hi @salotz and @alexrd,

Ceren and I are seeing a strange issue with WepyHDF5: in the middle of a wepy simulation (say, after 100 or 200 cycles), we suddenly get this error:

Traceback (most recent call last):
  File "we_rebinding_rst4.py", line 153, in <module>
    steps_list)
  File "<boltons.funcutils.FunctionBuilder-7>", line 2, in run_simulation
  File "/home/bosesami/anaconda3/envs/wepy_new/lib/python3.7/site-packages/eliot/_action.py", line 943, in logging_wrapper
    result = wrapped_function(*args, **kwargs)
  File "/home/bosesami/software/wepy/src/wepy/sim_manager.py", line 743, in run_simulation
    self.init(num_workers=num_workers)
  File "<boltons.funcutils.FunctionBuilder-6>", line 2, in init
  File "/home/bosesami/anaconda3/envs/wepy_new/lib/python3.7/site-packages/eliot/_action.py", line 943, in logging_wrapper
    result = wrapped_function(*args, **kwargs)
  File "/home/bosesami/software/wepy/src/wepy/sim_manager.py", line 599, in init
    continue_run=continue_run)
  File "/home/bosesami/software/wepy/src/wepy/reporter/hdf5.py", line 363, in init
    alt_reps=self.alt_reps_idxs)
  File "/home/bosesami/software/wepy/src/wepy/hdf5.py", line 846, in __init__
    libver=H5PY_LIBVER, swmr=self._swmr_mode) as h5:
  File "/home/bosesami/anaconda3/envs/wepy_new/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/home/bosesami/anaconda3/envs/wepy_new/lib/python3.7/site-packages/h5py/_hl/files.py", line 177, in make_fid
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 108, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = '/dickson/s1/bosesami/comp_unbinding/lig19/vac_cavity_rebinding/sim_data/output_10000_1000_4_rst5/wepy.results.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
Exception ignored in: <function WepyHDF5.__del__ at 0x7fced6ba8b90>
Traceback (most recent call last):
  File "/home/bosesami/software/wepy/src/wepy/hdf5.py", line 913, in __del__
    self.close()
  File "/home/bosesami/software/wepy/src/wepy/hdf5.py", line 2436, in close
    if not self.closed:
AttributeError: 'WepyHDF5' object has no attribute 'closed'

Please note that this is not a case of the output h5 file already existing before the run: the error comes up in the middle of the simulation (after 100 or 200 cycles), when the file has already been created by the run itself. I don't understand why it suddenly tries to create the same file after 100 cycles. Could something be attempting to restart the simulation (hardware issues, perhaps)? The error is not consistent; it comes up about once in every 10 to 15 simulations.

alexrd commented 8 months ago

That is strange. From this line:

  File "/home/bosesami/software/wepy/src/wepy/sim_manager.py", line 743, in run_simulation
    self.init(num_workers=num_workers)

it seems like it is initializing the sim_manager. I'm assuming that you only do this once in your code, so I don't know why it would ever do it again. Maybe the cluster is attempting to restart your job after a crash or something like that? What do your log files look like (e.g. STDOUT and STDERR)? Do they show any messages twice? And do they contain information from the first 100-200 cycles?
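To illustrate why a second initialization would produce exactly this message: the traceback shows h5py creating the file with `h5f.ACC_EXCL`, i.e. exclusive creation, which refuses to touch a path that already exists. The sketch below reproduces the same behaviour with the standard library only; `create_exclusive` and the temporary directory are hypothetical stand-ins, and the file written is a plain text placeholder, not a real HDF5 file.

```python
import errno
import os
import tempfile


def create_exclusive(path):
    """Create `path`, failing if it already exists (analogous to an exclusive create)."""
    # Mode "x" is exclusive creation: it raises FileExistsError if the path exists.
    with open(path, "x") as handle:
        handle.write("placeholder\n")


workdir = tempfile.mkdtemp()
# Stand-in name only; this is not a real HDF5 file.
results_path = os.path.join(workdir, "wepy.results.h5")

create_exclusive(results_path)  # first "init": succeeds

second_error = None
try:
    create_exclusive(results_path)  # second "init" on the same path
except FileExistsError as exc:
    second_error = exc

# The second attempt fails with EEXIST, the errno reported in the traceback.
print(second_error.errno == errno.EEXIST)
```

So the error is consistent with the file having been created earlier in the same output directory and the initialization then running a second time against it.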

SamikBose commented 8 months ago

Neither stdout nor stderr contains any duplicated messages. But I observed something strange: the walker_pkl reporter, which stores the pickles of the last two cycles, has three items (screenshot attached). Nothing was stored for the final cycle (the one before the crash), though.

Screenshot 2023-10-20 at 2 03 05 PM

salotz commented 8 months ago

I would echo what Alex is saying: the init method should only be called once. I would like to see the script you use to run the simulation, and you can attach the full logs here as a file to help out.

One note on the traceback. The real error here is:

OSError: Unable to create file (unable to open file: name = '/dickson/s1/bosesami/comp_unbinding/lig19/vac_cavity_rebinding/sim_data/output_10000_1000_4_rst5/wepy.results.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)

The final error:

AttributeError: 'WepyHDF5' object has no attribute 'closed'

is a result of using the context manager: it tries to clean up something that was never initialized properly. This is pretty common, so it would probably be a good idea to make that cleanup more intelligent about how it reports this. I will open another issue for it.
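As a rough sketch of what a more forgiving cleanup could look like (this is not wepy's code; `ResultsFile` is a hypothetical stand-in for a wrapper whose constructor can fail before the `closed` attribute is set), `close` can look the attribute up with a default so that a half-built object is treated as already closed:

```python
import errno


class ResultsFile:
    """Hypothetical file wrapper used for illustration; not wepy's WepyHDF5."""

    def __init__(self, path, fail=False):
        if fail:
            # Simulate the constructor raising before `closed` is ever assigned.
            raise OSError(errno.EEXIST, "File exists", path)
        self.path = path
        self.closed = False

    def close(self):
        # getattr with a default of True treats a half-built object as closed,
        # so cleanup does not raise a secondary AttributeError.
        if not getattr(self, "closed", True):
            self.closed = True

    def __del__(self):
        self.close()


caught = None
try:
    ResultsFile("wepy.results.h5", fail=True)
except OSError as exc:
    caught = exc

# Only the original error is reported.
print(caught.errno == errno.EEXIST)
```

With this guard only the OSError surfaces, which would make the traceback easier to read.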

If you are doing something fancy in between cycles, I would recommend just calling run_cycle yourself (https://adicksonlab.github.io/wepy/_api/wepy.sim_manager.html?highlight=run_simulation_by_time#wepy.sim_manager.Manager.run_cycle). You can look at the code for run_simulation* to see what needs to be done.

As for the checkpoints, it's been a while since I looked at that code, and I would now actually recommend writing the walker states to a normal trajectory file and the weights to something like a JSON file, as the walker pickles can easily be corrupted. In any case, they will only be written if the cycle actually succeeds and all reporters before the checkpoint reporter ran successfully. So if the HDF5Reporter fails and the checkpoint reporter runs after it, there will be no checkpoint output. This is good if you are using checkpoints for simulation continuations or restarts, since otherwise you would have a gap in cycles whenever the HDF5 file doesn't contain the last cycle the checkpoint has. If you do want the checkpoint written regardless, put the checkpoint reporter before the HDF5 reporter in the list of reporters you give the sim manager, as they are run in the order specified there.
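The effect of reporter order can be sketched with a toy model. The `Reporter` class and `run_reporters` function below are hypothetical and only mimic the behaviour described above (reporters called in list order, with a failure preventing the later ones from running); wepy's real reporter classes and sim manager differ.

```python
class Reporter:
    """Hypothetical reporter for illustration; not a wepy class."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def report(self, cycle_idx, log):
        if self.fail:
            raise OSError(f"{self.name} failed on cycle {cycle_idx}")
        log.append((self.name, cycle_idx))


def run_reporters(reporters, cycle_idx):
    """Call each reporter in list order and stop at the first failure."""
    log = []
    try:
        for reporter in reporters:
            reporter.report(cycle_idx, log)
    except OSError:
        pass  # a real run would propagate the error and end the simulation
    return log


# HDF5 reporter listed first: its failure prevents the checkpoint from being written.
hdf5_first = run_reporters([Reporter("hdf5", fail=True), Reporter("checkpoint")], 100)
# Checkpoint reporter listed first: the checkpoint is written before the failure.
checkpoint_first = run_reporters([Reporter("checkpoint"), Reporter("hdf5", fail=True)], 100)

print(hdf5_first)        # []
print(checkpoint_first)  # [('checkpoint', 100)]
```

The second ordering is the one that can leave a checkpoint for a cycle that never made it into the HDF5 file, which is the gap mentioned above.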