Open SamikBose opened 1 year ago
That is strange. From this line:
File "/home/bosesami/software/wepy/src/wepy/sim_manager.py", line 743, in run_simulation
self.init(num_workers=num_workers)
it seems like it is initializing the sim_manager. I'm assuming that you only do this once in your code, so I don't know why it would ever do it again. Maybe the cluster is attempting to restart your job after a crash or something like that? What do your log files look like (e.g. STDOUT and STDERR)? Do they show any messages twice? And do they contain information from the first 100-200 cycles?
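One way to check the restart hypothesis is to have the run script record every process start in an append-only file next to the results. The following is only a sketch: `record_launch` and the `launches.jsonl` file name are hypothetical, and the two environment variables assume a SLURM cluster (they will simply be `None` elsewhere).

```python
import json
import os
import socket
import time
from pathlib import Path


def record_launch(output_dir, log_name="launches.jsonl"):
    """Append one line describing this process start; return the launch count."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    log_path = out / log_name

    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "host": socket.gethostname(),
        "pid": os.getpid(),
        # SLURM-specific variables (assumption); None under other schedulers
        "job_id": os.environ.get("SLURM_JOB_ID"),
        "restart_count": os.environ.get("SLURM_RESTART_COUNT"),
    }

    # Append mode keeps earlier entries, so a requeued job adds a second line
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

    # Count the lines written so far, including the one just added
    with open(log_path) as f:
        return sum(1 for _ in f)
```

Calling `record_launch(output_dir)` at the top of the run script and finding more than one line in the file after a crash would suggest that the script itself was started twice in the same output directory.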
Neither stdout nor stderr shows duplicated messages. But I observed something strange: the walker_pkl reporter, which stores the pickles of the last two cycles, has three items (screenshot attached). Nothing was stored for the final cycle (before the crash), though.
I would echo what Alex is saying: the init method should only be called once. I would want to see your run script, and you can attach the full logs here as a file to help out.
And just a note on the traceback. The real error here is:
OSError: Unable to create file (unable to open file: name = '/dickson/s1/bosesami/comp_unbinding/lig19/vac_cavity_rebinding/sim_data/output_10000_1000_4_rst5/wepy.results.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
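For illustration, errno 17 is `EEXIST`, which the operating system returns when a file is opened in exclusive-create mode and the path already exists. On Linux, `o_flags = c2` appears to decode to `O_RDWR | O_CREAT | O_EXCL`, which is consistent with that. The sketch below reproduces the same errno using only the standard library. It does not use HDF5 or wepy, and the helper names are hypothetical.

```python
import errno
import os
import tempfile


def create_exclusive(path):
    """Create `path` only if it does not exist yet; raise FileExistsError otherwise."""
    # Exclusive create: fails with EEXIST if the path is already present
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL)
    os.close(fd)


def second_create_errno():
    """Create a file twice in exclusive mode and return the errno of the second attempt."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wepy.results.h5")
        create_exclusive(path)  # first creation succeeds
        try:
            create_exclusive(path)  # second creation hits the existing file
        except FileExistsError as exc:
            return exc.errno
    return None
```

Here `second_create_errno()` returns `errno.EEXIST`, the same condition the traceback reports: something tried to create the results file a second time.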
The final error:
AttributeError: 'WepyHDF5' object has no attribute 'closed'
is a result of the context manager trying to clean up something that wasn't initialized properly. This is pretty common, so it would probably be a good idea to make that cleanup smarter about how it reports this. I will make another issue for this.
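A minimal sketch of what a more tolerant cleanup could look like, using a hypothetical `ResultsFile` class rather than the actual WepyHDF5 code: `close` treats a missing `closed` attribute as "never opened", so the original `OSError` is what reaches the user instead of an `AttributeError` raised during cleanup.

```python
class ResultsFile:
    """Hypothetical stand-in for a file wrapper that is cleaned up after use."""

    def __init__(self, path, fail_on_open=False):
        self.path = path
        self._fail_on_open = fail_on_open
        # Note: `closed` is deliberately not set here, only in open()

    def open(self):
        if self._fail_on_open:
            raise OSError(17, "File exists", self.path)
        self.closed = False

    def close(self):
        # If open() never ran to completion, `closed` does not exist yet;
        # treat that as "nothing to clean up" instead of raising AttributeError
        if getattr(self, "closed", True):
            return
        self.closed = True


def run_with_cleanup(results):
    """Open the file and always attempt cleanup, as a context manager would."""
    try:
        results.open()
    finally:
        results.close()
```

With a plain `if not self.closed:` check in `close`, the failing case would surface as an `AttributeError` that hides the real cause.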
If you are doing something fancy in between cycles, I would recommend just calling run_cycle yourself (https://adicksonlab.github.io/wepy/_api/wepy.sim_manager.html?highlight=run_simulation_by_time#wepy.sim_manager.Manager.run_cycle). You can look at the code for run_simulation* to see what needs to be done.
As for the checkpoints, it's been a while since I looked at that, and now I would actually just recommend writing out the walker states to a normal trajectory file and the weights to something like a JSON file, as the walker pickles can be easily corrupted. But in any case, they will only be written if the cycle actually succeeds and all reporters before it ran successfully. So if the HDF5Reporter fails and the checkpoint reporter runs after it, there will be no checkpoint output. This is the behavior you want if you are using checkpoints for simulation continuations or restarts, since otherwise you would have a gap in cycles whenever the HDF5 file doesn't contain the last cycle the checkpoint has. If you do want checkpoints to be written anyway, put the checkpoint reporter before the HDF5 reporter in the list of reporters you give the sim manager, as they are run in the order specified there.
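The effect of reporter order can be sketched with two stand-in reporters. Everything here is hypothetical and deliberately simplified: `FailingReporter`, `JSONWeightsReporter`, and `run_reporters` are not wepy's reporter interface, and the weights are plain Python floats.

```python
import json
import os
import tempfile


class FailingReporter:
    """Hypothetical reporter that always fails, standing in for a failed HDF5 write."""

    def report(self, cycle_idx, weights):
        raise OSError("simulated reporter failure")


class JSONWeightsReporter:
    """Hypothetical checkpoint reporter that writes walker weights to a JSON file."""

    def __init__(self, path):
        self.path = path

    def report(self, cycle_idx, weights):
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"cycle_idx": cycle_idx, "weights": weights}, f)
        # Rename into place so a crash mid-write cannot leave a truncated file
        os.replace(tmp_path, self.path)


def run_reporters(reporters, cycle_idx, weights):
    """Run reporters in list order; the first exception stops the rest."""
    for reporter in reporters:
        reporter.report(cycle_idx, weights)


def checkpoint_written(order):
    """Return True if the JSON checkpoint exists after running reporters in `order`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.json")
        reporters = {"ckpt": JSONWeightsReporter(path), "fail": FailingReporter()}
        try:
            run_reporters([reporters[name] for name in order], 0, [0.25, 0.75])
        except OSError:
            pass  # the failure itself is expected in this demonstration
        return os.path.exists(path)
```

`checkpoint_written(["fail", "ckpt"])` is `False`, while `checkpoint_written(["ckpt", "fail"])` is `True`: the checkpoint survives a later reporter's failure only when it comes first in the list.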
Hi @salotz and @alexrd,
We (Ceren and I) are having a weird issue with WepyHDF5: in the middle of a wepy simulation (say, after 100 or 200 cycles), we suddenly get this error:
Please note that this is not a case of the output h5 file already existing at startup: the error comes up in the middle of the simulation (after 100 or 200 cycles), when the file has already been created. I don't understand why it suddenly tries to create the same file after 100 cycles. Does it look like something is attempting to restart the simulation (hardware issues, perhaps)? The error is not consistent; it comes up about once in every 10 to 15 simulations.