SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Opening partly written HDF5 files with happi after simulation crash #655

Closed DoubleAgentDave closed 11 months ago

DoubleAgentDave commented 1 year ago

Unfortunately, some of the clusters I am using occasionally crash. This can cause errors when writing to HDF5 files. Using happi, I sometimes can't open HDF5 files that contain incompletely written data, i.e. in a probe data file there are dump times that are only half written and some of the data is missing.

For example, I recently ran a simulation with a probe diagnostic which writes an HDF5 dataset every 4 timesteps. It is quite likely that if the simulation hits the wall time before a write occurs, or if the simulation crashes during a write, then there will be half-written data in the file. But the previously written data could still be useful, and the simulation may not need to be rerun.

When you use happi to open these files the following error occurs:

signal = Ez_probe.getData()
  File "~/Smilei/happi/_Diagnostics/Diagnostic.py", line 163, in getData
    data.append( self._dataAtTime(t) )
  File "~/Smilei/happi/_Diagnostics/Diagnostic.py", line 862, in _dataLinAtTime
    A = self._getDataAtTime(t)
  File "~/Smilei/happi/_Diagnostics/Probe.py", line 372, in _getDataAtTime
    data = self._dataForTime[t][n,first:last]
TypeError: 'NoneType' object is not subscriptable

When I look at the file in something like HDFCompass I can see that there are two empty data lines, but the rest of the data is intact:

[Screenshot: the file viewed in HDFCompass]

I think it would be relatively simple to use a try-except statement somewhere to obtain this data in this scenario. It would likely be relatively easy to simulate this problem using a correctly written HDF5 file and adding a couple of unexpected data sets at the end of the file.
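As a user-side illustration of the try-except idea, here is a minimal sketch that reads a diagnostic one timestep at a time and skips the timesteps that fail. It assumes the diagnostic object offers `getTimesteps()` and `getData(timestep=...)` as happi diagnostics do; the helper name `get_readable_data` is hypothetical, and this is not the change that was later made in happi.

```python
def get_readable_data(diag):
    """Read a diagnostic timestep by timestep, skipping unreadable timesteps.

    `diag` is assumed to behave like a happi diagnostic, i.e. to provide
    getTimesteps() and getData(timestep=t). Returns three lists:
    the timesteps that loaded, their data, and the timesteps that failed.
    """
    good_steps, data, bad_steps = [], [], []
    for t in diag.getTimesteps():
        try:
            data.append(diag.getData(timestep=t))
            good_steps.append(t)
        except (TypeError, KeyError, OSError):
            # TypeError is the error shown in the traceback above;
            # KeyError/OSError are assumed candidates for other read failures.
            bad_steps.append(t)
    return good_steps, data, bad_steps
```

With the `Ez_probe` object from the traceback above, this would be called as `steps, data, bad = get_readable_data(Ez_probe)`, after which `bad` lists the dump times that could not be read.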

DoubleAgentDave commented 1 year ago

Admittedly, the 'flush_every' option helps reduce the chance of the HDF5 file being corrupted during a write while the simulation is running, so part of what I said is not quite right. But the problem still exists: some of the data at the end of the HDF5 file is sometimes miswritten during a crash.

mccoys commented 1 year ago

Thank you for suggesting this. I actually had the same comment a few days ago from a colleague.

DoubleAgentDave commented 1 year ago

Sorry, bad code in my previous post. This recreates the probes HDF5 file (at least I think it does) and seems to work every time, as far as I have tested:

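```python
import h5py

# Rebuild the probe file, keeping only the datasets that can be opened.
# The "with" statement makes sure both files are closed (and flushed) at the end.
with h5py.File("Probes0_fixed.h5", "w") as f_dest, h5py.File("Probes0.h5", "r") as f_src:
    for key in f_src:
        try:
            # Copy shape and dtype, then the data, then the dataset's attributes.
            f_dest.create_dataset_like(str(key), f_src[key])
            f_dest[key][()] = f_src[key][()]
            for attrib in f_src[key].attrs.keys():
                f_dest[key].attrs.create(attrib, f_src[key].attrs[attrib])
        except KeyError:
            print("faulty key = " + str(key))

    # Copy the file-level attributes.
    for attrib in f_src.attrs.keys():
        f_dest.attrs.create(attrib, f_src.attrs[attrib])
```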

DoubleAgentDave commented 1 year ago

Just to note: the above script doesn't always work. The attributes of each key must also be checked before the key is written to the new HDF5 file, because a key can sometimes be created correctly but not have its attributes filled in correctly.
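One way to do that check is to read the data and every attribute of a key before anything is written to the new file. The following is a sketch only, assuming (as in the script above) that the top-level entries are plain datasets. The function name `salvage` and the set of caught exceptions are assumptions, and it has not been tested against a genuinely corrupted Smilei file.

```python
import h5py


def salvage(src_path, dest_path):
    """Copy the readable top-level datasets of src_path into dest_path.

    Returns the list of keys that were skipped.
    """
    skipped = []
    with h5py.File(src_path, "r") as f_src, h5py.File(dest_path, "w") as f_dest:
        for key in f_src:
            try:
                src = f_src[key]
                # Read the data and all attributes first, so that a
                # half-written entry fails before the destination is touched.
                data = src[()]
                attrs = {name: src.attrs[name] for name in src.attrs}
                dest = f_dest.create_dataset_like(key, src)
                dest[()] = data
                for name, value in attrs.items():
                    dest.attrs.create(name, value)
            except (KeyError, OSError, TypeError, ValueError, RuntimeError):
                # Assumed set of errors for an unreadable entry; remove any
                # partially written copy so the new file stays consistent.
                if key in f_dest:
                    del f_dest[key]
                skipped.append(key)

        # File-level attributes are copied as in the script above.
        for name in f_src.attrs:
            f_dest.attrs.create(name, f_src.attrs[name])
    return skipped
```

Calling `salvage("Probes0.h5", "Probes0_fixed.h5")` would then return the faulty keys instead of printing them. For trying this without a real crash, a dangling soft link (`f["name"] = h5py.SoftLink("/missing")`) added to a good file raises `KeyError` when opened, which can serve as a stand-in for an unreadable entry, although it is not the same as real corruption.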

mccoys commented 1 year ago

Do you have an idea of how to reproduce this? I cannot get a corrupted file.

DoubleAgentDave commented 1 year ago

I just ran a simulation that got cut off in the middle. If you time it right, so that the wall time is reached midway through a large output, it can happen. The most reliable way I found to make it happen is to have very frequent dumps: I had a probe diagnostic that was recording every 4 timesteps and then dumping every 250 timesteps. If the simulation got cut off during a write, there was about a 50% chance that the file had a couple of crappy outputs that hadn't been written properly.

e.g.:

    DiagProbe(
        every=4,
        number=[10],
        origin=[0.0 + 5.0*dx],
        corners=[[Lsim - 5.0*dx]],
        fields=["Ex","Ey","Ez","Bx","By","Bz","Rho_ion","Rho_eon","Jx_eon","Jy_eon","Jz_eon","Jx_ion","Jy_ion","Jz_ion","Jx","Jy","Jz"],
        flush_every=outputtime / 10,
    )

mccoys commented 1 year ago

It does not happen on my system for some reason. Would you be able to produce a small example and send it with dropbox or equivalent?

DoubleAgentDave commented 1 year ago

Sure, I'll try finding one tomorrow

DoubleAgentDave commented 1 year ago

> It does not happen on my system for some reason. Would you be able to produce a small example and send it with dropbox or equivalent?

I've sent you a link in element chat

mccoys commented 1 year ago

I have not received it. My name on element is fredpz

DoubleAgentDave commented 1 year ago

Dammit

DoubleAgentDave commented 1 year ago

I'll send again tomorrow, I can't get access right now

DoubleAgentDave commented 1 year ago

ok, sent again, hopefully right person this time :)

mccoys commented 12 months ago

I made a change for happi in the develop branch. Could you test it?

DoubleAgentDave commented 11 months ago

Yes, that seems to allow me to access files which I couldn't before, thanks! That's eliminated a step which was quite annoying and will save me some time too, really appreciated!

DoubleAgentDave commented 11 months ago

Just to be clear, with the old version I tried to access a broken probes0.h5 file and got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/media/david_blackman/left_external/broadband/simulation_results/bandwidth/redone_diags/fixed_ions/narrow/../../../../py1D/make_signal_files.py", line 193, in start
    signal = Ez_probe.getData()
  File "/home/david_blackman/codes/smilei_new/Smilei/happi/_Diagnostics/Diagnostic.py", line 163, in getData
    data.append( self._dataAtTime(t) )
  File "/home/david_blackman/codes/smilei_new/Smilei/happi/_Diagnostics/Diagnostic.py", line 862, in _dataLinAtTime
    A = self._getDataAtTime(t)
  File "/home/david_blackman/codes/smilei_new/Smilei/happi/_Diagnostics/Probe.py", line 372, in _getDataAtTime
    data = self._dataForTime[t][n,first:last]
TypeError: 'NoneType' object is not subscriptable

Now I get no error and can successfully build up my probe signals, so I can process them properly!