Anton-Le closed this issue 2 years ago.
I can confirm that there are issues observed on our dev system as well.
For the time being I can circumvent this issue in my simulations by reverting to a PIConGPU version that still permits ADIOS1 usage, e.g. 3bab80f97b609c7c29931952d11cef6c8d5765ad
So the behavior from my earlier comment was maybe not related to this issue, but is some other issue. I observed checkpointing failing on fwk394 with openPMD 0.12.0-alpha. It appears the exception was thrown here, with the message `Chunks cannot be written for a constant RecordComponent`. It occurred when writing the first field E; particles were supposed to be written later. When I switched to version 0.14.2, the same setup runs fine. I think it should now be using the other specialization of `openPMDSpan`? @franzpoeschel do you know if we still need both specializations, and if so, is the default one correct?
I will now try on Hemera to reproduce exactly the issue Anton reported
@Anton-Le sorry it took a bit long. I've now tried your setup on the Hemera fwkt_v100 partition with its adios2/2.7.1-cuda112 and openpmd/0.13.2-cuda112-adios271 modules (standard in our profile in dev for that partition), and it seems to work.
Weird. I will re-try it.
I am still baffled by this on JWB. An updated simulation using 326ff5875fa58b819d3df707a89129baddddbf21 has produced the same (or similar) errors:
```
Unhandled exception of type 'St13runtime_error' with message 'Internal error: Encountered unknown datatype (switchType) ->35', terminating
[... repetitions omitted for brevity ...]
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/HostBufferIntern.hpp>:71
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69
```
The simulation is still the same one that has been running with a "pre-openPMD radiation" version of PIC, just with a modification of the electron temperature and a recompilation using the new PIConGPU.
The following modules have been used
```
module load GCC/9.3.0
module load CUDA/11.0
module load CMake/3.18.0
module load ParaStationMPI/5.4.7-1
module load Python/3.8.5
module load Boost/1.74.0
module load HDF5/1.10.6
```
to compile the required libraries: ADIOS2 2.7.1.436, BLOSC 1.21.0 (also 1.15.0), openPMD 0.14.2, PNGwriter 0.7.0.
I seem to have missed this issue somehow
> I think it should now be using the other specialization of `openPMDSpan`? @franzpoeschel do you know if we still need both specializations, and if so, is the default one correct?
We need both specializations as long as we support openPMD 0.12.*. The plan is to soon bump the minimum required version to the upcoming 0.14.3 (see yesterday's dev meeting).
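For context, a minimal sketch of why two code paths exist, assuming the span-based `storeChunk` overload introduced in openPMD-api 0.14; this is not PIConGPU's actual `openPMDSpan` code, and `computeValue` is a placeholder:

```cpp
#include <openPMD/openPMD.hpp> // also provides the OPENPMDAPI_VERSION_GE macro
#include <cstddef>
#include <memory>

static double computeValue(std::size_t i) { return static_cast<double>(i); } // placeholder

// Write n doubles into a prepared 1-D RecordComponent.
void writeField(openPMD::RecordComponent &rc, openPMD::Series &series, std::size_t n)
{
#if OPENPMDAPI_VERSION_GE(0, 14, 0)
    // Newer openPMD-api: ask the backend for a buffer and fill it in place,
    // saving one copy. This is what a span-based specialization targets.
    auto view = rc.storeChunk<double>({0}, {n});
    double *data = view.currentBuffer().data();
    for (std::size_t i = 0; i < n; ++i)
        data[i] = computeValue(i);
#else
    // Older openPMD-api (e.g. 0.12.*): allocate our own buffer and hand over
    // ownership. This is why the second specialization is still needed.
    std::shared_ptr<double> buffer{new double[n], std::default_delete<double[]>{}};
    for (std::size_t i = 0; i < n; ++i)
        buffer.get()[i] = computeValue(i);
    rc.storeChunk(buffer, {0}, {n});
#endif
    series.flush();
}
```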
Datatype 35 is `bool` in openPMD-api 0.14.2, so somewhere your simulation tries to deal with booleans. It would be interesting to see where exactly this is happening; can you figure out the stack trace of where the error is thrown?
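For illustration, a reduced sketch of the kind of datatype dispatch behind that message (hypothetical, not openPMD-api's actual `switchType` implementation): any `Datatype` enum value without a matching case falls through to the `runtime_error`, and `Datatype::BOOL` prints as 35 here, matching the log:

```cpp
#include <openPMD/Datatype.hpp>
#include <sstream>
#include <stdexcept>

// Dispatch a runtime Datatype to a template instantiation of the functor.
// Types the caller never expected end up in the default branch, producing
// exactly the kind of message quoted above.
template<typename Functor>
void switchNumericType(openPMD::Datatype dt, Functor &&f)
{
    using openPMD::Datatype;
    switch (dt)
    {
    case Datatype::FLOAT:  f(float{});  break;
    case Datatype::DOUBLE: f(double{}); break;
    case Datatype::INT:    f(int{});    break;
    // ... more numeric cases; crucially, no case for Datatype::BOOL ...
    default:
    {
        std::ostringstream oss;
        oss << "Internal error: Encountered unknown datatype (switchType) ->"
            << static_cast<int>(dt); // BOOL happens to be 35 in openPMD-api 0.14.2
        throw std::runtime_error(oss.str());
    }
    }
}
```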
Since ADIOS2 has no native bool type, openPMD-api should normally fall back to `unsigned char` for this purpose and you should not see this error. Bugs can happen though. Otherwise, does anyone have an idea what could be making PIConGPU try to output bools here?
Ok. I can try and figure the point out today - although I'm not particularly optimistic. In the meantime the simulation will run with an older version.
Please find the standard error output (abridged for repetitions) attached: stderr.txt
The error does not appear to be specific to checkpointing; it just appears first when writing checkpoints because they are the first large non-textual output. I tried moving the first checkpoint to 100 steps after the first radiation output and got the same error.
@franzpoeschel Thanks to your suggestion and quite a few variations of the runtime configuration, I think I have pinned the issue down to the `gammaFilter` of the radiation module! According to @PrometheusPi, the filter includes particles in the radiation computation once they pass a user-defined threshold for gamma, and this flag is a `bool`. Removing the property from the particle definitions appears to enable storage of particles and writing a checkpoint, something that failed with the `radiationMask` in place.
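For reference, this is roughly the pattern by which such a flag becomes a per-particle attribute, via PMacc's `value_identifier` macro in `speciesAttributes.param`; treat the exact names and defaults as an illustration rather than verbatim file contents:

```cpp
// In speciesAttributes.param (illustrative excerpt, not verbatim):
// a per-particle flag declared via PMacc's value_identifier macro.
// Declared as bool, it is checkpointed as a bool dataset, which ADIOS2
// cannot store natively.
value_identifier(bool, radiationMask, false);

// The workaround discussed below: either widen the attribute itself, e.g.
//   value_identifier(char, radiationMask, 0);
// or keep the bool in memory and convert at write/read time (see the
// sketch further below).
```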
Thanks for investigating, @Anton-Le. Following your message I also checked, and in radiation (and, I guess, transition radiation) simulations we are indeed trying to write a dataset of `bool`. Since this is not supported by ADIOS2, according to @franzpoeschel, we should probably change this dataset to `char`? Either change the particle attribute altogether, or convert at the time of writing.
I think it makes a lot of sense to do this before the release, and hence to backport the change to the release candidate branch. I can make the change if necessary.
@sbastrakov, @franzpoeschel: in my opinion the conversion to and from an ADIOS2-conformant data format should happen transparently to the user at storage/retrieval time. Defining a mask as a `bool` appears sufficiently intuitive to me that it does not warrant a change to a different data type.
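A minimal sketch of that transparent-conversion idea (assumed helper names, not the code of the linked PR): keep the attribute `bool` in memory, widen to `char` in the dataset on store, and narrow back on load:

```cpp
#include <openPMD/openPMD.hpp>
#include <cstddef>
#include <memory>
#include <vector>

// Store an in-memory bool mask as a CHAR dataset, since ADIOS2 has no bool.
void storeBoolAsChar(openPMD::RecordComponent &rc, openPMD::Series &series,
                     std::vector<bool> const &mask)
{
    std::size_t const n = mask.size();
    std::shared_ptr<char> buffer{new char[n], std::default_delete<char[]>{}};
    for (std::size_t i = 0; i < n; ++i)
        buffer.get()[i] = mask[i] ? 1 : 0; // widen bool -> char
    rc.resetDataset(openPMD::Dataset(openPMD::Datatype::CHAR, {n}));
    rc.storeChunk(buffer, {0}, {n});
    series.flush();
}

// On restart, read the chars back and narrow to bool again.
std::vector<bool> loadCharAsBool(openPMD::RecordComponent &rc,
                                 openPMD::Series &series, std::size_t n)
{
    auto chunk = rc.loadChunk<char>({0}, {n});
    series.flush(); // the actual read happens at flush time
    std::vector<bool> mask(n);
    for (std::size_t i = 0; i < n; ++i)
        mask[i] = chunk.get()[i] != 0;
    return mask;
}
```

With this approach the user-facing particle definition keeps its intuitive `bool`, and only the serialized representation changes.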
Okay, let me try to do it this way first. Hopefully what I'm thinking of works.
Okay, hopefully the linked PR fixes it. We also need to backport the fix to the release candidate branch.
I will test it as soon as JUWELS Booster is back from the dead again (which should, hopefully, be towards the end of today).
I see, you found the issue during my holiday. Using chars for this purpose is probably for the best, yeah.
@Anton-Le Did @sbastrakov's pull request solve the issue?
I guess you meant to tag @Anton-Le. As far as I can see, there was still some issue observed on JUWELS; it is not clear whether it is related, and it is hard to check now that the allocation has ended.
You are right @sbastrakov. I edited/fixed the comment. Thanks for your feedback.
Unfortunately I could not test this on JWB anymore. I will launch the same test on Hemera today to check the problem. The problem on JWB was that, although I could write a checkpoint with the fix, a restart from said checkpoint failed.
@Anton-Le in case you are using a moving window, it could have been the same issue as in #3899. The fix #3902 is soon to be merged. The bug could have caused not just weird results, but also crashes.
Btw, that would also have explained why my small reproducer runs did not show the error: it occurs only after the window has actually slid (i.e. moved by at least one local domain size).
That could very well be it! I'm normally using a moving window.
Did you output PNGs @Anton-Le? If yes, you can check whether the bug was triggered after the restart.
Iirc it was crashing during the restart?
Btw, the fix is now merged to `dev`.
I will close this issue; it should be fixed by #3890, and we do not have any compute time on JUWELS left to reproduce the issue.
I have attempted to run a simulation with the newest PIC (baf0da494bd4d7e432033dd61242f4efcba1d39d) and openPMD 0.14.1 and 0.14.2/dev (040a9b0) on JWB and Hemera, to no avail. The simulations consistently fail on the first attempt to write data with openPMD. The errors are identical whether I try to write particle & field data or a checkpoint.
Excerpts from `stderr` for the above commit of PIC:

```
Unhandled exception of type 'St13runtime_error' with message 'Internal error: Encountered unknown datatype (switchType) ->35', terminating
```
My guess was that the problem lies with the `CompositeBinarySwitchActivationFunctor`, i.e., the functor that allows me to switch from acceleration to actual propagation without checkpointing first. However, removing said functor from `species.param` and `particle.param` did not solve the issue. These, or largely similar, errors have also been observed on Hemera.
Library configurations and config/CMake outputs are attached: LibConfig.md