ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Execution of 'new' PIConGPU with openPMD fails with `unknown datatype (switchType)` #3732

Closed Anton-Le closed 2 years ago

Anton-Le commented 3 years ago

I have attempted to run a simulation with the newest (baf0da494bd4d7e432033dd61242f4efcba1d39d) PIConGPU and openPMD 0.14.1 and 0.14.2/dev (040a9b0) on JWB and hemera - to no avail.

The simulations will consistently fail on the first attempt to write data with openPMD. The errors are identical, whether I try to write particle & field data or a checkpoint.

Excerpts from stderr for the above commit of PIC: Unhandled exception of type 'St13runtime_error' with message 'Internal error: Encountered unknown datatype (switchType) ->35', terminating

[cupla] Error: </p/project/pwfaradiation/lebedev1/PIConGPU_SourcesAndLibs_Aug21/picongpu-dev/include/pmacc/../pmacc/memory/buffers/HostBufferIntern.hpp>:71 
[cupla] Error: </p/project/pwfaradiation/lebedev1/PIConGPU_SourcesAndLibs_Aug21/picongpu-dev/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69
[cupla] Error: </p/project/pwfaradiation/lebedev1/PIConGPU_SourcesAndLibs_Aug21/picongpu-dev/include/pmacc/../pmacc/me
mory/buffers/Buffer.hpp>:69 
 6 0x0000000000e9f768 _ZThn56_N8picongpu9ParticlesIN5pmacc4meta6StringIJLc119ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0E
Lc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEEN5boost3mpl6v_itemINS_11chargeRatioINS_20ChargeRatioElectronsENS1_13pmacc_isAliasEEENS7_INS_9massRatioINS_18MassRatioElectronsESA_EENS7_INS_12densityRatioINS_17DensityRatioBunchESA_EENS7_INS_7currentINS_13currentSolver3EmZINS_9particles6shapes3TSCENSJ_8strategy16CachedSupercellsEEESA_EENS7_INS_13interpolationINS_28FieldToParticleInterpolationISN_NS_30AssignedTrilinearInterpolationEEESA_EENS7_INS_5shapeISN_SA_EENS7_INS_14particlePusherINSL_6pusher9CompositeINS10_12AccelerationENS10_5BorisENS10_38CompositeBinarySwitchActivationFunctorILj22838EEEEESA_EENS6_7vector0IN4mpl_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENS7_INS_13radiationMaskENS7_INS_13momentumPrev1ENS7_INS_9weightingENS7_INS_8momentumENS7_INS_8positionINS_12position_picESA_EES1B_Li0EEELi0EEELi0EEELi0EEELi0EEEED0Ev()  ???:0

My guess was that the problem lies with the CompositeBinarySwitchActivationFunctor, i.e., the functor that allows me to switch from acceleration to actual propagation w/o checkpointing first. However, removing said functor from species.param and particle.param did not solve the issue.

These - or largely similar - errors have also been observed on hemera.

Library configurations and config/cmake outputs are attached. LibConfig.md

sbastrakov commented 3 years ago

I can confirm that there are issues observed on our dev system as well.

Anton-Le commented 3 years ago

For the time being I can circumvent this issue in my simulations by reverting to a PIConGPU version that permits ADIOS1 usage, e.g. 3bab80f97b609c7c29931952d11cef6c8d5765ad

sbastrakov commented 3 years ago

So my earlier comment about what I observed was maybe not related to this issue, but to some other one. I observed checkpointing failing on fwk394 with openPMD 0.12.0-alpha. It appears the exception was thrown here with the message Chunks cannot be written for a constant RecordComponent. It occurred when writing the first field E; particles were supposed to be written later. When I switched to version 0.14.2, the same setup ran fine. I think it should now be using the other specialization of openPMDSpan? @franzpoeschel do you know if we still need both specializations, and if so, is the default one correct?

I will now try on Hemera to reproduce the exact issue Anton reported.

sbastrakov commented 3 years ago

@Anton-Le sorry it took a bit long. I've now tried your setup on Hemera fwkt_v100 partition with its adios2/2.7.1-cuda112 and openpmd/0.13.2-cuda112-adios271 modules (standard in our profile in dev for that partition), and it seems to work.

Anton-Le commented 3 years ago

Weird. I will re-try it.

Anton-Le commented 3 years ago

I am still baffled by this on JWB. An updated simulation using 326ff5875fa58b819d3df707a89129baddddbf21 has produced the same (or similar) errors:

Unhandled exception of type 'St13runtime_error' with message 'Internal error: Encountered unknown datatype (switchType) ->35', terminating
[... repetitions omitted for brevity ...]
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/HostBufferIntern.hpp>:71 
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69 
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69 
[cupla] Error: </p/project/pwfaradiation/lebedev1/picsrc/picongpu_10102021/include/pmacc/../pmacc/memory/buffers/Buffer.hpp>:69 

The simulation is still the same one that has been running with a "pre-openPMD radiation" version of PIC, just with a modified electron temperature and a recompilation with the new PIConGPU.

The following modules were used

module load GCC/9.3.0
module load CUDA/11.0
module load CMake/3.18.0
module load ParaStationMPI/5.4.7-1
module load Python/3.8.5
module load Boost/1.74.0
module load HDF5/1.10.6

to compile the required libraries: ADIOS 2: 2.7.1.436, BLOSC: 1.21.0 (also 1.15.0), openPMD: 0.14.2, PNGwriter: 0.7.0.

franzpoeschel commented 3 years ago

I seem to have missed this issue somehow.

I think now it should be using the other specialization of openPMDSpan? @franzpoeschel do you know if we still need both specializations, and if so is the default one correct?

We need both specializations as long as we support openPMD 0.12.*. The plan is to bump the minimum required version to the upcoming 0.14.3 soon (see yesterday's dev meeting).

Datatype 35 is bool in openPMD-api 0.14.2, so somewhere your simulation tries to deal with booleans. It would be interesting to see where exactly this is happening, can you figure out the stack trace of where the error is thrown?

Otherwise, does anyone have an idea what could be making PIConGPU try to output bools here?
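For illustration, the error message can be traced to a datatype-dispatch pattern like the sketch below. This is not openPMD-api's actual code - the enum names and handled cases here are stand-ins - but it shows how an unhandled entry (here bool, which is value 35 per the comment above) surfaces as `->35` in the exception text:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative stand-in for openPMD-api's internal Datatype enum;
// only the BOOL = 35 correspondence is taken from the discussion above.
enum class Datatype : int { CHAR = 0, INT = 1, DOUBLE = 2, BOOL = 35 };

// Sketch of a switchType-style dispatcher: every datatype the backend
// supports has a case; anything else falls through to a runtime_error
// that embeds the numeric enum value, yielding messages like "->35".
std::string dispatch(Datatype dt)
{
    switch (dt)
    {
    case Datatype::CHAR:   return "char";
    case Datatype::INT:    return "int";
    case Datatype::DOUBLE: return "double";
    default:
        throw std::runtime_error(
            "Internal error: Encountered unknown datatype (switchType) ->"
            + std::to_string(static_cast<int>(dt)));
    }
}
```

Under this reading, the fix is either to handle bool in the dispatcher's backend or to stop handing it bool data in the first place.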

Anton-Le commented 3 years ago

Ok. I can try and figure the point out today - although I'm not particularly optimistic. In the meantime the simulation will run with an older version.

Anton-Le commented 3 years ago

stderr.txt Please find the standard error output (abridged for repetitions) attached.

The error does not appear to be specific to checkpointing - it just shows up first when writing checkpoints because those are the first large non-textual outputs. I tried moving the first checkpoint to 100 steps after the first radiation output and got the same error.

Anton-Le commented 3 years ago

@franzpoeschel Thanks to your suggestion and quite a few variations of the runtime configuration, I think I have pinned the issue down to the gammaFilter of the radiation module! According to @PrometheusPi, the filter includes particles in the computation of the radiation once they have passed a user-defined threshold for \gamma - and this flag is a bool.

Removing the attribute from the particle definitions appears to enable storing particles and writing a checkpoint, both of which failed with the radiationMask present.

sbastrakov commented 3 years ago

Thanks for investigating, @Anton-Le. Following your message I also checked, and in radiation (and, I guess, transition radiation) simulations we are indeed trying to write a dataset of bool. Since this is not supported by ADIOS2 according to @franzpoeschel, we should probably change this dataset to char? Either the particle attribute altogether, or convert at the time of writing.

sbastrakov commented 3 years ago

I think it makes a lot of sense to do this before the release, and so to backport the change to the release candidate branch. I can make the change if necessary.

Anton-Le commented 3 years ago

@sbastrakov , @franzpoeschel In my opinion the conversion to and from an ADIOS2-conformant data format should happen transparently to the user at storage/retrieval time. The definition of a mask as a bool appears sufficiently intuitive to me to not warrant a change to a different data type.
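A transparent conversion could look roughly like the following sketch (the function names are hypothetical, not PIConGPU's actual API): the mask remains a bool for the user and is only packed to an ADIOS2-conformant unsigned char at write time, then unpacked again on read:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack a bool mask into uint8_t for output,
// since ADIOS2 cannot store bool datasets directly.
std::vector<std::uint8_t> packMaskForOutput(std::vector<bool> const &mask)
{
    std::vector<std::uint8_t> out(mask.size());
    for (std::size_t i = 0; i < mask.size(); ++i)
        out[i] = mask[i] ? 1u : 0u;
    return out;
}

// Hypothetical helper: restore the bool mask when reading a checkpoint,
// so user-facing code never sees the storage representation.
std::vector<bool> unpackMaskFromInput(std::vector<std::uint8_t> const &raw)
{
    std::vector<bool> mask(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i)
        mask[i] = (raw[i] != 0);
    return mask;
}
```

With such a round trip, the bool semantics of radiationMask stay intact in species.param while the on-disk dataset becomes a supported type.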

sbastrakov commented 3 years ago

Okay, let me try to do it this way first. Hopefully what I'm thinking of works.

sbastrakov commented 3 years ago

Okay, hopefully the linked PR fixes it. We also need to backport the fix to the release candidate branch.

Anton-Le commented 3 years ago

I will test it as soon as JUWELS Booster is back from the dead again (which should, hopefully, be towards the end of today).

franzpoeschel commented 3 years ago

I see, you found the issue during my holiday. Using chars for this purpose is probably for the best, yeah.

PrometheusPi commented 3 years ago

@Anton-Le Did @sbastrakov's pull request solve the issue?

sbastrakov commented 3 years ago

I guess you meant to tag @Anton-Le. As far as I can see, there was still some issue observed on JUWELS; it is not clear whether it is related, and it is hard to check now that the allocation has ended.

PrometheusPi commented 3 years ago

You are right @sbastrakov. I edited/fixed the comment. Thanks for your feedback.

Anton-Le commented 3 years ago

Unfortunately I could not test this on JWB anymore. I will launch the same test on hemera today to check the problem. The problem on JWB was that, although I could write a checkpoint with the fix, restarting from said checkpoint failed.

sbastrakov commented 3 years ago

@Anton-Le in case you are using a moving window, it could have been the same issue as in #3899. The fix #3902 is soon to be merged. That bug could have caused not just weird results, but also crashes.

sbastrakov commented 3 years ago

Btw, that would also explain why my small reproducer runs did not show the error: it occurs only after the window has actually slid (i.e., moved by at least one local domain size).

Anton-Le commented 3 years ago

That could very well be it! I'm normally using a moving window.

PrometheusPi commented 3 years ago

Did you output PNGs, @Anton-Le? If yes, you can check whether the bug was triggered after the restart.

sbastrakov commented 3 years ago

Iirc it was crashing during the restart? Btw, the fix is now merged into dev.

psychocoderHPC commented 2 years ago

I will close this issue; it should be fixed by #3890, and we do not have any compute time on JUWELS to reproduce the issue.