ComputationalRadiationPhysics / isaac

In Situ Animation of Accelerated Computations :microscope:
http://ComputationalRadiationPhysics.github.io/isaac/
GNU Lesser General Public License v3.0

ISAAC plugin exits with segmentation fault #131

Open benjha opened 3 years ago

benjha commented 3 years ago

Hi @PrometheusPi @psychocoderHPC,

After several unsuccessful attempts to get traces out of TAU, I ran PIConGPU and ISAAC in a default configuration (profiling off, dumping visualization frames to Alpine, 1000 steps with checkpoint.restart.loop=3, using the /etc/picongpu/8_isaac.cfg file) and noticed that the simulation crashes with the following errors at the end of its execution, which is why TAU can't generate the traces:

[h09n09:151879] *** Process received signal ***
[h09n09:151879] Signal: Segmentation fault (11)
[h09n09:151879] Signal code: Address not mapped (1)
[h09n09:151879] Failing at address: 0x3be700000008
[h09n09:151879] [ 0] [d22n15:170622] *** Process received signal ***
[d22n15:170622] Signal: Segmentation fault (11)
[d22n15:170622] Signal code: Address not mapped (1)
[d22n15:170622] Failing at address: 0x19f800000008
[h09n09:151880] *** Process received signal ***
[h09n09:151880] Signal: Segmentation fault (11)
[h09n09:151880] Signal code: Address not mapped (1)
[h09n09:151880] Failing at address: 0x19fe00000008
[h09n09:151880] [ 0] [0x2000000504d8]
[h09n09:151880] [ 1] [0x2000000504d8]
[h09n09:151879] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[h09n09:151879] [ 2] [d22n15:170623] *** Process received signal ***
[d22n15:170623] Signal: Segmentation fault (11)
[d22n15:170623] Signal code: Address not mapped (1)
[d22n15:170623] Failing at address: 0x3be800000008
/gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[h09n09:151880] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[h09n09:151880] [ 3] [d22n15:170622] [ 0] [0x2000000504d8]
[d22n15:170622] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[h09n09:151879] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[h09n09:151879] [ 4] [d22n15:170623] [ 0] [0x2000000504d8]
[d22n15:170623] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[d22n15:170623] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[d22n15:170622] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[d22n15:170622] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[d22n15:170622] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[h09n09:151880] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[h09n09:151880] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[h09n09:151880] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[d22n15:170623] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[d22n15:170623] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[d22n15:170623] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[d22n15:170623] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[h09n09:151879] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[h09n09:151879] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[h09n09:151879] *** End of error message ***
/gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[d22n15:170622] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[d22n15:170622] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[d22n15:170622] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[h09n09:151880] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[d22n15:170623] *** End of error message ***
ERROR:  One or more process (first noticed rank 6) terminated with signal 11 (core dumped)

It looks like the issue is in the pluginUnload() method in IsaacPlugin.hpp, which in turn calls the IsaacVisualization destructor.
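
For anyone reading the backtrace, the mangled frame names can be turned back into C++ names. Below is a minimal sketch, assuming a GCC or Clang toolchain (it relies on `abi::__cxa_demangle` from `<cxxabi.h>`); the helper name `demangle` is only illustrative and is not part of PIConGPU or ISAAC.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>   // std::free
#include <memory>
#include <string>

// __cxa_demangle returns a malloc'ed buffer, so it must be released with free.
struct FreeDeleter
{
    void operator()(char* p) const noexcept { std::free(p); }
};

// Illustrative helper: demangle an Itanium-ABI symbol name.
// Returns the input unchanged if it is not a mangled name or demangling fails.
std::string demangle(const std::string& mangled)
{
    // Mangled C++ symbols start with "_Z"; leave everything else (e.g. "main") alone.
    if (mangled.compare(0, 2, "_Z") != 0)
        return mangled;

    int status = 0;
    std::unique_ptr<char, FreeDeleter> result(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status));

    return (status == 0 && result) ? std::string(result.get()) : mangled;
}
```

Passing the frame `_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv` from the trace above to this helper gives `picongpu::isaacP::IsaacPlugin::pluginUnload()`. The `c++filt` command-line tool does the same for a whole log.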

Can you reproduce this error?

PrometheusPi commented 3 years ago

@benjha Thanks for reporting the error. Since we are currently pushing out new versions of our software, could you please specify which version you are using that produces the error?

Then we can quickly check whether we are able to reproduce the error on hemera as well.

benjha commented 3 years ago

PIConGPU is from the dev branch as of Nov. 2020, with its own Alpaka distribution:

commit 84e03980f2a56c7aea24d88bc3be9eb43f1a3197
Merge: aa86f2d c5208f4
Author: Sergei Bastrakov <sergey.bastrakov@gmail.com>
Date:   Wed Nov 25 10:50:46 2020 +0100

ISAAC:

commit 47c475ddd3fcd732964f5ce22edfe2fbcfae2b14
Merge: 3186666 74ab372
Author: René Widera <r.widera@hzdr.de>
Date:   Fri Nov 6 13:30:40 2020 +0100

    Merge pull request #118 from ComputationalRadiationPhysics/dev

    Merge json-rodarae file to latetest release cadidate

PrometheusPi commented 3 years ago

@benjha Thanks for providing the details. I will see whether I can reproduce this bug.

benjha commented 3 years ago

Hi @PrometheusPi

I am installing the current PIConGPU dev branch with ISAAC 1.5.2 to verify whether they work properly in this case.

I am getting a list of errors like these:

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #135: namespace "alpaka" has no member "Dev"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #65: expected a ";"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #135: namespace "alpaka" has no member "DimInt"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #65: expected a ";"

This is likely an Alpaka version mismatch between the version PIConGPU dev uses and the one ISAAC uses.

Were there any changes to the way compilation works?

psychocoderHPC commented 3 years ago

Are you sure you used the release 1.5.2 and not the current dev branch? The dev branch of ISAAC is currently incompatible with the PIConGPU dev branch. There is a PR https://github.com/ComputationalRadiationPhysics/picongpu/pull/3498 in PIConGPU to fix it but we need to switch our PIConGPU CI first to the ISAAC dev branch.

The release 1.5.2 is currently checked together with PIConGPU dev.

psychocoderHPC commented 3 years ago

@FelixTUD Could you please test the current dev of PIConGPU together with the release 1.5.2?

benjha commented 3 years ago

I've rechecked dependencies and fixed the Alpaka mismatch issue.

With the current PIConGPU dev branch and ISAAC 1.5.2, using the following configuration:

#################################
## Section: Required Variables ##
#################################

TBG_wallTime="0:30:00"

TBG_devices_x=2
TBG_devices_y=2
TBG_devices_z=2

TBG_gridSize="192 2048 160"
TBG_steps="4000"

TBG_restartLoop="--checkpoint.restart.loop 1"

#################################
## Section: Optional Variables ##
#################################

TBG_isaac="--isaac.width 1280 --isaac.height 720 --isaac.period 1  --isaac.name !TBG_jobName  --isaac.url apps.marble.ccs.ornl.gov  --isaac.port 30167"

TBG_plugins="!TBG_isaac"

#################################
## Section: Program Parameters ##
#################################

TBG_deviceDist="!TBG_devices_x !TBG_devices_y !TBG_devices_z"

TBG_programParams="-d !TBG_deviceDist \
                   -g !TBG_gridSize   \
                   -s !TBG_steps      \
                   !TBG_restartLoop  \
                   !TBG_plugins      \
                   --versionOnce"

# TOTAL number of devices
TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))"

"$TBG_cfgPath"/submitAction.sh

PIConGPU throws the following errors:

$ cat stderr.725795
[a02n05:79941] *** Process received signal ***
[a02n05:79941] Signal: Segmentation fault (11)
[a02n05:79941] Signal code: Address not mapped (1)
[a02n05:79941] Failing at address: 0x12a000000008
[a18n18:153800] *** Process received signal ***
[a18n18:153800] Signal: Segmentation fault (11)
[a18n18:153800] Signal code: Address not mapped (1)
[a18n18:153800] Failing at address: 0x25a900000008
[a18n18:153800] [ 0] [0x2000000504d8]
[a18n18:153800] [ 1] [a02n05:79944] *** Process received signal ***
[a02n05:79944] Signal: Segmentation fault (11)
[a02n05:79944] Signal code: Address not mapped (1)
[a02n05:79944] Failing at address: 0x4bea00000008
/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a18n18:153800] [ 2] [a02n05:79943] *** Process received signal ***
[a02n05:79943] Signal: Segmentation fault (11)
[a02n05:79943] Signal code: Address not mapped (1)
[a02n05:79943] Failing at address: 0x38e400000008
[a02n05:79943] [ 0] [0x2000000504d8]
[a02n05:79943] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79943] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a18n18:153800] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a18n18:153800] [ 4] [a02n05:79941] [ 0] [0x2000000504d8]
[a02n05:79941] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79941] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79941] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a18n18:153800] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a18n18:153800] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a18n18:153800] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a18n18:153800] *** End of error message ***
[a02n05:79944] [ 0] [0x2000000504d8]
[a02n05:79944] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79944] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79944] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79941] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79941] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79941] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79944] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79944] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79944] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79944] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79944] *** End of error message ***
/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79943] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79943] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79943] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79943] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79943] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79943] *** End of error message ***
/lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79941] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79941] *** End of error message ***
ERROR:  One or more process (first noticed rank 7) terminated with signal 11 (core dumped)

This is the output from the ISAAC server:

$  isaac --dump /gpfs/alpine/proj-shared/csc434/PIConGPU_ISAAC_SLATE_output &
[1] 15
sh-4.2$ Using web_port=2459, tcp_port=2458 and sim_port=2460

Running ISAAC Master
Starting insitu plugin listener
Launching WebSocketDataConnector
Launching TCPDataConnector
Launching SaveFileImageConnector
Launching JPEG_URI_Stream
New connection, giving id 0 (control)
Group complete, sending to connected interfaces
sh-4.2$ Connection 0 closed.
Removed group 0

For now, I will dump the ISAAC timers into files, but it would be great to get more insight by using a profiler.

FelixTUD commented 3 years ago

@psychocoderHPC I'm looking into it. An LWFA setup compiles without a problem on hemera with PIConGPU dev and ISAAC 1.5.2.

FelixTUD commented 3 years ago

I can reproduce an identical error with an MPI execution of the example; this should help me track down the problem.

FelixTUD commented 3 years ago

@benjha I might have found the error. You can try removing the line https://github.com/ComputationalRadiationPhysics/isaac/blob/c7e9ff9bafe9e65811fc116fe06d5db8a51f7c5e/lib/isaac.hpp#L3465 as a hotfix. I need to take a more detailed look at it later, as it seems that json_init_root is only initialized on the master node, which is why it segfaults on all other nodes on destruction. Let me know if this fixes it for now.
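
To illustrate the failure pattern described here (a pointer set only on the master rank but released on every rank at destruction), below is a minimal sketch. It is not the ISAAC code and not the change made for the fix: `JsonNode`, `releaseNode` and `Visualization` are hypothetical stand-ins, and the rank is passed in as a plain integer instead of being queried from MPI.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted JSON node
// (not the real ISAAC or JSON library type).
struct JsonNode
{
    int refcount = 1;
};

// Hypothetical stand-in for a "decref"-style release call. It dereferences the
// pointer, so an uninitialized pointer would be undefined behaviour here.
// Returns true if a reference was actually dropped.
inline bool releaseNode(JsonNode* node)
{
    if (node == nullptr)
        return false;  // nothing was created on this rank
    assert(node->refcount > 0);
    if (--node->refcount == 0)
        delete node;
    return true;
}

class Visualization
{
public:
    explicit Visualization(int rank, int masterRank = 0)
    {
        // Only the master rank builds the JSON root; other ranks keep nullptr.
        if (rank == masterRank)
            initRoot_ = new JsonNode{};
    }

    ~Visualization() { release(); }

    Visualization(const Visualization&) = delete;
    Visualization& operator=(const Visualization&) = delete;

    bool hasRoot() const { return initRoot_ != nullptr; }

    // Safe on every rank and idempotent; true only if something was released.
    bool release()
    {
        const bool released = releaseNode(initRoot_);
        initRoot_ = nullptr;
        return released;
    }

private:
    // The default member initializer is the important part: without it,
    // non-master ranks would hold an indeterminate pointer, and the
    // destructor would dereference it.
    JsonNode* initRoot_ = nullptr;
};
```

In this sketch, `Visualization(0)` owns a root and releases it once, while `Visualization(3)` never creates one and its destructor is a no-op. Removing the release call, as in the hotfix, avoids the crash in the same way, at the cost of not freeing the root on the master rank.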

benjha commented 3 years ago

Thanks @FelixTUD, it worked.

I am testing further...

FelixTUD commented 3 years ago

This should be fixed with #132