ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Piz Daint: CUDA memory error with random number #2357

Open PrometheusPi opened 7 years ago

PrometheusPi commented 7 years ago

When running the default LWFA example with version 0.3.1 of PIConGPU, the simulation fails during initialization when checkpoint writing is enabled.

I could reproduce this both with libSplash compiled against ADIOS and with libSplash compiled against parallel HDF5 only. However, writing HDF5 output via the plugin works just fine as long as checkpoints are not active.

I use the following modules on Piz Daint:

  1) modules/3.2.10.6
  2) eproxy/2.0.16-6.0.4.1_3.1__g001b199.ari
  3) gcc/5.3.0
  4) craype-haswell
  5) craype-network-aries
  6) craype/2.5.12
  7) cray-mpich/7.6.0
  8) slurm/17.02.7-1
  9) xalt/daint-2016.11
 10) cray-libsci/17.06.1
 11) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari
 12) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari
 13) pmi/5.0.12
 14) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari
 15) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari
 16) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari
 17) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari
 18) dvs/2.7_2.2.32-6.0.4.1_7.1__ged1923a
 19) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari
 20) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari
 21) atp/2.1.1
 22) perftools-base/6.5.1
 23) PrgEnv-gnu/6.0.4
 24) CMake/3.8.1
 25) cudatoolkit/8.0.61_2.4.3-6.0.4.0_3.1__gb475d12
 26) cray-hdf5-parallel/1.10.0.3

I built the additional libraries using @ax3l's script here (great tool :+1:). For the HDF5-only case, I removed the ADIOS library and rebuilt libSplash.

Configuring worked just fine. Compiling produced a massive amount of boost warnings.

The stderr when --checkpoints 5000 is not active:

[CUDA] Error: </.../PIConGPU/picongpu/src/libPMacc/include/simulationControl/SimulationHelper.hpp>:142
what():  [CUDA] Error: out of memory
terminate called after throwing an instance of 'std::runtime_error'

However, the simulation runs fine (see stdout):

Running program...
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 1048576
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time: 28sec 699msec = 28 sec
  0 % =        0 | time elapsed:             1sec 663msec | avg time per step:   0msec
  5 % =      500 | time elapsed:             7sec 502msec | avg time per step:  11msec
 10 % =     1000 | time elapsed:            13sec 565msec | avg time per step:  12msec
 15 % =     1500 | time elapsed:            21sec 462msec | avg time per step:  12msec
 20 % =     2000 | time elapsed:            27sec 534msec | avg time per step:  12msec
 25 % =     2500 | time elapsed:            35sec 354msec | avg time per step:  12msec
 30 % =     3000 | time elapsed:            41sec 435msec | avg time per step:  12msec
 35 % =     3500 | time elapsed:            49sec 218msec | avg time per step:  11msec
 40 % =     4000 | time elapsed:            55sec 218msec | avg time per step:  11msec
 45 % =     4500 | time elapsed:       1min  2sec 711msec | avg time per step:  11msec
 50 % =     5000 | time elapsed:       1min  8sec 778msec | avg time per step:  12msec
 55 % =     5500 | time elapsed:       1min 16sec 412msec | avg time per step:  11msec
 60 % =     6000 | time elapsed:       1min 22sec 415msec | avg time per step:  11msec
 65 % =     6500 | time elapsed:       1min 30sec  39msec | avg time per step:  11msec
 70 % =     7000 | time elapsed:       1min 36sec 119msec | avg time per step:  12msec
 75 % =     7500 | time elapsed:       1min 43sec 995msec | avg time per step:  11msec
 80 % =     8000 | time elapsed:       1min 50sec 135msec | avg time per step:  12msec
 85 % =     8500 | time elapsed:       1min 57sec 521msec | avg time per step:  11msec
 90 % =     9000 | time elapsed:       2min  3sec 471msec | avg time per step:  11msec
 95 % =     9500 | time elapsed:       2min 11sec  54msec | avg time per step:  11msec
100 % =    10000 | time elapsed:       2min 17sec  13msec | avg time per step:  11msec
calculation  simulation time:  2min 18sec 795msec = 138 sec
full simulation time:  2min 47sec 717msec = 167 sec

The stderr when --checkpoints 5000 is active:

[CUDA] Error: </.../PIConGPU/picongpu/src/libPMacc/include/eventSystem/Manager.tpp>:41
what():  [CUDA] Error: out of memory
terminate called after throwing an instance of 'std::runtime_error'

Here, the simulation dies during initialization (see stdout):

Running program...
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 1048576
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time: 28sec 670msec = 28 sec

In addition to the default command line arguments of the 32-GPU example, I only added --hdf5.period 1000 and --checkpoints 5000. All HDF5 files from the hdf5 plugin were written correctly.

I am not sure whether the memory errors actually cause the failure, because they also occur without checkpoints (just in a slightly different place). Any idea how to solve this issue? I think neither @HighIander's project nor the TWTS project (cc @BeyondEspresso and @steindev) will work without checkpoints.

PrometheusPi commented 7 years ago

Checking the HDF5 output of the hdf5 plugin, I found no particles, even though the default LWFA setup should contain particles.

# output of h5ls -r simData_0.h5
/                        Group
/data                    Group
/data/0                  Group
/data/0/fields           Group
/data/0/fields/B         Group
/data/0/fields/B/x       Dataset {128, 896, 128}
/data/0/fields/B/y       Dataset {128, 896, 128}
/data/0/fields/B/z       Dataset {128, 896, 128}
/data/0/fields/E         Group
/data/0/fields/E/x       Dataset {128, 896, 128}
/data/0/fields/E/y       Dataset {128, 896, 128}
/data/0/fields/E/z       Dataset {128, 896, 128}
/data/0/fields/e_chargeDensity Dataset {128, 896, 128}
/data/0/fields/e_energyDensity Dataset {128, 896, 128}
/data/0/fields/e_particleMomentumComponent Dataset {128, 896, 128}
/data/0/particles        Group
/data/0/particles/e      Group
/data/0/particles/e/charge Group
/data/0/particles/e/mass Group
/data/0/particles/e/momentum Group
/data/0/particles/e/momentum/x Dataset {NULL}
/data/0/particles/e/momentum/y Dataset {NULL}
/data/0/particles/e/momentum/z Dataset {NULL}
/data/0/particles/e/particlePatches Group
/data/0/particles/e/particlePatches/extent Group
/data/0/particles/e/particlePatches/extent/x Dataset {32}
/data/0/particles/e/particlePatches/extent/y Dataset {32}
/data/0/particles/e/particlePatches/extent/z Dataset {32}
/data/0/particles/e/particlePatches/numParticles Dataset {32}
/data/0/particles/e/particlePatches/numParticlesOffset Dataset {32}
/data/0/particles/e/particlePatches/offset Group
/data/0/particles/e/particlePatches/offset/x Dataset {32}
/data/0/particles/e/particlePatches/offset/y Dataset {32}
/data/0/particles/e/particlePatches/offset/z Dataset {32}
/data/0/particles/e/position Group
/data/0/particles/e/position/x Dataset {NULL}
/data/0/particles/e/position/y Dataset {NULL}
/data/0/particles/e/position/z Dataset {NULL}
/data/0/particles/e/positionOffset Group
/data/0/particles/e/positionOffset/x Dataset {NULL}
/data/0/particles/e/positionOffset/y Dataset {NULL}
/data/0/particles/e/positionOffset/z Dataset {NULL}
/data/0/particles/e/weighting Dataset {NULL}
/data/0/picongpu         Group
/data/0/picongpu/idProvider Group
/data/0/picongpu/idProvider/nextId Dataset {2, 8, 2}
/data/0/picongpu/idProvider/startId Dataset {2, 8, 2}
/header                  Group

PrometheusPi commented 7 years ago

The macro particle counter also reports zero particles.

PrometheusPi commented 7 years ago

With all debug output enabled, one can see that the error occurs after initialization, during the distribution of particles according to the density profile.

The error occurs at line 278 of picongpu/src/picongpu/include/particles/Particles.tpp:

PMACC_KERNEL( KernelFillGridWithParticles< Particles >{} )
    (mapper.getGridDim(), block)
    ( densityFunctor, positionFunctor, totalGpuCellOffset,
      this->particlesBuffer->getDeviceParticleBox( ), mapper );

The last verbose output message in stdout is

...
PIConGPUVerbose SIMULATION_STATE(16) | Starting simulation from timestep 0
PIConGPUVerbose SIMULATION_STATE(16) | Loading from default values finished
PMaccVerbose MEMORY(1) | DataConnector: sharing access to 'e' (1 uses)
PIConGPUVerbose SIMULATION_STATE(16) | initialize density profile for species e

PrometheusPi commented 7 years ago

This issue comes from the random number generator used during random position initialization. Using quiet start solves the issue.
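
For anyone who wants to reproduce the workaround: switching to quiet start amounts to selecting the quiet start-position functor instead of the random one for the species initialization. A rough sketch of the change (the file and type names below are only my recollection of the 0.3.x defaults and may differ in detail; treat them as placeholders):

/* speciesInitialization.param (hypothetical excerpt):
 * use the deterministic "quiet" in-cell start position instead of the
 * random one, so no per-particle random numbers have to be drawn on the
 * GPU while filling the grid according to the density profile.
 */
using InitPipeline = mpl::vector<
    CreateDensity<
        densityProfiles::Gaussian,   /* profile of the LWFA example (assumed) */
        startPosition::Quiet,        /* was: startPosition::Random */
        PIC_Electrons
    >
>;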

Thus @BeyondEspresso and @steindev, this will not be an issue for TWTS, since all random distributions will be done on the CPU beforehand.

@HighIander and @n01r: even when using quiet start, you will most likely encounter the same issue when using a probability-based ionization scheme.

@ax3l or @psychocoderHPC Is setting CUDA_ARCH to 60 correct for the Tesla P100?

I am a bit confused because the CSCS web site says they use an NVIDIA® Tesla® P100 16GB, but on this web page there is no such thing as a Tesla P100 - only a Pascal P100 (SM_60) and a Tesla V100 (SM_70, CUDA 9 only).

Okay, the Tesla P100-PCIE-16GB is the same thing, see here.

psychocoderHPC commented 7 years ago

Please increase the reserved (free) memory in the file memory.param. This should solve your issue. The reason is that the P100 is much more parallel than all previous GPUs (more SMs), so there is not enough memory left for the local memory (lmem) used during the RNG initialization.

sm_60 is correct for the P100.

PrometheusPi commented 7 years ago

@psychocoderHPC Thanks - setting reservedGpuMemorySize to twice its original value (now 350 * 1024 * 1024 * 2) solved the issue.
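
For anyone else who hits this: the fix boils down to a one-line edit in memory.param. A minimal sketch, assuming the 0.3.x layout of that file:

/* memory.param (excerpt, sketch): memory PIConGPU keeps free on each GPU
 * for temporary needs such as the local memory used by the RNG
 * initialization; doubled from the former 350 MiB value to accommodate
 * the higher thread parallelism of the P100.
 */
constexpr size_t reservedGpuMemorySize = 350 * 1024 * 1024 * 2; /* ~700 MiB */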

ax3l commented 7 years ago

We should really integrate the "legacy" RNG used at startup into our new state-aware RNG implementation, to dramatically reduce the extra memory that has to be reserved in memory.param.

psychocoderHPC commented 7 years ago

This will not help; the RNG initialization still needs lmem. I am currently thinking about compiling for all architectures, checking the lmem usage by hand, and then keeping as much memory free as the worst-case architecture needs, multiplied by the number of SMs times the maximum number of parallel blocks per SM.

ax3l commented 7 years ago

I am reopening this issue until a more generic solution is found.

psychocoderHPC commented 7 years ago

Self-answer to my post https://github.com/ComputationalRadiationPhysics/picongpu/issues/2357#issuecomment-342004101: it is not feasible to check the lmem usage for all kernels and then multiply by the maximum number of hardware threads per GPU. The reason is that a P100 can keep 2048 threads resident per multiprocessor and contains 56 SMs, so the worst-case reservation would claim a large fraction of the GPU memory.
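
To make the scale concrete, a back-of-the-envelope calculation (the 2048 threads per SM and the 56 SMs are the figures from above; the local-memory amount per thread is a made-up placeholder, not a measured value for any PIConGPU kernel):

#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t threadsPerSm  = 2048;       // max resident threads per P100 SM
    const std::size_t numSm         = 56;         // multiprocessors on a P100
    const std::size_t lmemPerThread = 4 * 1024;   // hypothetical worst-case lmem per thread

    // worst case: every hardware thread on every SM needs its own lmem slot
    const std::size_t worstCase = threadsPerSm * numSm * lmemPerThread;
    std::printf("worst-case lmem reservation: %zu MiB\n",
                worstCase / (1024 * 1024));       // 448 MiB with this placeholder
    return 0;
}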