ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Stack frame hunt #3870

Open sbastrakov opened 2 years ago

sbastrakov commented 2 years ago

While working on #3860, we had a discussion with @psychocoderHPC and checked the stack frames produced when using 8th-order (4 neighbors) FDTD and the corresponding incident field. Besides the usual suspects (RNG init, png output), the FDTD kernel had a 336-byte stack frame and the incident field kernel a 240-byte stack frame, both with 0 bytes spill stores and 0 bytes spill loads. After looking a little into the implementation, we found that the constructor of AOFDTDWeights is not constexpr, and that its operator[] has a suspicious check which may also prevent it from being constexpr. So altogether it is not clear what happens with these weights inside the FDTD kernel: are they recalculated each time, stored in registers (or worse), or some combination of those?
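
To illustrate the constexpr concern, here is a minimal host-only sketch of a weights container whose constructor and `operator[]` are both usable in constant expressions. `StencilWeights` and `placeholderWeight` are hypothetical names, the formula is a placeholder rather than the real AOFDTD weights, and device annotations are left out; it assumes C++14 or newer.

```cpp
#include <cstddef>

// Hypothetical stand-in for a per-neighbor weights container.
// This is NOT PIConGPU's AOFDTDWeights, only an illustration.
template<std::size_t T_numNeighbors>
struct StencilWeights
{
    // constexpr constructor: the weights can be evaluated at compile time
    constexpr StencilWeights() : values{}
    {
        for(std::size_t i = 0; i < T_numNeighbors; ++i)
            values[i] = placeholderWeight(i);
    }

    // No runtime check in the accessor, so it stays usable in constant expressions
    constexpr float operator[](std::size_t const idx) const
    {
        return values[idx];
    }

private:
    // Placeholder formula (assumption), chosen only to produce distinct values
    static constexpr float placeholderWeight(std::size_t const i)
    {
        return 1.0f / static_cast<float>(2 * i + 1);
    }

    float values[T_numNeighbors];
};

// 4 neighbors, as in the 8th-order setup discussed above
constexpr StencilWeights<4> weights{};

// Compiles only if construction and access are constant expressions
static_assert(weights[0] == 1.0f, "weights are evaluated at compile time");
```

If the `static_assert` compiles, the compiler is at least able to fold the weights into constants; whether it actually does so inside a given kernel still has to be checked in the generated PTX.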

sbastrakov commented 2 years ago

cc @steindev

sbastrakov commented 2 years ago

As investigated by @psychocoderHPC, it may be due to PML internals and unrelated to the AOFDTD implementation; we misattributed it because we forgot that FDTD and PML share the same kernel template. To be further investigated.

Edit: indeed it was the PML functor used, not the normal FDTD one

sbastrakov commented 2 years ago

Commenting out this break, which is optional there (the function works either way), doesn't reduce the stack frame size of the kernel, but seems to considerably reduce its register use. Replacing it with return makes matters worse in that regard, and replacing the range-based for loop with a C-style one doesn't change anything.

sbastrakov commented 2 years ago

After some more investigation, the effect also depends on the CUDA version used. E.g., with CUDA 11.0 and CUDA 11.4, different kernels have non-zero stack frames for the same setup.

psychocoderHPC commented 2 years ago

Some more places with stack frames

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:99                     auto crossedBoundary = pmacc::DataSpace<simDim>::create(0);
        .loc    116 99 44, function_name $L__info_string842, inlined_at 113 74 29

///home/rwidera/workspace/picongpu/include/pmacc/../pmacc/dimensions/DataSpace.hpp:140                 tmp[i] = value;
        .loc    117 140 17, function_name $L__info_string602, inlined_at 116 99 44
        mov.u32         %r354, 0;
        st.local.u32    [%rd2], %r354;
        st.local.u32    [%rd2+4], %r354;
        st.local.u32    [%rd2+8], %r354;
$L__tmp9619:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:102                         if(offsetToTotalOrigin[d] < m_parameters.beginInternalCellsTotalAllBoundaries[d])
        .loc    116 102 53, function_name $L__info_string842, inlined_at 113 74 29
        setp.lt.s32     %p5, %r15, %r91;
        @%p5 bra        $L__BB33_7;
        bra.uni         $L__BB33_4;

$L__BB33_7:
        .loc    116 0 53
        mov.u32         %r354, -1;
$L__tmp9620:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:103                             crossedBoundary[d] = -1;
        .loc    116 103 29, function_name $L__info_string842, inlined_at 113 74 29
        st.local.u32    [%rd2], %r354;
        bra.uni         $L__BB33_8;
$L__tmp9621:

$L__BB33_4:

///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:104                         else if(offsetToTotalOrigin[d] >= m_parameters.endInternalCellsTotalAllBoundaries[d])
        .loc    116 104 59, function_name $L__info_string842, inlined_at 113 74 29
        setp.lt.s32     %p6, %r15, %r94;
        @%p6 bra        $L__BB33_6;
        bra.uni         $L__BB33_5;
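
The `st.local` stores in the PTX above write `crossedBoundary` to local memory. A common cause is an array written at an index that the compiler cannot resolve at compile time. The sketch below shows one possible way to avoid indexed writes by computing each component as a value. `Int3`, `crossingDirection` and the free function `crossedBoundary` are hypothetical names, and the `+1` for the upper boundary is an assumption, since the dump only shows the `-1` branch. Whether this removes the stack frame in the real kernel would need to be verified.

```cpp
// Hypothetical 3D integer vector with named members, so no indexed stores are needed
struct Int3
{
    int x;
    int y;
    int z;
};

// Returns -1 below the lower bound, +1 at or above the upper bound (assumed), 0 otherwise
constexpr int crossingDirection(int const offset, int const begin, int const end)
{
    return offset < begin ? -1 : (offset >= end ? 1 : 0);
}

// Builds the result component by component instead of writing tmp[d] in a loop
constexpr Int3 crossedBoundary(Int3 const offset, Int3 const begin, Int3 const end)
{
    return Int3{
        crossingDirection(offset.x, begin.x, end.x),
        crossingDirection(offset.y, begin.y, end.y),
        crossingDirection(offset.z, begin.z, end.z)};
}
```

For example, `crossedBoundary(Int3{0, 5, 9}, Int3{1, 1, 1}, Int3{9, 9, 9})` yields `-1`, `0` and `1` for the three components.
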

psychocoderHPC commented 2 years ago

With the current dev branch, I observed stack frames in kernelMoveAndMark with the SPEC benchmark when using the particle shape PQS:

ptxas info    : Compiling entry function '_ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEE
EvNS_3VecIT0_T1_EET2_DpT3_' for 'sm_70'
ptxas info    : Function properties for _ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEv
NS_3VecIT0_T1_EET2_DpT3_
    160 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 27664 bytes smem, 512 bytes cmem[0], 16 bytes cmem[2]

psychocoderHPC commented 2 years ago

Here is some more information on why it is important to remove all stack frame usage: https://stackoverflow.com/a/7816434. It is not only about performance: stack frames also require additional global memory at runtime. By default, PIConGPU keeps only 300 MiB of device memory free, so executing a kernel that uses stack frames can run out of memory at runtime.
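
As a rough back-of-the-envelope sketch of the memory cost, assume the driver reserves stack space for every thread that can be resident on the device at once. The function name `estimateStackReservation` is hypothetical, and the device numbers assume a V100 (sm_70) with 80 SMs and up to 2048 resident threads per SM. The actual reservation policy is up to the CUDA driver, so treat the result as an estimate only.

```cpp
#include <cstddef>

// Estimated device memory needed for stack frames, under the assumption that
// every potentially resident thread gets its own stack frame.
constexpr std::size_t estimateStackReservation(
    std::size_t const stackBytesPerThread,
    std::size_t const maxThreadsPerSm,
    std::size_t const numSms)
{
    return stackBytesPerThread * maxThreadsPerSm * numSms;
}

// Assumed device: V100 (sm_70), 80 SMs, up to 2048 resident threads per SM
constexpr std::size_t v100ThreadsPerSm = 2048;
constexpr std::size_t v100NumSms = 80;

// The 160-byte stack frame reported for the move-and-mark kernel: 25 MiB
static_assert(
    estimateStackReservation(160, v100ThreadsPerSm, v100NumSms) == 25u * 1024u * 1024u,
    "160 bytes per thread add up to 25 MiB");
```

Under the same assumptions, the 336-byte frame mentioned at the top of this issue comes to about 52.5 MiB, a noticeable share of the 300 MiB kept free by default.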

steindev commented 2 years ago

@sbastrakov @psychocoderHPC Any progress or plans for progress here?

psychocoderHPC commented 2 years ago

There are still some kernels (e.g. boundary algorithms) using stack frames that we should fix. There is no fixed plan for when this will be done.

sbastrakov commented 2 years ago

@psychocoderHPC could you write here the commands to get the stack frame and register information? Both for me, as I've forgotten, and as documentation in case someone else needs it.

psychocoderHPC commented 2 years ago

> @psychocoderHPC could you write here the commands to get the stack frame and register information? Both for me, as I've forgotten, and as documentation in case someone else needs it.

pic-build -f -c "-Dalpaka_CUDA_SHOW_REGISTER=ON -Dalpaka_CUDA_KEEP_FILES=ON -Dalpaka_CUDA_SHOW_CODELINES=ON" 2>&1 | tee reg.txt
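
The resulting `reg.txt` can get long, so a small filter helps to list only the kernels with a non-zero stack frame. Below is a minimal sketch that assumes the log contains ptxas lines of the form shown earlier in this issue (`Function properties for ...` followed by `N bytes stack frame, ...`). `StackFrameEntry` and `findStackFrames` are hypothetical names and not part of PIConGPU.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct StackFrameEntry
{
    std::string function; // mangled kernel name as printed by ptxas
    std::size_t bytes; // reported stack frame size in bytes
};

// Scans a ptxas log and returns all functions with a non-zero stack frame
inline std::vector<StackFrameEntry> findStackFrames(std::istream& log)
{
    std::string const propsMarker = "Function properties for ";
    std::string const frameMarker = " bytes stack frame";
    std::vector<StackFrameEntry> result;
    std::string line;
    std::string currentFunction;
    while(std::getline(log, line))
    {
        // Remember the most recent function name
        auto const propsPos = line.find(propsMarker);
        if(propsPos != std::string::npos)
        {
            currentFunction = line.substr(propsPos + propsMarker.size());
            continue;
        }
        auto const framePos = line.find(frameMarker);
        if(framePos == std::string::npos)
            continue;
        // The byte count sits directly before the marker, after leading whitespace
        std::istringstream number(line.substr(0, framePos));
        std::size_t bytes = 0;
        if(number >> bytes && bytes > 0)
            result.push_back({currentFunction, bytes});
    }
    return result;
}
```

To use it on the build log, open `reg.txt` with a `std::ifstream` and pass it to `findStackFrames`. If the log wraps a long mangled name over several lines, only the first part of the name is captured.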