ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

CUDA 9.2 Double Precision + PCS on P100 #2816

Closed ax3l closed 5 years ago

ax3l commented 5 years ago

Building the example setup in #2815, case 1 & 3 (double-precision Esirkepov or EmZ with PCS) on Hemera, using default modules, fails compiling with:

ptxas error   : Entry function '_ZN6alpaka4exec4cuda6detail10cudaKernelISt17integral_constantImLm3EEjN5cupla11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSA_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESH_NSG_IiLi4EEEEENSE_INSG_IiLi2EEESK_SK_EESL_EEEEEEJNSA_12ParticlesBoxINSA_5FrameINSA_15ParticlesBufferINSA_19ParticleDescriptionINSA_11compileTime6StringIJLc112ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEESJ_N5boost3mpl6v_itemINS8_24placeholder_definition309weightingENSY_INS8_24placeholder_definition288momentumENSY_INS8_24placeholder_definition258positionINS8_24placeholder_definition2712position_picENSA_24placeholder_definition2213pmacc_isAliasEEENSX_7vector0INSF_2naEEELi0EEELi0EEELi0EEENSY_INS8_24placeholder_definition5212densityRatioINS8_25placeholder_definition13021DensityRatioPositronsES18_EENSY_INS8_24placeholder_definition5111chargeRatioINS8_25placeholder_definition12920ChargeRatioPositronsES18_EENSY_INS8_24placeholder_definition509massRatioINS8_25placeholder_definition12818MassRatioPositronsES18_EENSY_INS8_24placeholder_definition467currentINS8_13currentSolver9EsirkepovINS8_9particles6shapes3PCSELj3EEES18_EENSY_INS8_24placeholder_definition4513interpolationINS8_28FieldToParticleInterpolationIS21_NS8_30AssignedTrilinearInterpolationEEES18_EENSY_INS8_24placeholder_definition385shapeIS21_S18_EENSY_INS8_24placeholder_definition3914particlePusherINS1Z_6pusher5BorisES18_EES1C_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSA_17HandleGuardRegionINSA_9particles8policies17ExchangeParticlesENS1Z_8boundary29CallPluginsAndDeleteParticlesEEES1C_S1C_EESJ_N8mallocMC9AllocatorINS2X_16CreationPolicies7ScatterINS8_16DeviceHeapConfigENS2Z_11ScatterConf27DefaultScatterHashingParamsEEENS2X_20DistributionPolicies4NoopENS2X_11OOMPolici
es10ReturnNullENS2X_19ReservePoolPolicies16SimpleCudaMallocENS2X_17AlignmentPolicies6ShrinkINS3B_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSS_ISV_SJ_NSY_INSA_24placeholder_definition249multiMaskENSY_INSA_24placeholder_definition2312localCellIdxES1F_Li0EEELi0EEES2O_S2V_S1C_NSY_INSA_12NextFramePtrINSF_3argILi1EEEEENSY_INSA_16PreviousFramePtrIS3S_EES1C_Li0EEELi0EEEEEEENS2X_19AllocatorHandleImplIS3G_EELj3EEENSA_7DataBoxINSA_10PitchedBoxINSC_6VectorIdLi3ENSC_16StandardAccessorENSC_17StandardNavigatorENSC_6detail17Vector_componentsEEELj3EEEEES4C_jNS8_20PushParticlePerFrameIS2G_SJ_S28_EENSA_11AreaMappingILj3ENSA_18MappingDescriptionILj3ESJ_EEEEEEEvNS_3vec3VecIT_T0_EET1_DpT2_' uses too much shared data (0xd810 bytes, 0xc000 max)
ptxas error   : Entry function '_ZN6alpaka4exec4cuda6detail10cudaKernelISt17integral_constantImLm3EEjN5cupla11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSA_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESH_NSG_IiLi4EEEEENSE_INSG_IiLi2EEESK_SK_EESL_EEEEEEJNSA_12ParticlesBoxINSA_5FrameINSA_15ParticlesBufferINSA_19ParticleDescriptionINSA_11compileTime6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEESJ_N5boost3mpl6v_itemINS8_24placeholder_definition309weightingENSY_INS8_24placeholder_definition288momentumENSY_INS8_24placeholder_definition258positionINS8_24placeholder_definition2712position_picENSA_24placeholder_definition2213pmacc_isAliasEEENSX_7vector0INSF_2naEEELi0EEELi0EEELi0EEENSY_INS8_24placeholder_definition5111chargeRatioINS8_25placeholder_definition12720ChargeRatioElectronsES18_EENSY_INS8_24placeholder_definition509massRatioINS8_25placeholder_definition12618MassRatioElectronsES18_EENSY_INS8_24placeholder_definition467currentINS8_13currentSolver9EsirkepovINS8_9particles6shapes3PCSELj3EEES18_EENSY_INS8_24placeholder_definition4513interpolationINS8_28FieldToParticleInterpolationIS1W_NS8_30AssignedTrilinearInterpolationEEES18_EENSY_INS8_24placeholder_definition385shapeIS1W_S18_EENSY_INS8_24placeholder_definition3914particlePusherINS1U_6pusher5BorisES18_EES1C_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSA_17HandleGuardRegionINSA_9particles8policies17ExchangeParticlesENS1U_8boundary29CallPluginsAndDeleteParticlesEEES1C_S1C_EESJ_N8mallocMC9AllocatorINS2R_16CreationPolicies7ScatterINS8_16DeviceHeapConfigENS2T_11ScatterConf27DefaultScatterHashingParamsEEENS2R_20DistributionPolicies4NoopENS2R_11OOMPolicies10ReturnNullENS2R_19ReservePoolPolicies16SimpleCudaMallocENS2R_17AlignmentPolicies6ShrinkINS35_12ShrinkConfig19Defa
ultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSS_ISV_SJ_NSY_INSA_24placeholder_definition249multiMaskENSY_INSA_24placeholder_definition2312localCellIdxES1F_Li0EEELi0EEES2I_S2P_S1C_NSY_INSA_12NextFramePtrINSF_3argILi1EEEEENSY_INSA_16PreviousFramePtrIS3M_EES1C_Li0EEELi0EEEEEEENS2R_19AllocatorHandleImplIS3A_EELj3EEENSA_7DataBoxINSA_10PitchedBoxINSC_6VectorIdLi3ENSC_16StandardAccessorENSC_17StandardNavigatorENSC_6detail17Vector_componentsEEELj3EEEEES46_jNS8_20PushParticlePerFrameIS2B_SJ_S23_EENSA_11AreaMappingILj3ENSA_18MappingDescriptionILj3ESJ_EEEEEEEvNS_3vec3VecIT_T0_EET1_DpT2_' uses too much shared data (0xd810 bytes, 0xc000 max)

The two single-precision cases from the setup in #2815 build fine.

Is this something we forgot to tune, or should I just reduce the supercell size to free up shared memory per block? Interesting that it is the pusher that fails.

cc @psychocoderHPC @sbastrakov

psychocoderHPC commented 5 years ago

"uses too much shared data (0xd810 bytes, 0xc000 max)" at the end of each error message is the relevant part. Your supercell is too large and consumes too much shared memory in the move-and-mark kernel.

You cannot use more than 48 KiB (0xc000 = 49152 bytes) of shared memory per supercell.

ax3l commented 5 years ago

Ok, just as I expected. So basically I will reduce the supercell since doubles waste too much shared mem.

Offline discussion: also it makes sense the pusher overflows first. It needs E and B, and current deposition only J, in shared mem.
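The numbers in the ptxas message can be roughly reproduced by hand. The sketch below assumes what the mangled SuperCellDescription in the error suggests: an 8x8x4 supercell with a 2-cell margin on each side, and one 3-component vector per cell for each cached field. The helper `fieldCacheBytes` is hypothetical and not part of PIConGPU.

```cpp
#include <cstddef>

// Hypothetical helper (not PIConGPU API): bytes of shared memory needed to
// cache `numFields` 3-component vector fields over one supercell plus a
// margin of `margin` cells on every side.
constexpr std::size_t fieldCacheBytes(
    std::size_t nx, std::size_t ny, std::size_t nz,
    std::size_t margin,
    std::size_t bytesPerComponent,
    std::size_t numFields)
{
    return (nx + 2 * margin) * (ny + 2 * margin) * (nz + 2 * margin)
        * 3 * bytesPerComponent * numFields;
}

// Limit reported by ptxas: 0xc000 bytes = 48 KiB.
constexpr std::size_t sharedMemLimit = 0xc000;

// Pusher: E and B cached (2 fields). Assumed 8x8x4 supercell, margin 2.
constexpr std::size_t pusherDouble = fieldCacheBytes(8, 8, 4, 2, sizeof(double), 2);
constexpr std::size_t pusherFloat = fieldCacheBytes(8, 8, 4, 2, sizeof(float), 2);
// Current deposition: only J cached (1 field).
constexpr std::size_t currentDouble = fieldCacheBytes(8, 8, 4, 2, sizeof(double), 1);

static_assert(pusherDouble == 0xd800, "two double-precision field caches: 55296 bytes");
static_assert(pusherDouble > sharedMemLimit, "double-precision pusher exceeds 48 KiB");
static_assert(pusherFloat <= sharedMemLimit, "single-precision pusher fits");
static_assert(currentDouble <= sharedMemLimit, "double-precision J cache alone fits");
```

Under these assumptions the two double-precision caches take 0xd800 bytes, 16 bytes short of the reported 0xd810; the remainder is presumably other shared variables in the kernel. The same estimate gives 27648 bytes for single precision and for a lone J cache in double precision, which is consistent with only the double-precision pusher failing.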

sbastrakov commented 5 years ago

Not sure how useful this is, but just to float an idea: if this is the first kernel to run out of shared memory as supercells grow, we could add a manual compile-time check that the two shared boxes for the fields fit. If they don't, show a more descriptive error message (something along the lines of "you might want to decrease the supercell size"). Of course that is not fully precise, but it may cover the majority of cases, and we always have the standard error as a fallback.
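A minimal sketch of what such a check could look like, using a plain `static_assert`. All names here (`SuperCellSize`, `CheckPusherSharedMem`, the margin parameter) are hypothetical stand-ins rather than PIConGPU's actual types, and the 48 KiB limit is hard-coded from the ptxas message instead of being queried per architecture.

```cpp
#include <cstddef>

// Hypothetical compile-time supercell extent (stand-in for the real type).
template<std::size_t X, std::size_t Y, std::size_t Z>
struct SuperCellSize
{
    static constexpr std::size_t x = X;
    static constexpr std::size_t y = Y;
    static constexpr std::size_t z = Z;
};

// Hypothetical check: do the E and B shared-memory boxes fit into 48 KiB?
// T_margin is the assumed number of extra cells cached on each side.
template<typename T_SuperCell, std::size_t T_margin, typename T_Float>
struct CheckPusherSharedMem
{
    static constexpr std::size_t limit = 0xc000; // assumed limit, from ptxas

    // cells in the cached box * 3 components * bytes per value * 2 fields
    static constexpr std::size_t required
        = (T_SuperCell::x + 2 * T_margin) * (T_SuperCell::y + 2 * T_margin)
        * (T_SuperCell::z + 2 * T_margin) * 3 * sizeof(T_Float) * 2;

    static constexpr bool fits = required <= limit;

    static_assert(
        fits,
        "E and B field caches of the particle pusher exceed 48 KiB of shared "
        "memory: you might want to decrease the supercell size.");
};

// Instantiating the check for a configuration that fits compiles silently.
using SinglePrecisionCheck = CheckPusherSharedMem<SuperCellSize<8, 8, 4>, 2, float>;
static_assert(SinglePrecisionCheck::fits, "single precision fits");
```

Instantiating `CheckPusherSharedMem<SuperCellSize<8, 8, 4>, 2, double>` would instead stop the build with the readable message above, well before ptxas prints the mangled kernel name. As noted, the estimate ignores any other shared variables in the kernel, so the ptxas error remains the fallback.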