E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Build fail in MPAS with nvidiagpu compiler #6470

Open ndkeen opened 3 weeks ago

ndkeen commented 3 weeks ago

The test SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu has been failing for a while now. I think I mentioned this to @jonbob, who said the failure dates matched a PR that had recently gone in. I thought I had opened an issue, but maybe not.

 0 inform,   0 warnings,   1 severes, 0 fatal for ocn_diagnostics_variables_destroy
Target CMakeFiles/ocn.dir/__/__/core_ocean/shared/mpas_ocn_diagnostics_variables.f90.o built in 0.444529 seconds
gmake[2]: *** [mpas-framework/src/CMakeFiles/ocn.dir/build.make:918: mpas-framework/src/CMakeFiles/ocn.dir/__/__/core_ocean/shared/mpas_ocn_diagnostics_variables.f90.o] Error 2
gmake[2]: *** Waiting for unfinished jobs....
ocn_equation_of_state_linear_density_only:
    181, Generating present(tracerssurfacelayervalue(:,:),density(:,:))
         Generating NVIDIA GPU code
        183, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        184,   ! blockidx%x threadidx%x collapsed
    198, Generating present(tracers(:,:,:),density(:,:))
         Generating NVIDIA GPU code
        200, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        201,   ! blockidx%x threadidx%x collapsed
Target CMakeFiles/ocn.dir/__/__/core_ocean/shared/mpas_ocn_equation_of_state_jm.f90.o built in 2.100368 seconds
ocn_equation_of_state_linear_density_exp:
    315, Generating present(thermalexpansioncoeff(:,:),tracerssurfacelayervalue(:,:),density(:,:),salinecontractioncoeff(:,:))
         Generating NVIDIA GPU code
        320, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        321,   ! blockidx%x threadidx%x collapsed
    341, Generating present(tracers(:,:,:),thermalexpansioncoeff(:,:),density(:,:),salinecontractioncoeff(:,:))
         Generating NVIDIA GPU code
        346, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
        347,   ! blockidx%x threadidx%x collapsed
ocn_equation_of_state_wright_density_only:
    211, Generating enter data create(boussinesqpres(:,:),tracertemp(:,:),tracersalt(:,:))
    226, Generating present(boussinesqpres(:,:),tracersalt(:,:),tracertemp(:,:),density(:,:))

xylar commented 3 weeks ago

I built this myself, and I don't think the output above is relevant (as far as I can tell, it just comes from the parallel build getting killed). The relevant output is:

NVFORTRAN-S-0038-Symbol, topographic_wave_drag, has not been explicitly declared (/pscratch/sd/x/xylar/e3sm_scratch/pm-gpu/SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu.20240613_020012_785h6a/bld/cmake-bld/core_ocean/shared/mpas_ocn_diagnostics_variables.f90: 1023)

This appears to be caused by https://github.com/E3SM-Project/E3SM/pull/6310, which removed the topographic_wave_drag field but missed the OpenACC directive on that line.
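Since the real error is easy to lose among the output of the other parallel build jobs, here is a minimal sketch of a filter that pulls only the compiler's severe diagnostics out of a build log. It is not part of E3SM or CIME: the function name `severe_diagnostics` and the sample log are made up for illustration, the message layout is inferred only from the NVFORTRAN-S line quoted above, the `F` (fatal) severity letter is an assumption, and the path in the sample is deliberately shortened.

```python
import re

# Layout assumed from the message quoted in this thread:
#   NVFORTRAN-<severity>-<code>-<message> (<path>: <line>)
# "S" (severe) appears in the log above; "F" (fatal) is an assumption.
DIAG = re.compile(
    r"^NVFORTRAN-(?P<sev>[SF])-(?P<code>\d+)-(?P<msg>.*?)\s*"
    r"\((?P<path>[^()]+):\s*(?P<line>\d+)\)\s*$"
)


def severe_diagnostics(log_text):
    """Return one dict per severe/fatal NVFORTRAN diagnostic in log_text."""
    found = []
    for raw in log_text.splitlines():
        match = DIAG.match(raw.strip())
        if match:
            found.append(
                {
                    "severity": match.group("sev"),
                    "code": match.group("code"),
                    "message": match.group("msg"),
                    "path": match.group("path"),
                    "line": int(match.group("line")),
                }
            )
    return found


# Hypothetical, shortened log excerpt for illustration only.
SAMPLE_LOG = """\
gmake[2]: *** Waiting for unfinished jobs....
ocn_equation_of_state_linear_density_only:
    181, Generating present(tracerssurfacelayervalue(:,:),density(:,:))
NVFORTRAN-S-0038-Symbol, topographic_wave_drag, has not been explicitly declared (bld/core_ocean/shared/mpas_ocn_diagnostics_variables.f90: 1023)
"""

if __name__ == "__main__":
    for diag in severe_diagnostics(SAMPLE_LOG):
        print(f'{diag["path"]}:{diag["line"]}: [{diag["code"]}] {diag["message"]}')
```

Running something like this over the full build log should leave only the lines worth reading first; the OpenACC "Generating ..." chatter and the gmake messages are dropped.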

xylar commented 3 weeks ago

After fixing the above, I'm now seeing:

NVFORTRAN-S-1061-Procedures called in a compute region must have acc routine information - ocn_subgrid_ssh_lookup (/pscratch/sd/x/xylar/e3sm_scratch/pm-gpu/SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu.20240613_024017_46es2q/bld/cmake-bld/core_ocean/shared/mpas_ocn_diagnostics.f90: 2307)
/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin/cmake -E cmake_copy_f90_mod mpas-framework/src/ocn_tracer_advection_mono.mod mpas-framework/src/CMakeFiles/ocn.dir/ocn_tracer_advection_mono.mod.stamp NVHPC
ocn_diagnostic_solve_z_coordinates:
   2307, Accelerator restriction: call to 'ocn_subgrid_ssh_lookup' with no acc routine information

xylar commented 3 weeks ago

This next issue seems to have been introduced by https://github.com/E3SM-Project/E3SM/pull/6288, and it's going to be more of a challenge to address. It seems to be caused by calling ocn_subgrid_ssh_lookup inside an OpenACC loop without the required directives; the compiler is asking for acc routine information for that procedure.

xylar commented 3 weeks ago

@sbrus89, I made #6471 to fix the first issue. Could you make a PR to fix the second one?

xylar commented 3 weeks ago

Separate PRs probably make sense for these issues because they're unrelated to each other, but we won't be able to test either fix on its own because the test isn't currently compiling.

sbrus89 commented 1 week ago

@ndkeen, this appears to be fixed now: https://my.cdash.org/tests/175231189