ESCOMP / CAM

Community Atmosphere Model

Unable to build CAM6_3_125 with nvhpc compiler #881

Open sjsprecious opened 1 year ago

sjsprecious commented 1 year ago

What happened?

I tried to build cam6_3_125 on Derecho with the nvhpc/23.5 compiler (CPU case), but it failed with the following error message:

ILM error: internal routine gets bad address for outer variable ._dtInit2346
NVFORTRAN-F-0000-Internal compiler error. Errors in ILM file       1  (/glade/derecho/scratch/sunjian/CAM6_3_125/src/physics/cam/aerosol_optics_cam.F90)

I could reproduce a similar error on Casper with the nvhpc/22.2 compiler (CPU case).

In contrast, I can build cam6_3_124 successfully on Derecho with the nvhpc/23.5 compiler (CPU case). I wonder whether the error above is caused by a compiler bug or a code bug from CAM.

What are the steps to reproduce the bug?

  1. ./create_newcase --case /glade/derecho/scratch/sunjian/cam6/case/F2000dev.f19_f19_mg17.derecho.nvhpc --mach derecho --compiler nvhpc --mpilib mpich --compset F2000dev --res f19_f19_mg17 --walltime 01:00:00 --run-unsupported --queue main
  2. cd /glade/derecho/scratch/sunjian/cam6/case/F2000dev.f19_f19_mg17.derecho.nvhpc
  3. ./case.setup
  4. ./case.build

What CAM tag were you using?

cam6_3_125

What machine were you running CAM on?

Other (please explain below)

What compiler were you using?

NVHPC

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

No

Extra info

I used Derecho and Casper for testing.

cponder commented 1 year ago

Can you try with the 23.7 compiler? I could file a bug against the NVHPC compiler, but I need to know that it isn't already fixed.

sjsprecious commented 1 year ago

Hi @cponder , I tried nvhpc/23.7 on Derecho and still got the same error.

jedwards4b commented 1 year ago

@sjsprecious note that the file in question, src/physics/cam/aerosol_optics_cam.F90, was added to CAM in tag cam6_3_125.

sjsprecious commented 1 year ago

Thanks @jedwards4b . That is a good point. I will look at that file and see if I find anything suspicious. What remains unclear to me is whether this is indeed a compiler bug in NVHPC, or a code bug in CAM that is caught by nvhpc but unfortunately not by intel.

jedwards4b commented 1 year ago

It's an internal compiler error, so even if there is something wrong with the code, the compiler can't figure out what it is; there is still a problem in the compiler that needs to be reported to NVIDIA. I tried reducing the optimization of this file to -O0 and still got the same error. Then I tried selectively commenting out portions of the source code in the file and found that commenting out lines 834-941 allows the compiler to complete compilation of the rest of the file.

Here is a patch showing the location of the commented code:


diff  SourceMods/src.cam/aerosol_optics_cam.F90 ~/sandboxes/cesm2_x_alpha/components/cam/src/physics/cam/aerosol_optics_cam.F90 
834c834
< #ifdef DOTHIS
---
> 
837a838
> 
838a840
> 
941d942
< #endif
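
For reference, here is a minimal sketch of the guard pattern that patch applies. The subroutine and its body are illustrative placeholders for lines 834-941 of aerosol_optics_cam.F90, not the actual CAM source; only the DOTHIS macro name comes from the patch:

    ! Sketch only: the guarded region stands in for lines 834-941 of
    ! aerosol_optics_cam.F90.
    subroutine suspect_region_sketch(tau)
       real, intent(inout) :: tau(:)
    #ifdef DOTHIS
       ! Compile with -DDOTHIS to include the suspect region and reproduce
       ! the internal compiler error; omit the flag to confirm the rest of
       ! the file compiles cleanly.
       tau = 0.5 * tau
    #endif
    end subroutine suspect_region_sketch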

cacraigucar commented 1 year ago

I suspect the NVHPC compiler. I repeated Jian's setup on izumi with both the NAG compiler (adding ./xmlchange DEBUG=TRUE to get past the CSTM compiler bug that occurs when it is not set) and the GNU compiler. In both cases I was able to compile successfully.

sjsprecious commented 1 year ago

Thanks @jedwards4b and @cacraigucar for your time and efforts! That is very helpful and it does seem like a compiler bug from NVHPC.

cponder commented 1 year ago

PGI says they'd need a reproducer - which would mean installing the whole framework, right?

jedwards4b commented 1 year ago

git clone https://github.com/ESCOMP/CAM
cd CAM
git checkout cam6_3_126
./manage_externals/checkout_externals
cd cime/scripts
./create_test SMS_Ln9.f19_f19_mg17.F2000dev.derecho_nvhpc.cam-outfrq9s

(the last step can only be done on Derecho; you would need to do a port to your nvhpc system to build there)

sjsprecious commented 1 year ago

Thanks @jedwards4b for providing the detailed instructions.

To @cponder : if you have ever built CAM on your NVIDIA cluster, then you just need to change Jim's last command to ./create_test SMS_Ln9.f19_f19_mg17.F2000dev.xxx_nvhpc.cam-outfrq9s, where xxx is the name of your NVIDIA cluster.

cponder commented 1 year ago

Can you attach the full log file of the build here? I'd like to sift through it to see if we can break out a minimal set of source files that reproduces this.

sjsprecious commented 1 year ago

@cponder Here is the full log file with the error message when using Jim's instructions on Derecho.

atm.bldlog.230919-210543.log

cponder commented 1 year ago

Can you please rebuild with -j 1 to keep the steps in order?

sjsprecious commented 1 year ago

@jedwards4b could you please let me know where I should add the -j 1 option as requested by Carl?

jedwards4b commented 1 year ago

./xmlchange GMAKE_J=1

sjsprecious commented 1 year ago

Thanks @jedwards4b .

To @cponder : here is the updated log file with the -j 1 option, as requested.

atm.bldlog.230920-084418.log

cponder commented 1 year ago

In the file

ccs_config/machines/Depends.nvhpc

this make-dependency is being duplicated:

68    mo_optical_props_kernels.o\
69    mo_rte_solver_kernels.o\
70    mo_optical_props_kernels.o\

causing this message to repeat four times:

/glade/derecho/scratch/sunjian/SMS_Ln9.f19_f19_mg17.F2000dev.derecho_nvhpc.cam-outfrq9s.20230919_205036_jf0r97/Depends.nvhpc:90: target 'mo_optical_props_kernels.o' given more than once in the same rule

Can you remove line 70 and see if the build gets any further? The same goes for this file

ccs_config/machines/Depends.nvhpc-gpu

at the same line-numbers.

sjsprecious commented 1 year ago

This duplication has been removed in the recent ccs_config_cesm tag (https://github.com/ESMCI/ccs_config_cesm/blob/main/machines/Depends.nvhpc#L68).

However, even after I removed this line manually in my case and rebuilt, I still got the same error. See the log file below:

atm.bldlog.230920-125219.log

We do not need to worry about the Depends.nvhpc-gpu file, as it is only used for a GPU build.

cponder commented 11 months ago

I have a software stack on NERSC/Perlmutter, based on NVHPC 23.9. The 23.9 compiler is located here if you want to use it:

    /global/cfs/cdirs/nvendor/nvidia/SHARE.perlmutter/Utils/PGI/23.9/CUDA-12.2.2.0_535.104.05_GCC-11.2.0

If you want to try the whole stack, which includes builds of OpenMPI and NetCDF etc., you can add these to your ~/.profile (and log off and back on to activate):

    export LMOD_REDIRECT=yes                # Send command output to STDOUT so it can pipe more easily.
    export LMOD_IGNORE_CACHE=1              # Try this for now, given that we're constantly updating.
    export LMOD_TMOD_FIND_FIRST=1           # Ignore assigned precedence and use path-ordering instead.
                                            # This is essential for PrgEnv's to adjust precedence.

    module use --prepend $SHAREDIR/Modules/Deprecated       # Assign these in reverse-order, to give
    module use --prepend $SHAREDIR/Modules/Legacy           # precedence to the most current.
    module use --prepend $SHAREDIR/Modules/Latest

    module use --append $SHAREDIR/Modules/Bundles           # New framework.
    module use --append $SHAREDIR/Modules/PrgEnv/*/*

Then run the commands

    module load PrgEnv/PGI+OpenMPI/2023-10-05
    module avail

which will show this at the top

    ------ /global/cfs/cdirs/nvendor/nvidia/SHARE.perlmutter/Modules/PrgEnv/PGI+OpenMPI/2023-10-05 -------
       bzip2/1.0.8           lapack/3.11.0             openmpi/5.0.0rc13        pmix/4.2.6
       cube-lib/4.8.1        libfabric/1.15.2.0 (L)    papi/6.0.0               pnetcdf/1.12.3
       cuda/12.2.2.0  (L)    libunwind/1.6.2           perfmon2/4.13.0          szip/2.1.1
       hdf5/1_14_2           netcdf-c/4.9.2            pgi/23.9          (L)    zlib/1.3
       hwloc/2.9.3           netcdf-f/4.6.1            pio/2_6_2

You can load whichever pieces you want from there. It will deal with the dependencies for you, so

    module purge
    module load PrgEnv/PGI+OpenMPI/2023-10-05
    module avail
    module load pio
    module list

will give you all of

    Currently Loaded Modules:
      1) PrgEnv/PGI+OpenMPI/2023-10-05         9) pmix/4.2.6
      2) cuda/12.2.2.0                 (g,c)  10) openmpi/5.0.0rc13 (mpi)
      3) gcc/11.2.0                    (c)    11) pnetcdf/1.12.3
      4) pgi/23.9                             12) hdf5/1_14_2
      5) szip/2.1.1                           13) netcdf-c/4.9.2
      6) zlib/1.3                             14) netcdf-f/4.6.1
      7) hwloc/2.9.3                          15) pio/2_6_2

Next month I can add a bundle for the 23.11 compiler, assuming I can increase the storage space in the project.

sjsprecious commented 11 months ago

Thanks Carl. I do not have access to the Perlmutter machine, but I could ask the CISL staff to help install it on Derecho. Do you recommend trying nvhpc/23.9 now, or waiting until nvhpc/23.11 is available? Also, does nvhpc/23.9 resolve this compiler bug (https://github.com/ESCOMP/CAM/issues/883)?

jedwards4b commented 10 months ago

@cponder I tried your suggestion on Perlmutter:

jedwards@perlmutter:login12:~/cesm2_x_alpha> export LMOD_REDIRECT=yes
jedwards@perlmutter:login12:~/cesm2_x_alpha> export LMOD_IGNORE_CACHE=1
jedwards@perlmutter:login12:~/cesm2_x_alpha> export LMOD_TMOD_FIND_FIRST=1 
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --prepend $SHAREDIR/Modules/Deprecated 
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --prepend $SHAREDIR/Modules/Legacy
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --prepend $SHAREDIR/Modules/Latest
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --append $SHAREDIR/Modules/Bundles   
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --append $SHAREDIR/Modules/PrgEnv/*/*
jedwards@perlmutter:login12:~/cesm2_x_alpha> module load PrgEnv/PGI+OpenMPI/2023-10-05
Lmod has detected the following error:  The following module(s) are unknown:
"PrgEnv/PGI+OpenMPI/2023-10-05"
areanddee commented 9 months ago

I isolated the #881 problem to the %re and %im complex real- and imaginary-part accessors in the aerosol_optics_cam_sw subroutine by replacing them with the old F77-style DBLE() and DIMAG() intrinsics. I created a reproducer, located at /glade/u/home/loft/pr881repro.tar.gz, that toggles between compiling and not compiling via the ifdef _RDLFIX, and I shared it with NVIDIA. I should point out that a very simple test of the %re/%im syntax works just fine under nvfortran 23.5, which suggests that the modern Fortran accessors are supported by nvfortran.

Subsequently, Jian Sun verified that CAM builds successfully at cam6_3_137 with nvhpc/23.5 on Derecho after replacing the aerosol_optics_cam.F90 code with the compiling version of the aerosol_optics_cam module. However, the simulation failed without a clear error message. The same problem exists for nvhpc/23.1 as well.

I propose opening a new issue for the failure to run correctly: perhaps NVIDIA will be able to track down a root cause of the internal compiler error that will also fix the run-time correctness issue.
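
For illustration, here is a minimal, self-contained sketch of the syntax swap described above (save it with a .F90 extension so the preprocessor runs, and toggle -D_RDLFIX). The variable names are hypothetical; only the %re/%im versus DBLE()/DIMAG() substitution and the _RDLFIX macro come from the reproducer:

    ! Hypothetical names; the accessor swap mirrors the actual change.
    program re_im_sketch
       implicit none
       integer, parameter :: dp = kind(1.0d0)
       complex(dp) :: refindex
       real(dp)    :: re_part, im_part

       refindex = (1.5_dp, -0.01_dp)
    #ifdef _RDLFIX
       ! Old F77-style intrinsics that sidestep the ICE:
       re_part = dble(refindex)
       im_part = dimag(refindex)
    #else
       ! F2008 complex-part accessors implicated in the ICE (this trivial
       ! use compiles fine, as noted above):
       re_part = refindex%re
       im_part = refindex%im
    #endif
       print *, re_part, im_part
    end program re_im_sketch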

sjsprecious commented 4 months ago

@fvitt @cacraigucar I just recalled this pending issue with the NVHPC compiler, and I found that the compilation error in aerosol_optics_cam.F90 only occurs when I am using the FV dycore. Is this code dycore-dependent?