sjsprecious opened this issue 1 year ago
Can you try with the 23.7 compiler? I could file a bug against the NVHPC compiler, but need to know that it isn't already fixed.
Hi @cponder , I tried nvhpc/23.7 on Derecho and still got the same error.
@sjsprecious note that the file in question: src/physics/cam/aerosol_optics_cam.F90 was added to cam in tag 6_3_125.
Thanks @jedwards4b . That is a good point. I will look at that file and see if I find anything suspicious. What remains unclear to me is whether this is indeed a compiler bug in NVHPC, or a code bug in CAM that nvhpc catches but intel unfortunately does not.
It's an internal compiler error - so even if there is something wrong with the code, the compiler can't figure out what it is, and thus there is still a problem in the compiler that needs to be reported to NVIDIA. I tried reducing the optimization of this file to -O0 and still got the same error. Then I tried selectively commenting out portions of the source code in the file and found that commenting out lines 834-941 allows the compiler to finish compiling the rest of the file.
Here is a patch showing the location of the commented code:
diff SourceMods/src.cam/aerosol_optics_cam.F90 ~/sandboxes/cesm2_x_alpha/components/cam/src/physics/cam/aerosol_optics_cam.F90
834c834
< #ifdef DOTHIS
---
>
837a838
>
838a840
>
941d942
< #endif
I suspect the NVHPC compiler. I repeated Jian's setup on izumi with both the NAG compiler (adding ./xmlchange DEBUG=TRUE to get past the CSTM compiler bug when it is not set) and the gnu compiler. In both cases I was able to compile successfully.
Thanks @jedwards4b and @cacraigucar for your time and efforts! That is very helpful and it does seem like a compiler bug from NVHPC.
PGI says they'd need a reproducer - which would mean installing the whole framework, right?
git clone https://github.com/ESCOMP/CAM
cd CAM
git checkout cam6_3_126
./manage_externals/checkout_externals
cd cime/scripts
./create_test SMS_Ln9.f19_f19_mg17.F2000dev.derecho_nvhpc.cam-outfrq9s
(the last step can only be done on derecho - you would need to do a port to the nvhpc system to build there)
Thanks @jedwards4b for providing the detailed instructions.
To @cponder : if you have ever built CAM on your NVIDIA cluster, then you would just need to change Jim's last command to ./create_test SMS_Ln9.f19_f19_mg17.F2000dev.xxx_nvhpc.cam-outfrq9s, where xxx is the name of your NVIDIA cluster.
Can you attach the full log-file of the build here? I'd like to sift through it to see if we can break-out a minimal set of source files that can reproduce this.
@cponder Here is the full log file with the error message when using Jim's instructions on Derecho.
Can you please re-build with -j 1 to keep the steps in order?
@jedwards4b could you please let me know where I should add the -j 1 option as requested by Carl?
./xmlchange GMAKE_J=1
Thanks @jedwards4b .
To @cponder : here is the updated log file with the -j 1 option as requested.
In the file ccs_config/machines/Depends.nvhpc, this make-dependency is being duplicated:
68 mo_optical_props_kernels.o\
69 mo_rte_solver_kernels.o\
70 mo_optical_props_kernels.o\
causing this message to repeat 4 times:
/glade/derecho/scratch/sunjian/SMS_Ln9.f19_f19_mg17.F2000dev.derecho_nvhpc.cam-outfrq9s.20230919_205036_jf0r97/Depends.nvhpc:90: target 'mo_optical_props_kernels.o' given more than once in the same rule
Can you remove line 70 and see if the build gets any further? The same goes for the file ccs_config/machines/Depends.nvhpc-gpu, at the same line numbers.
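A minimal stand-alone sketch of why the duplicated entry triggers that message (GNU make assumed; the file names are copied from the Depends fragment above, but Makefile.demo and the one-line rule are hypothetical, not the actual Depends file contents):

```shell
# Write a throwaway makefile whose single rule lists the same target
# twice, mirroring the duplicated mo_optical_props_kernels.o line.
# The semicolon-style recipe avoids needing a literal tab character.
cat > Makefile.demo <<'EOF'
mo_optical_props_kernels.o mo_rte_solver_kernels.o mo_optical_props_kernels.o: ; @echo building $@
EOF

# GNU make warns about the duplicate at parse time, then still builds
# the first goal, so this is a warning rather than a hard error.
make -f Makefile.demo 2>&1
```

Because it is only a parse-time warning, the duplicate by itself should not stop the build, which is consistent with the internal compiler error persisting after the line is removed.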
This duplication has been removed in the recent ccs_config_cesm tag (https://github.com/ESMCI/ccs_config_cesm/blob/main/machines/Depends.nvhpc#L68). However, even after removing this line manually in my case and rebuilding, I still got the same error. See the log file below:
We do not need to worry about the Depends.nvhpc-gpu file, as it is only used for a GPU build.
I have a software-stack on NERSC/perlmutter, based on NVHPC 23.9. The 23.9 compiler is located here if you want to use it:
/global/cfs/cdirs/nvendor/nvidia/SHARE.perlmutter/Utils/PGI/23.9/CUDA-12.2.2.0_535.104.05_GCC-11.2.0
If you want to try the whole stack, which includes builds of OpenMPI and NetCDF etc., you can add these to your ~/.profile (and log off and back on to activate):
export LMOD_REDIRECT=yes # Send command output to STDOUT so it can pipe more easily.
export LMOD_IGNORE_CACHE=1 # Try this for now, given that we're constantly updating.
export LMOD_TMOD_FIND_FIRST=1 # Ignore assigned precedence and use path-ordering instead.
# This is essential for PrgEnv's to adjust precedence.
module use --prepend $SHAREDIR/Modules/Deprecated # Assign these in reverse-order, to give
module use --prepend $SHAREDIR/Modules/Legacy # precedence to the most current.
module use --prepend $SHAREDIR/Modules/Latest
module use --append $SHAREDIR/Modules/Bundles # New framework.
module use --append $SHAREDIR/Modules/PrgEnv/*/*
Then run the commands
module load PrgEnv/PGI+OpenMPI/2023-10-05
module avail
which will show this at the top
------ /global/cfs/cdirs/nvendor/nvidia/SHARE.perlmutter/Modules/PrgEnv/PGI+OpenMPI/2023-10-05 -------
bzip2/1.0.8 lapack/3.11.0 openmpi/5.0.0rc13 pmix/4.2.6
cube-lib/4.8.1 libfabric/1.15.2.0 (L) papi/6.0.0 pnetcdf/1.12.3
cuda/12.2.2.0 (L) libunwind/1.6.2 perfmon2/4.13.0 szip/2.1.1
hdf5/1_14_2 netcdf-c/4.9.2 pgi/23.9 (L) zlib/1.3
hwloc/2.9.3 netcdf-f/4.6.1 pio/2_6_2
You can load whichever pieces you want from there. It will deal with the dependencies for you, so
module purge
module load PrgEnv/PGI+OpenMPI/2023-10-05
module avail
module load pio
module list
will give you all of
Currently Loaded Modules:
1) PrgEnv/PGI+OpenMPI/2023-10-05 9) pmix/4.2.6
2) cuda/12.2.2.0 (g,c) 10) openmpi/5.0.0rc13 (mpi)
3) gcc/11.2.0 (c) 11) pnetcdf/1.12.3
4) pgi/23.9 12) hdf5/1_14_2
5) szip/2.1.1 13) netcdf-c/4.9.2
6) zlib/1.3 14) netcdf-f/4.6.1
7) hwloc/2.9.3 15) pio/2_6_2
Next month I can add a bundle for the 23.11 compiler, assuming I can raise the storage-space in the project.
Thanks Carl. I do not have access to the Perlmutter machine, but I could ask the CISL staff to help install it on Derecho. Do you recommend trying nvhpc/23.9 now, or waiting until nvhpc/23.11 is available? Also, does nvhpc/23.9 resolve this compiler bug (https://github.com/ESCOMP/CAM/issues/883)?
@cponder I tried your suggestion on perlmutter -
jedwards@perlmutter:login12:~/cesm2_x_alpha> export LMOD_REDIRECT=yes
jedwards@perlmutter:login12:~/cesm2_x_alpha> export LMOD_IGNORE_CACHE=1
jedwards@perlmutter:login12:~/cesm2_x_alpha> export LMOD_TMOD_FIND_FIRST=1
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --prepend $SHAREDIR/Modules/Deprecated
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --prepend $SHAREDIR/Modules/Legacy
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --prepend $SHAREDIR/Modules/Latest
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --append $SHAREDIR/Modules/Bundles
jedwards@perlmutter:login12:~/cesm2_x_alpha> module use --append $SHAREDIR/Modules/PrgEnv/*/*
jedwards@perlmutter:login12:~/cesm2_x_alpha> module load PrgEnv/PGI+OpenMPI/2023-10-05
Lmod has detected the following error: The following module(s) are unknown:
"PrgEnv/PGI+OpenMPI/2023-10-05"
I isolated the PR#881 problem to the %re and %im complex real- and imaginary-part accessors in the aerosol_optics_cam_sw subroutine by replacing them with the old F77-style syntax DBLE() and DIMAG(). I created a reproducer, located at /glade/u/home/loft/pr881repro.tar.gz, that demonstrates the compilation/non-compilation by toggling the ifdef _RDLFIX. I shared this reproducer with NVIDIA. I should point out that a very simple test of the %re/%im syntax works just fine under nvfortran 23.5, which suggests that the modern Fortran accessors are supported by nvfortran. Subsequently, Jian Sun has verified that CAM builds successfully at cam6_3_137 with nvhpc/23.5 on Derecho after replacing the "aerosol_optics_cam.F90" code with the compiling version of the aerosol_optics_cam module. However, the simulation failed without a clear error message, and the same problem exists for nvhpc/23.1 as well. I propose opening a new issue for the failure to run correctly: perhaps NVIDIA will be able to track down a root cause of the internal compiler error that also fixes the run-time correctness issue.
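A hedged sketch of that substitution (the variable names are hypothetical; only the _RDLFIX macro name follows the reproducer described above). Both branches compute the same real and imaginary parts, and only the intrinsic-based branch sidestepped the internal compiler error in this file:

```fortran
      complex(kind=8) :: z
      real(kind=8)    :: zr, zi
#ifdef _RDLFIX
      ! F77-era intrinsics, accepted by nvfortran here
      zr = dble(z)
      zi = dimag(z)
#else
      ! Fortran 2008 complex-part designators that triggered the ICE
      zr = z%re
      zi = z%im
#endif
```

The two forms are semantically equivalent for double-precision complex values, which is what makes the substitution a safe isolation test.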
@fvitt @cacraigucar I just recalled this pending issue for the NVHPC compiler, and I found that the compilation error in aerosol_optics_cam.F90 only occurred when I was using the FV dycore. Is this code dycore-dependent?
What happened?
I tried to build cam6_3_125 on Derecho with the nvhpc/23.5 compiler (CPU case), but it failed with the following error message:

I could reproduce a similar error on Casper with the nvhpc/22.2 compiler (CPU case). In contrast, I can build cam6_3_124 successfully on Derecho with the nvhpc/23.5 compiler (CPU case). I wonder whether the error above is caused by a compiler bug or a code bug in CAM.
What are the steps to reproduce the bug?
What CAM tag were you using?
cam6_3_125
What machine were you running CAM on?
Other (please explain below)
What compiler were you using?
NVHPC
Path to a case directory, if applicable
No response
Will you be addressing this bug yourself?
No
Extra info
I used Derecho and Casper for testing.