GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

Issues with MAPL 2 and NVHPC 24.5 #2884

Open mathomp4 opened 4 weeks ago

mathomp4 commented 4 weeks ago

As our work with NVHPC moves forward, I thought I'd make an issue here delineating the current state for @tclune, @cponder, and others' benefit.

On bucy, the first issue is, sadly, an internal compiler error (ICE):

[  4%] Building Fortran object shared/CMakeFiles/MAPL.shared.dir/Shmem/Shmem_implementation.F90.o
NVFORTRAN-S-0000-Internal compiler error. import: module pfl_abstracthandlerptrvector (19462,base=19463) member symbol v_swap (offset=944): not found!  (/home/mathomp4/Models/MAPL2-IFX-2024.2/shared/Shmem/Shmem_implementation.F90: 9)
NVFORTRAN-F-0000-Internal compiler error. interf:new_symbol, symbol not found    9419  (/home/mathomp4/Models/MAPL2-IFX-2024.2/shared/Shmem/Shmem_implementation.F90: 9)
NVFORTRAN/x86-64 Linux 24.5-0: compilation aborted

However, I think I might have a first shot at fixing this. If you look at Shmem_implementation.F90, it uses the module procedure style of submodule implementation, which was found to be problematic in #2856 (see https://github.com/GEOS-ESM/MAPL/pull/2856/files#diff-00dc465e797c62da012fc89ef408c7c1f5388fd2480f26f53e1f1a4c479131ef).

So, maybe we need to fix that one up. Sadly it is a LOT of code that needs changing over as that file is super-overloaded. Not hard, but tedious.
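
For readers unfamiliar with the two submodule styles being discussed, here is a minimal sketch of the difference. All names (shapes_mod, shapes_impl, area, perimeter) are hypothetical and not taken from MAPL. It only illustrates the syntax and is not a reproducer for the ICE.

```fortran
! Hypothetical example; none of these names exist in MAPL.
module shapes_mod
   implicit none
   private
   public :: area, perimeter

   ! Interfaces for procedures whose bodies live in a submodule.
   interface
      module function area(w, h) result(a)
         real, intent(in) :: w, h
         real :: a
      end function area

      module function perimeter(w, h) result(p)
         real, intent(in) :: w, h
         real :: p
      end function perimeter
   end interface
end module shapes_mod

submodule (shapes_mod) shapes_impl
   implicit none
contains

   ! Style 1: "module procedure". The arguments and result are not
   ! restated; they are taken from the interface in the parent module.
   module procedure area
      a = w * h
   end procedure area

   ! Style 2: "module function" with the arguments restated in full.
   ! The restated characteristics must match the parent interface.
   module function perimeter(w, h) result(p)
      real, intent(in) :: w, h
      real :: p
      p = 2.0 * (w + h)
   end function perimeter

end submodule shapes_impl

program demo
   use shapes_mod, only: area, perimeter
   implicit none
   print *, area(2.0, 3.0), perimeter(2.0, 3.0)
end program demo
```

Converting a file from style 1 to style 2 means copying each procedure's argument declarations from the parent interface into the submodule, which is why a heavily overloaded file is tedious to change over.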

tclune commented 3 weeks ago

Hmm. I don't even remember doing (assigning) that one.

The alternative is the preferred syntax anyway, so ... assign to @JulesKouatchou

mathomp4 commented 3 weeks ago

An update. @JulesKouatchou kindly provided me with a branch where the submodules are cleaned up and in the "right" style. Sadly, that did not help. But in doing so, I noticed that pflogger is part of this file.

So, if I build with -DBUILD_WITH_PFLOGGER=NO I can get past this. It's like pflogger is "infecting" the build somehow?

I know in the past I always built MAPL without pflogger with NVHPC, but that was usually because we couldn't build pflogger. It seems that now, though we can build pflogger itself, we can't use it. 😞

mathomp4 commented 3 weeks ago

Update. My build with NVHPC last night failed in a couple of files.

First was with time_ave_util.x:

NVFORTRAN-S-0192-Argument number 1 to mapl_grid_interior must be a label (/home/mathomp4/Models/MAPL2-develop-NVHPC/Apps/time_ave_util.F90: 1446)
NVFORTRAN-S-0192-Argument number 2 to mapl_grid_interior must be a label (/home/mathomp4/Models/MAPL2-develop-NVHPC/Apps/time_ave_util.F90: 1446)
NVFORTRAN-S-0192-Argument number 3 to mapl_grid_interior must be a label (/home/mathomp4/Models/MAPL2-develop-NVHPC/Apps/time_ave_util.F90: 1446)
NVFORTRAN-S-0192-Argument number 4 to mapl_grid_interior must be a label (/home/mathomp4/Models/MAPL2-develop-NVHPC/Apps/time_ave_util.F90: 1446)
NVFORTRAN-S-0192-Argument number 5 to mapl_grid_interior must be a label (/home/mathomp4/Models/MAPL2-develop-NVHPC/Apps/time_ave_util.F90: 1446)

I'm not sure what this means, but maybe @tclune and @bena-nasa can see why this line:

https://github.com/GEOS-ESM/MAPL/blob/2b9739803e9e88aafee08771c75ed8b78d3547c6/Apps/time_ave_util.F90#L1446

makes it unhappy.

Next up is an ICE:

NVFORTRAN-F-0000-Internal compiler error. Deferred-length character symbol must have descriptor  211773  (/home/mathomp4/Models/MAPL2-develop-NVHPC/Tests/VarspecDescription.F90: 455)
NVFORTRAN/x86-64 Linux 24.5-0: compilation aborted
gmake[2]: *** [Tests/CMakeFiles/ExtDataDriver.x.dir/build.make:127: Tests/CMakeFiles/ExtDataDriver.x.dir/VarspecDescription.F90.o] Error 2

which is waaay out of my league. Maybe @cponder has seen this ICE?

mathomp4 commented 3 weeks ago

Finally, this morning I started up a serial make -j1 install of MAPL2.

Currently, nvfortran has been taking OVER TWO HOURS to build Regrid_Util.F90, using 100% of a CPU core. That seems...long to me. It's not that complex a program: 577 lines.

JulesKouatchou commented 2 weeks ago

@mathomp4 Was this issue resolved after I broke Shmem/Shmem_implementation.F90 into submodule files?

mathomp4 commented 2 weeks ago

@JulesKouatchou I think that was a red herring. I mean, I don't think it hurt, but it seems like it's a more fundamental issue with pflogger. Your updates made it trigger faster since it got to that point quicker.

I suppose a question is: should we bring in your changes anyway? I mean, it's probably a better style. Question for @tclune, I suppose.

tclune commented 2 weeks ago

I'm not sure I'm following all the details here since multiple tickets are involved. If the preferred style (with arguments) for submodule procedures helps, then yes that change should be made.