NCAR / ccpp-scm

CCPP Single Column Model

Add module files for building SCM with spack-stack on Derecho, Hera, Jet, Orion #406

Closed mkavulich closed 10 months ago

mkavulich commented 11 months ago

This PR introduces modulefiles for building SCM for Derecho (Intel) and Hera (Intel, GNU). It should be fairly easy to add analogous modulefiles for other EPIC-supported platforms, so let me know if that's desired.

I ran the regression test suite and there were some differences on Hera, as expected. These differences were almost entirely at the precision noise level (<1e-10), except for a few tests that had isolated significant differences. The vast majority of diffs across all fields and all tests were exactly 0. Differences from the baseline (compiled from the top of develop with the old shell environment files) can be found in the files in the following directories if anyone wants to take a closer look:

Documentation has been updated in the .tex files, but I haven't been able to re-build the PDF yet. For now instructions for building on Derecho are here:

https://docs.google.com/document/d/1Wg5dBIzwhjoYf6BhgmsUJczPEfxD3dA2yTRS5ftSICk/edit

climbfuji commented 11 months ago

I'll push a branch with an example spack-stack file for hera/intel shortly. The only thing that's missing, I think, is the optional doxygen. We can consider adding that to spack-stack (but I remember it's a bit of a tricky package). We can also consider adding a separate template for the scm so that users who just want that (and not the full ufs-weather-model) only have to build a few libraries. But then again, the ufs-weather-model template should be good enough.

mkavulich commented 10 months ago

@DomHeinzeller @grantfirl I have updated the new modules to all use spack-stack 1.5.1, and also added one for Derecho GNU. I re-ran the regression tests for Hera Intel/GNU and they all passed, but I have not yet compared against the main branch baseline for Hera to ensure close-ish results. I can do that if you'd like; I just haven't had time yet.
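
For anyone doing that baseline comparison by hand, here is a minimal sketch of sorting a field into the three buckets used in the PR description: exactly zero, precision noise, and significant. It assumes the 1e-10 threshold applies to the maximum absolute difference (the thread does not say absolute versus relative), and `classify_diff` is a hypothetical helper, not part of the SCM regression test scripts. NaN values are not handled.

```python
# Hypothetical helper; not part of the ccpp-scm regression test tooling.
# Assumption: "precision noise" means max absolute difference below 1e-10.
NOISE_THRESHOLD = 1e-10


def classify_diff(baseline, candidate, noise=NOISE_THRESHOLD):
    """Classify two equal-length sequences of floats.

    Returns "identical" if every value matches exactly, "noise" if the
    largest absolute difference is below `noise`, else "significant".
    """
    if len(baseline) != len(candidate):
        raise ValueError("fields must have the same length")
    # Largest pointwise absolute difference (0.0 for empty fields).
    max_abs = max((abs(b - c) for b, c in zip(baseline, candidate)), default=0.0)
    if max_abs == 0.0:
        return "identical"
    if max_abs < noise:
        return "noise"
    return "significant"


if __name__ == "__main__":
    print(classify_diff([280.0, 281.5], [280.0, 281.5]))
    print(classify_diff([1.0], [1.0 + 1e-12]))
    print(classify_diff([1.0], [1.001]))
```

In practice the sequences would be flattened output fields read from the baseline and test NetCDF files; how they are read is left out here on purpose.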

I also rebased my branch on the latest develop to fix the CI tests; all now seem to be passing.

grantfirl commented 10 months ago

@mkavulich I've tried loading hera_intel on Hera with this code, and it works fine for me. I'm guessing that we'll need to tell folks to manually set the SCM_ROOT variable or does it make any sense to try to set it via the lua file?

grantfirl commented 10 months ago

@mkavulich @dustinswales How could we use spack-stack for the CI tests? I'm guessing that if we want to switch that over too, we'll do that in a separate PR?

For example, see https://github.com/JCSDA/spack-stack/blob/release/1.5.1/.github/workflows/ubuntu-ci-x86_64.yaml for setting up the environment?

mkavulich commented 10 months ago

> @mkavulich I've tried loading hera_intel on Hera with this code, and it works fine for me. I'm guessing that we'll need to tell folks to manually set the SCM_ROOT variable or does it make any sense to try to set it via the lua file?

I've been meaning to talk to you about that. It seems to me like using this variable is unnecessary complexity, if this is set automatically through a setup script or modulefile why don't we just set it directly in the python script?

grantfirl commented 10 months ago

> > @mkavulich I've tried loading hera_intel on Hera with this code, and it works fine for me. I'm guessing that we'll need to tell folks to manually set the SCM_ROOT variable or does it make any sense to try to set it via the lua file?
>
> I've been meaning to talk to you about that. It seems to me like using this variable is unnecessary complexity, if this is set automatically through a setup script or modulefile why don't we just set it directly in the python script?

Ya, that should work fine. The whole idea of having SCM_ROOT in the first place was to allow for flexibility with respect to where executables are stored and where the output goes. In the run script, we could check whether the SCM_ROOT environment variable exists. If so, use it; if not, find the top-level ccpp-scm directory above where the run script is being called and use that.
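
A rough sketch of that fallback, for illustration only. `resolve_scm_root` is a hypothetical name, and matching on a directory literally called `ccpp-scm` is an assumption (a clone can be named anything), so the actual run script may need a different marker, such as a known file at the top of the repository.

```python
import os
from pathlib import Path


def resolve_scm_root(script_path, environ=None):
    """Return the SCM root directory as a Path.

    Prefers the SCM_ROOT environment variable; otherwise walks up from
    `script_path` to the nearest ancestor directory named "ccpp-scm".
    """
    environ = os.environ if environ is None else environ
    env_value = environ.get("SCM_ROOT")
    if env_value:
        # The user (or a modulefile) set SCM_ROOT explicitly: honor it.
        return Path(env_value)
    # Assumption: the repository was cloned into a directory named "ccpp-scm".
    for parent in Path(script_path).resolve().parents:
        if parent.name == "ccpp-scm":
            return parent
    raise RuntimeError(
        "SCM_ROOT is not set and no 'ccpp-scm' directory was found above "
        f"{script_path}"
    )


if __name__ == "__main__":
    try:
        print(resolve_scm_root(__file__))
    except RuntimeError as err:
        print(err)
```

Passing `environ` explicitly keeps the function easy to test without touching the real environment.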

climbfuji commented 10 months ago

> @mkavulich @dustinswales How could we use spack-stack for the CI tests? I'm guessing that if we want to switch that over too, we'll do that in a separate PR?
>
> For example, see https://github.com/JCSDA/spack-stack/blob/release/1.5.1/.github/workflows/ubuntu-ci-x86_64.yaml for setting up the environment?

You could try to pull the containers we create for JEDI CI; they should have all the dependencies you need (but I agree that making this or any other solution a separate PR is better).

grantfirl commented 10 months ago

@mkavulich I'm running into issues on Derecho. It apparently can't find NetCDF-fortran. You don't get this error?

CMake Error at /glade/work/grantf/ccpp-scm/CMakeModules/Modules/FindNetCDF.cmake:246 (message):
  Unable to properly find NetCDF. Found static libraries at:
  /glade/work/grantf/ccpp-scm/scm/src/NetCDF_Fortran_LIBRARY-NOTFOUND but could not run nc-config:
Call Stack (most recent call first):
  CMakeLists.txt:67 (find_package)

CMake Error at /glade/u/apps/derecho/23.09/spack/opt/spack/cmake/3.26.3/gcc/7.5.0/k34x/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find NetCDF (missing: Fortran) (found version "4.9.2")
Call Stack (most recent call first):
  /glade/u/apps/derecho/23.09/spack/opt/spack/cmake/3.26.3/gcc/7.5.0/k34x/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /glade/work/grantf/ccpp-scm/CMakeModules/Modules/FindNetCDF.cmake:312 (find_package_handle_standard_args)
  CMakeLists.txt:67 (find_package)

I see that the Hera module files have:

load("netcdf-c/4.9.2")
load("netcdf-fortran/4.6.0")

but the Derecho ones do not. Is there a reason?

grantfirl commented 10 months ago

@mkavulich FYI, if I add the netCDF load commands to the Derecho lua files, everything works fine for me.
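
As a quick pre-flight check before running CMake, something like the following could confirm that the NetCDF helper tools are visible in the loaded environment. `nc-config` is the tool named in the error above; `nf-config` is the netcdf-fortran counterpart. Whether FindNetCDF.cmake relies on `nf-config` is an assumption here, and `missing_netcdf_tools` is a hypothetical helper.

```python
import shutil

# nc-config ships with netcdf-c, nf-config with netcdf-fortran.
REQUIRED_TOOLS = ("nc-config", "nf-config")


def missing_netcdf_tools(which=shutil.which):
    """Return the required NetCDF config tools that are not on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_netcdf_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
        print("Check that the netcdf-c and netcdf-fortran modules are loaded.")
    else:
        print("nc-config and nf-config are both on PATH.")
```

A tool being on PATH does not guarantee the matching libraries will be found, so this only catches the simplest case of a module that was never loaded.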

mkavulich commented 10 months ago

@grantfirl Thanks for testing this out. I have had some testing frustrations because module purge does not seem to actually fully purge my environment when running different tests (or maybe there's some cmake caching going on that I don't understand? I'd say 50/50 a platform error or user error). I fully logged out and logged back in and started with a fresh clone and purged environment, and was able to replicate your issue. I pushed those changes for the Derecho Intel and GNU modulefiles, and also added a default value for SCM_ROOT per our other conversation.

climbfuji commented 10 months ago

> @grantfirl Thanks for testing this out, I have had some testing frustrations because module purge does not seem to actually fully purge my environment when running different tests (or maybe there's some cmake caching going on that I don't understand? I'd say 50/50 a platform error or user error). I fully logged out and logged back in and started with a fresh clone and purged environment, and was able to replicate your issue. I pushed those changes for the Derecho Intel and GNU modulefiles, and also added a default value for SCM_ROOT per our other conversation.

To me this is a bit suspicious. The spack py-netcdf4 does depend on netcdf-c, so if that module doesn't get loaded automatically when py-netcdf4 is loaded, then something is off. Loading netcdf-fortran should also automatically load netcdf-c if it isn't loaded yet. But of course it doesn't harm to list the version explicitly.

mkavulich commented 10 months ago

> To me this is a bit suspicious. The spack py-netcdf4 does depend on netcdf-c, so if that module doesn't get loaded automatically when py-netcdf4 is loaded, then something is off. Loading netcdf-fortran should also automatically load netcdf-c if it isn't loaded yet. But of course it doesn't harm to list the version explicitly.

@climbfuji I agree that it is suspicious that this issue is occurring. There appears to be something going on with different hdf5 versions compared to the system default. When you don't run a module purge first, this is the result of loading the current derecho_gnu.lua module:

> module load derecho_gnu

Lmod is automatically replacing "intel/2023.0.0" with "gcc/12.2.0".

Lmod Warning: 
------------------------------------------------------------------------------------------------------
The following dependent module(s) are not currently loaded: hdf5/1.14.0 (required by:
py-netcdf4/1.5.8, netcdf-c/4.9.2)
------------------------------------------------------------------------------------------------------

Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.25     2) craype/2.7.20     3) hdf5/1.12.2     4) ncarcompilers/1.0.0     5) netcdf/4.9.2

The following have been reloaded with a version change:
  1) ncarenv/23.06 => ncarenv/23.09

Running module purge prior to loading makes the load go much more smoothly:

> module load derecho_intel

The following have been reloaded with a version change:
  1) ncarenv/23.06 => ncarenv/23.09

Now, both of those do work, but maybe the warning gives some hint as to why those netcdf modules need to be explicitly loaded.

climbfuji commented 10 months ago

You have to follow exactly the steps in https://spack-stack.readthedocs.io/en/latest/PreConfiguredSites.html#ncar-wyoming-derecho unless you want to set yourself up for trouble:

module purge
# ignore that the sticky module ncarenv/... is not unloaded
export LMOD_TMOD_FIND_FIRST=yes
module load ncarenv/23.09
module use /glade/work/epicufsrt/contrib/spack-stack/derecho/modulefiles
module load ecflow/5.8.4
module load mysql/8.0.33

mkavulich commented 10 months ago

@climbfuji so I guess that means module purge is required for using spack-stack?

I have omitted ecflow and mysql because we don't use those applications. The new modulefiles appear to be working much better (along with doing a purge first); I pushed the updated Derecho files, and I'll make and test those changes for Hera later. @grantfirl can you try again with the latest files on Derecho (remembering to module purge first)?

climbfuji commented 10 months ago

> @climbfuji so I guess that means module purge is required for using spack-stack?
>
> I have omitted ecflow and mysql because we don't use those applications. The new modulefiles appear to be working much better (along with doing a purge first); I pushed the updated Derecho files, and I'll make and test those changes for Hera later. @grantfirl can you try again with the latest files on Derecho (remembering to module purge first)?

Yes - module purge comes first.

mkavulich commented 10 months ago

@climbfuji @grantfirl I am still waiting on help installing LaTeX tools for updating the users guide, but aside from that I think this PR is ready for re-review. I also added modulefiles for Jet and Orion while I was at it since it was simple to add based on the spack-stack instructions Dom sent (I don't have access to any of the other machines).

DomHeinzeller commented 10 months ago

what tools are you missing?

On Nov 27, 2023, at 2:10 PM, Michael Kavulich wrote:

> @climbfuji @grantfirl I am still waiting on help installing LaTeX tools for updating the users guide, but aside from that I think this PR is ready for re-review. I also added modulefiles for Jet and Orion while I was at it since it was simple to add based on the spack-stack instructions Dom sent (I don't have access to any of the other machines).


mkavulich commented 10 months ago

> what tools are you missing?

I'm still breaking in my new laptop so I'm just now finding out the things that don't work. The SCM documentation builds with latexmk, which doesn't seem to exist on this machine. The main problem is I don't have admin privileges on my local machine, so I'm relying on RAL IT to install the needed libraries/packages. The provided "TeXShop" install doesn't appear to work for building LaTeX documents either.

I was hoping to get this resolved on my local machine since that's much more convenient, but I just went and installed latexmk in a conda environment on Derecho this morning. However, latexmk apparently still won't work; I just get errors:

> make
latexmk -f -pdf -pdflatex="pdflatex" -use-make main.tex
Rc files read:
  NONE
Latexmk: This is Latexmk, John Collins, 20 November 2021, version: 4.76.
Latexmk: applying rule 'pdflatex'...
Rule 'pdflatex': File changes, etc:
   Changed files, or newly in use since previous run(s):
      'main.tex'
------------
Run number 1 of rule 'pdflatex'
------------
------------
Running 'pdflatex  -recorder  "main.tex"'
------------
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
 restricted \write18 enabled.

kpathsea: Running mktexfmt pdflatex.fmt
Can't locate mktexlsr.pl in @INC (@INC contains: /glade/work/kavulich/conda-envs/base_env/share/tlpkg /glade/work/kavulich/conda-envs/base_env/share/texmf-dist/scripts/texlive /glade/u/apps/derecho/23.06/opt/perl5lib/lib/perl5/x86_64-linux-thread-multi /glade/u/apps/derecho/23.06/opt/perl5lib/lib/perl5 /glade/work/kavulich/conda-envs/base_env/lib/perl5/5.32/site_perl /glade/work/kavulich/conda-envs/base_env/lib/perl5/site_perl /glade/work/kavulich/conda-envs/base_env/lib/perl5/5.32/vendor_perl /glade/work/kavulich/conda-envs/base_env/lib/perl5/vendor_perl /glade/work/kavulich/conda-envs/base_env/lib/perl5/5.32/core_perl /glade/work/kavulich/conda-envs/base_env/lib/perl5/core_perl .) at /glade/work/kavulich/conda-envs/base_env/bin/mktexfmt line 23.
BEGIN failed--compilation aborted at /glade/work/kavulich/conda-envs/base_env/bin/mktexfmt line 25.
I can't find the format file `pdflatex.fmt'!
Latexmk: fls file doesn't appear to have been made.
Latexmk: Errors, in force_mode: so I tried finishing targets
Collected error summary (may duplicate other messages):
  pdflatex: Command for 'pdflatex' gave return code 1
      Refer to 'main.log' for details
----------------------
This message may duplicate earlier message.
Latexmk: Failure in processing file 'main.tex':
   *LaTeX didn't generate the expected log file 'main.log'
----------------------
make: *** [Makefile:10: main.pdf] Error 12

Google doesn't provide any help here so I'm stumped as to what's going wrong. Can you or @grantfirl help out with this? I can't find any instructions on how I'm supposed to build this document if not simply with make.

climbfuji commented 10 months ago

I was a professional typesetter for the Journal of Theoretical Astrophysics and other scientific journals for a long time, and we all just used pdflatex. Can you try this instead of latexmk? Also, the @INC error is presumably because a Perl extension is missing.

mkavulich commented 10 months ago

I am a bit less experienced as I haven't really used LaTeX since undergrad :)

The problem appears to exist at the pdflatex level (since it is called by latexmk):

I can't find the format file `pdflatex.fmt'!

Maybe an issue with the texlive conda package? https://github.com/conda-forge/texlive-core-feedstock/issues/19

I'm really not inclined to be spending my time debugging this at a deep level since @grantfirl and @dustinswales have been apparently doing this successfully in the past, so I'd like to know how they have done it.

climbfuji commented 10 months ago

> I am a bit less experienced as I haven't really used LaTeX since undergrad :)
>
> The problem appears to exist at the pdflatex level (since it is called by latexmk):
>
> I can't find the format file `pdflatex.fmt'!
>
> Maybe an issue with the texlive conda package? conda-forge/texlive-core-feedstock#19
>
> I'm really not inclined to be spending my time debugging this at a deep level since @grantfirl and @dustinswales have been apparently doing this successfully in the past, so I'd like to know how they have done it.

I wonder if we should just defer the documentation build to CI using readthedocs?

mkavulich commented 10 months ago

Converting to readthedocs would be great! I'd also love to remove the PDF from the repository as it bloats the size. But I think that's beyond the scope of this PR.

grantfirl commented 10 months ago

@mkavulich We've definitely talked about switching over to sphinx/readthedocs, but haven't gotten around to it. I think that both Dustin and I just used the TeXShop app for Mac and generated the PDF locally. I think it uses pdflatex.

grantfirl commented 10 months ago

@mkavulich Here is the PDF of the updated docs if you want to include it in this PR: main.pdf

grantfirl commented 10 months ago

@mkavulich I'd like to re-test this on Hera and Derecho so that we can maybe get this merged today.

grantfirl commented 10 months ago

@mkavulich Can you merge in the latest NCAR/main commit: https://github.com/NCAR/ccpp-scm/commit/1d8894fb0494b7f64f63cefbda222bb0d9b8b12c

grantfirl commented 10 months ago

Everything works with Intel/GNU on Hera/Derecho. @mkavulich I'll approve/merge once this is updated to the latest NCAR/main commit.

mkavulich commented 10 months ago

@grantfirl The branch should now be updated, and I tested one more time on Derecho with Intel. I think it's ready to go 👍

climbfuji commented 10 months ago

Yay! Welcome to the spack-stack user community :-)