E3SM-Ocean-Discussion / E3SM

Ocean discussion repository, for ocean issues and longer-term pull requests for E3SM source code. Please make pull requests that are ready to merge into https://github.com/E3SM-Project/E3SM
https://e3sm.org

Update chicoma-cpu modules #112

Closed xylar closed 1 month ago

xylar commented 1 month ago

Following the recent DST, this merge updates the module files and environment variables on Chicoma-CPU. We note that these updates work well for gnu and nvidia compilers but not yet for intel, which we are continuing to work on. A separate update will be needed to address Chicoma-GPU as well.

xylar commented 1 month ago

This is just a draft for now. I'm having no luck with either gnu or intel on Chicoma-CPU, and I haven't tried anything else yet.

xylar commented 1 month ago

I've contacted LANL IC about the trouble I'm having with gnu:

/lustre/scratch5/xylar/E3SM/scratch/chicoma-cpu/SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu.20241011_152619_r3zumf/bld/e3sm.exe: /opt/cray/pe/gcc-libs/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /lustre/scratch5/xylar/E3SM/scratch/chicoma-cpu/SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu.20241011_152619_r3zumf/bld/e3sm.exe)

While it seems clear that there's an RPATH being set to /opt/cray/pe/gcc-libs, I haven't been able to track down where that's coming from. Setting the LD_LIBRARY_PATH didn't help.
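For anyone hitting the same error, here is a rough sketch of how one might check which GLIBCXX version the loader wants, whether an RPATH or RUNPATH is baked into the executable, and whether the library it resolves to provides that version. The helper names (`required_glibcxx`, `lib_provides`, `show_rpath`) and the `./bld/e3sm.exe` path are assumptions for illustration, not part of E3SM or CIME, and the script assumes `readelf`, `strings` and `ldd` are available on the node.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of E3SM or CIME.

# Extract the first GLIBCXX_x.y.z token from a loader error message.
required_glibcxx() {
    printf '%s\n' "$1" | grep -o 'GLIBCXX_[0-9.]*' | head -n 1
}

# Succeed if the library file contains the given version string.
# Assumes `strings` (binutils) is installed.
lib_provides() {
    strings "$1" | grep -qx "$2"
}

# Print RPATH/RUNPATH entries baked into an executable, if any.
# Assumes `readelf` (binutils) is installed.
show_rpath() {
    readelf -d "$1" | grep -E 'RPATH|RUNPATH'
}

# Shortened copy of the error message from this thread.
msg="e3sm.exe: /opt/cray/pe/gcc-libs/libstdc++.so.6: version \`GLIBCXX_3.4.29' not found"
needed=$(required_glibcxx "$msg")
echo "needed: $needed"

exe=./bld/e3sm.exe                        # assumed path to the built executable
lib=/opt/cray/pe/gcc-libs/libstdc++.so.6  # library named in the error message
if [ -x "$exe" ] && [ -r "$lib" ]; then
    show_rpath "$exe" || echo "no RPATH/RUNPATH entries found"
    ldd "$exe" | grep 'libstdc++' || true
    lib_provides "$lib" "$needed" || echo "$lib lacks $needed"
fi
```

If `show_rpath` lists /opt/cray/pe/gcc-libs, the next step would be to look for where the link line picks it up (compiler wrapper, module, or build flags); this sketch does not attempt that.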

xylar commented 1 month ago

On the intel side, it's not finding NetCDF-C or -Fortran, even though we're passing a NETCDF_PATH environment variable that seems correct.
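As a quick sanity check on the path itself, something like the sketch below could confirm that both the C and Fortran pieces actually sit under the prefix. `check_netcdf_prefix` is a hypothetical helper, not part of E3SM or CIME, and it assumes NetCDF-C and NetCDF-Fortran share one prefix with the usual include/ and lib/ (or lib64/) layout, which may not match how the modules on Chicoma are laid out.

```shell
#!/usr/bin/env bash
# Sketch only: check_netcdf_prefix is a hypothetical helper, not part of E3SM or CIME.
# Assumes NetCDF-C and NetCDF-Fortran are installed under one prefix with the
# usual include/ and lib/ (or lib64/) layout; that may not hold on every machine.
check_netcdf_prefix() {
    local prefix=$1 missing=0 rel stem lib found

    # Header for NetCDF-C and module file for NetCDF-Fortran.
    for rel in include/netcdf.h include/netcdf.mod; do
        if [ ! -e "$prefix/$rel" ]; then
            echo "missing: $prefix/$rel"
            missing=1
        fi
    done

    # Libraries: libnetcdf.* is NetCDF-C, libnetcdff.* is NetCDF-Fortran.
    for stem in libnetcdf libnetcdff; do
        found=0
        for lib in "$prefix"/lib*/"$stem".*; do
            if [ -e "$lib" ]; then found=1; fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: $stem under $prefix/lib*"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: report on whatever NETCDF_PATH is currently set to, if anything.
if [ -n "${NETCDF_PATH:-}" ]; then
    check_netcdf_prefix "$NETCDF_PATH" || echo "NETCDF_PATH looks incomplete"
fi
```

A clean result here would only show that the files exist; it would not explain why the build fails to find them.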

xylar commented 1 month ago

The gnu issue seems similar to https://github.com/E3SM-Project/E3SM/issues/6677

jonbob commented 1 month ago

With the commits I just pushed, I was able to successfully build and run:

So I think at this point we can say we support gnu on chicoma. I'll poke around at intel as well.

xylar commented 1 month ago

@jonbob, I'm trying to run a test:

./create_test SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu --walltime 00:30:00 --wait -p w23_freddy

This looks to be what you ran successfully. But for me it just seems to be hanging. It hasn't got to ocean time stepping yet and there's very little output in the e3sm log file.

Could you have a quick look and let me know if you see anything obvious?

/users/xylar/scratch5/E3SM/scratch/chicoma-cpu/SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu.20241017_103208_7tp6zr

jonbob commented 1 month ago

@xylar -- let me take a peek

jonbob commented 1 month ago

@xylar - it seems to be struggling with the atm data? That doesn't make much sense.

xylar commented 1 month ago

In the meantime, I'm trying an optimized run to see how that goes.

xylar commented 1 month ago

SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu passed for me in the end. It just took 25 minutes and didn't get to time stepping for a long time. It seems like it might be a file system issue with /usr/projects/e3sm.

xylar commented 1 month ago

SMS.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu passed for me as well.

xylar commented 1 month ago

I tested SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_nvidia. It built fine and appeared to be running, but it timed out at the 30-minute walltime I gave it (same file system issues as above). I have a longer test waiting in the queue.

xylar commented 1 month ago

I realize it's not a high priority for us but SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_nvidia passed for me with a longer job runtime.
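For reference, a minimal sketch of resubmitting the same test for both compilers with a longer walltime. The test name, `--walltime`, `--wait` and `-p w23_freddy` are taken from the command earlier in this thread; `build_cmd`, the `RUN` switch and the 01:00:00 value are assumptions for illustration only. By default it just prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: build_cmd and the RUN switch are hypothetical. The test name,
# --walltime, --wait and -p w23_freddy come from this thread; 01:00:00 is an
# assumed walltime, not the value actually used.
base=TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu

build_cmd() {
    # $1 = test type (e.g. SMS_D), $2 = compiler, $3 = walltime
    echo "./create_test $1.${base}_$2 --walltime $3 --wait -p w23_freddy"
}

for compiler in gnu nvidia; do
    cmd=$(build_cmd SMS_D "$compiler" 01:00:00)
    if [ "${RUN:-0}" = 1 ]; then
        $cmd         # submit; run from the directory containing create_test
    else
        echo "$cmd"  # dry run: just print the command
    fi
done
```

Setting `RUN=1` in the environment would submit the tests instead of printing them.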

xylar commented 1 month ago

@jonbob, at the risk of delaying this further, I think we probably want to follow what Noel is doing on Perlmutter (https://github.com/E3SM-Project/E3SM/pull/6702/files). That should at least save us from having to make yet another PR in the near future.

xylar commented 1 month ago

I have gnu and nvidia tests in the queue with the latest updates.

xylar commented 1 month ago

The following both passed:

SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_gnu
SMS_D.TL319_oQU240wLI_ais8to30.MPAS_LISIO_JRA1p5.chicoma-cpu_nvidia

xylar commented 1 month ago

Closed in favor of https://github.com/E3SM-Project/E3SM/pull/6705