NASA-LIS / LISF

Land Information System Framework
Apache License 2.0

Add NLDAS-2 netCDF-4 LIS metforcing reader #1578

Closed. dmocko closed this pull request 2 months ago.

dmocko commented 2 months ago

This pull request adds the NLDAS-2 v020 netCDF-4 LIS metforcing reader into the LIS code.

Because the v002 GRIB-1 NLDAS-2 data and the v020 netCDF-4 NLDAS-2 data are nearly identical (within round-off error), the two metforcing readers in LIS also produce nearly identical results.

Note that as part of this pull request, the "Met forcing sources:" option for the NLDAS-2 v002 GRIB-1 data has changed from "NLDAS2" to "NLDAS2 grib". To use the new v020 netCDF-4 metforcing reader, please use "NLDAS2 netcdf" for "Met forcing sources:".
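
For example (the option values are from this pull request; the exact spacing in lis.config is illustrative):

Met forcing sources:            "NLDAS2 grib"

keeps reading the v002 GRIB-1 data (this entry was formerly "NLDAS2"), while

Met forcing sources:            "NLDAS2 netcdf"

selects the new v020 netCDF-4 reader.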

Resolves: #1577

dmocko commented 2 months ago

Testcase directory forthcoming.

v020 netCDF-4 FORA data: https://doi.org/10.5067/THUF4J1RLSYG

v020 netCDF-4 FORB data: https://doi.org/10.5067/96S0R3LFOBTU

dmocko commented 2 months ago

Testcase available here: /discover/nobackup/projects/lis/tools/pull_requests/1578

jvgeiger commented 2 months ago

I am reviewing it.

jvgeiger commented 2 months ago

Hello David. Would you please rework the "REVISION HISTORY" of the new files?

For example, in timeinterp_nldas20.F90, from

! !REVISION HISTORY:
! 02 Feb 2004: Sujay Kumar;  Initial Specification
! 24 Aug 2007: Chuck Alonge; Modified for use with NLDAS-2 data
! 14 Mar 2014: David Mocko:  Added CAPE and PET forcing from NLDAS-2
! 11 Jul 2024: David Mocko,  Modified for use with netCDF-4 format

to

! !REVISION HISTORY:
! 11 Jul 2024: David Mocko,  Initial Specification

or to

! !REVISION HISTORY:
! 11 Jul 2024: David Mocko,  Initial Specification (derived from timeinterp_nldas2.F90)

Thanks.

dmocko commented 2 months ago

Hello David. Would you please rework the "REVISION HISTORY" of the new files?

Will do.

emkemp commented 2 months ago

Hi @dmocko:

I'm trying to run your code, but I'm confused by the multiple config files. (What is FORB?) Also, I notice the lis.config.nldas.netcdf* files point to /css/nldas/NLDAS2. Do you have a batch script with the SLURM settings to access that directory from the compute nodes?

emkemp commented 2 months ago

@dmocko Disregard the question about FORB, you answered it in the README file. (Note to all: RTFM.)

emkemp commented 2 months ago

@dmocko You can ignore my other question about /css access. LIS can see it just fine on the Milan nodes. (I was probably thinking about a special node to write to /css, but that is not needed for these test cases.)

Notable finding: LIS will typically crash when launched with mpirun, with an MPI error:

Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack: MPIR_Init_thread(189)...: MPID_Init(1547).........: MPIDU_Init_shm_init(180): unable to allocate shared memory

But I don't see that when I use srun. Not sure what the problem is (the domain is 468x228).

emkemp commented 2 months ago

I get bit-wise identical results for the netCDF target output, both w/ and w/o FORB secondary forcing.
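
(For anyone repeating this check: one way to confirm bit-wise identity is nccmp, which appears in the module lists later in this thread; a sketch with placeholder file names:

nccmp -dmfs target/LIS_HIST_202001020000.d01.nc output/LIS_HIST_202001020000.d01.nc

where -d compares the data, -m the metadata, -f continues past differences, and -s reports when the files are identical.)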

dmocko commented 2 months ago

Hi @emkemp

I was traveling yesterday, and wasn't able to respond to your questions.

I added a SLURM script to the pull request directory: /discover/nobackup/projects/lis/tools/pull_requests/1578/nldas.job

It does include this line: #SBATCH --constraint="mil&cssro"

https://www.nccs.nasa.gov/nccs-users/instructional/using-discover/file-system-storage/discover-css

Based on the link above, the "cssro" constraint is listed as being needed, but it's possible that by now all Milan SCU17 nodes have CSS read-only access by default. I had initially run this job without adding the "cssro" constraint.

The "cssrw" constraint is only available for "datamove" nodes.

Are you still getting that crash with mpirun? Can you try the above SLURM script as well?

dmocko commented 2 months ago

But I don't see that when I use srun. Not sure what the problem is (the domain is 468x228).

Minor point, but the NLDAS-2 domain is 464 by 224.

dmocko commented 2 months ago

What is the reason for changing the indentation in the code in commit "0993d3c" and the follow-on commit "303dfe1"?

Is it just personal preference, or do these changes add something to the code?

If it's just personal preference, I would prefer that these commits be backed out, as I wrote this code and will maintain it going forward.

emkemp commented 2 months ago

Hi @dmocko

I admit "303dfe1" was probably me being too opinionated. I have no objection to reverting that commit (though I'll need Jim's help on doing that for both GitHub and my local repo on Discover).

But I'm digging in on "0993d3c". Fixed format wastes space and interferes with seeing how code is nested within the larger file. This is particularly true in nldas20_forcingMod.F90, where most of the code inside each subroutine has the same indentation as the subroutine declaration, which in turn has the same indentation as the parent module declaration. Also, member variables in the nldas20_type_dec type have the same indentation as the type declaration (this specific code confused me when I first looked at it).

Fixed format dates back to the punch card era, and it's a compromise that is no longer needed. Certainly not with brand new code written in-house.
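
To make the comparison concrete, this is the kind of nesting-following indentation I'm arguing for; the member and routine names below are placeholders, not the actual contents of nldas20_forcingMod.F90:

module nldas20_forcingMod
  implicit none

  type nldas20_type_dec
     ! members indented one level deeper than the type statement
     integer :: ncold   ! placeholder member, for illustration only
     integer :: nrold   ! placeholder member, for illustration only
  end type nldas20_type_dec

contains

  subroutine init_nldas20()
     ! body indented one level deeper than the subroutine statement,
     ! which itself is one level deeper than the module statement
     write(*,*) 'indentation follows the nesting'
  end subroutine init_nldas20

end module nldas20_forcingMod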

dmocko commented 2 months ago

@emkemp Nope - if you insist on keeping "0993d3c", then I like "303dfe1" as well. So keep them both.

jvgeiger commented 2 months ago

Would you please give me the ldt.config file that you used to make your lis_input.nldas.noahmp401.nc file?

dmocko commented 2 months ago

Would you please give me the ldt.config file that you used to make your lis_input.nldas.noahmp401.nc file?

Also placed in: /discover/nobackup/projects/lis/tools/pull_requests/1578

In the same directory, I also placed a "data" directory with a required mask file. Please compile LDT big_endian for the mask.

dmocko commented 2 months ago

@dmocko You can ignore my other question about /css access. LIS can see it just fine on the Milan nodes. (I was probably thinking about a special node to write to /css, but that is not needed for these test cases.)

Notable finding: LIS will typically crash when launched with mpirun, with an MPI error:

Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack: MPIR_Init_thread(189)...: MPID_Init(1547).........: MPIDU_Init_shm_init(180): unable to allocate shared memory

But I don't see that when I use srun. Not sure what the problem is (the domain is 468x228).

Hi @emkemp - Are you still getting this error when using mpirun? Have you tried the nldas.job SLURM script I shared?

emkemp commented 2 months ago

Hi @dmocko

It was still crashing with mpirun. I didn't try your job script (no modules were defined), but I did tweak the SBATCH settings.

emkemp commented 2 months ago

@dmocko

I tried your nldas.job (tweaking the lis.config filename, and commenting out the call to dmm_get_sbus.bash), and it again crashes with mpirun.

Abort(1090063) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)...: MPID_Init(1547).........: MPIDU_Init_shm_init(180): unable to allocate shared memory

(The same message appears four more times in the job output.)

jvgeiger commented 2 months ago

@dmocko Thanks for the LDT config file. I am unable to reproduce your domain and parameter file. Do you still have the LDT log files?

jvgeiger commented 2 months ago

@emkemp I was able to run this pull request last week. I will pull the latest commits and try again.

dmocko commented 2 months ago

@emkemp I compiled and ran using these modules:

Currently Loaded Modules:
  1) comp/gcc/13.2.0   2) comp/intel/2023.2.1   3) mpi/impi/2021.11   4) git/2.42.0   5) lisf_7.5_intel_2023.2.1

I also copied my configure.lis file into the pull_requests/1578 directory.

dmocko commented 2 months ago

@dmocko Thanks for the LDT config file. I am unable to reproduce your domain and parameter file. Do you still have the LDT log files?

Amazingly, I do! I also put them in: /discover/nobackup/projects/lis/tools/pull_requests/1578

Obviously, my lis_input and ldt.config files are 5+ years old. I will try to generate the lis_input file again, and then run the new NLDAS-2.0 forcing code again with it to confirm that it still works.

dmocko commented 2 months ago

Obviously, my lis_input and ldt.config files are 5+ years old. I will try to generate the lis_input file again, and then run the new NLDAS-2.0 forcing code again with it to confirm that it still works.

Actually, I just ran the ldt.config file again on Milan, and got the same exact ldtlog file. And the lis_input file is identical, other than some points in the "HYMAP_grid_elevation" field. But since I'm not running HYMAP for this testcase, it's not an issue. Do you see other fields that are different?

I put the new lis_input file here: /discover/nobackup/projects/lis/tools/pull_requests/1578/newldt
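
(If it helps for the comparison: a hypothetical nccmp invocation that skips that field would be something like

nccmp -dmfs -x HYMAP_grid_elevation lis_input.nldas.noahmp401.nc newldt/lis_input.nldas.noahmp401.nc

with -x excluding the named variable; the relative paths are placeholders.)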

dmocko commented 2 months ago

I just pulled the latest commits, and was able to run without any issue using "mpirun" and my SLURM script.

emkemp commented 2 months ago

Huh. When I recompile from scratch with my older SLES15 environment, nldas.job runs w/o issue.

The module list is:

  1) comp/gcc/13.2.0            6) nccmp/1.8.6.5
  2) comp/intel/2023.2.1        7) nco/5.1.7
  3) mpi/impi/2021.11           8) ncview/2.1.7
  4) git/2.42.0                 9) xpdf/4.04
  5) lisf_7.5_intel_2023.2.1   10) xxdiff/3.2.0

I'll experiment more with my newer SLES15 environment.

emkemp commented 2 months ago

Huh part 2. I recompiled in my new SLES15 environment:

  1) comp/gcc/13.2.0               7) nccmp/1.8.6.5
  2) comp/intel/2023.2.1           8) nco/5.1.7
  3) mpi/impi/2021.11              9) ncview/2.1.7
  4) lisf_7.6_intel_2023.2.1_emk  10) xpdf/4.04
  5) ImageMagick/7.1.1-15         11) xxdiff/3.2.0
  6) git/2.45.0

And nldas.job ran successfully 3 times. Maybe I was getting a bad node a few days ago?

emkemp commented 2 months ago

...aaand I get a crash on test 4.

But, the 3 successful tests with my new environment executable were launched (sbatched) in my old environment. This latest test was launched in the new environment.

I wonder if there is a problem introduced by the ImageMagick/7.1.1-15 module. I will investigate.

jvgeiger commented 2 months ago

And the lis_input file is identical, other than some points in the "HYMAP_grid_elevation" field.

That is what I saw as well. Thanks.

emkemp commented 2 months ago

Grrr. Removing ImageMagick/7.1.1-15 doesn't solve the problem. Worse, the most recent failure doesn't produce any MPI error message; it just stops with an error code. (I changed the execution line to "mpirun -np 252 ./LIS -f lis.config || exit 1".) There is no output or lislog, but the SLURM sjobexitmod command clearly indicates a runtime failure.

[emkemp@discover32 /discover/nobackup/projects/usaf_lis/emkemp/pr/1578/LISF/lis]$ sjobexitmod -l 36405759
       JobID    Account   NNodes         NodeList      State  ExitCode  DerivedExitCode  Comment
    36405759      s1189        2   borgl[103,105]     FAILED       1:0              0:0

dmocko commented 2 months ago

@emkemp - But is this a reason not to accept this pull request? It runs with a standard environment.

We need this PR merged into master so it's available as soon as possible for operational users.

Sometime in August, the NLDAS-2 GRIB-1 stream will be turned off at the GES DISC, and users need to transition to the netCDF-4 stream.

emkemp commented 2 months ago

@dmocko Since I'm the only one experiencing this problem, I won't hold up the pull request over it. It's still possible there is something in my environment that is interfering. I just would like to know why this is happening. @jvgeiger and I will discuss at 2 pm today.

emkemp commented 2 months ago

@dmocko Jim and I are satisfied. Pull request APPROVED.