fsurdat: PCT_SAND, PCT_CLAY, ORGANIC differ with different PE layouts on derecho

slevis-lmwg commented 4 months ago

Brief summary of bug

I ran mksurfdata_esmf on derecho to generate fsurdat/landuse files for the VR grids ne0np4CONUS, ne0np4.ARCTIC, and ne0np4.ARCTICGRIS (PR #2490, issue #2487). Accidentally, I tried two PE layouts:

I see no diffs in the landuse files.

I see diffs in the fsurdat files. The fsurdat files show the different number of tasks used:

<               :Host = "derecho7" ;
<               :Number-of-tasks = 256 ;
---
>               :Host = "derecho6" ;
>               :Number-of-tasks = 1152 ;

Possibly related to issue #2430.

General bug information

CTSM version you are using: ctsm5.2.001

Does this bug cause significantly incorrect results in the model's science? Maybe

Configurations affected: All ctsm5.2.0 and newer, as well as hacked simulations that use 5.2 fsurdat files

Details of bug

I used /glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc -m <file1> <file2> to get info like this:

 PCT_SAND   (gridcell,nlevsoi)
        281  1523900  ( 38260,     7) ( 77809,     1) ( 30259,     5) ( 30260,     8)
             1523900   9.500000000000000E+01   8.000000000000000E+00 4.5E+01  8.800000000000000E+01 4.2E-05  1.500000000000000E+01
             1523900   9.500000000000000E+01   8.000000000000000E+00          4.300000000000000E+01          4.300000000000000E+01
             1523900  ( 38260,     7) ( 77809,     1)
          avg abs field values:    4.507733154296875E+01    rms diff: 2.2E-01   avg rel diff(npos):  4.2E-05
                                   4.507754516601562E+01                        avg decimal digits(ndif):  0.8 worst:  0.2
 RMS PCT_SAND                         2.1765E-01            NORMALIZED  4.8284E-03

 PCT_CLAY   (gridcell,nlevsoi)
        269  1523900  ( 30936,     8) ( 38260,     1) ( 30260,     8) ( 42656,     6)
             1523900   7.400000000000000E+01   2.000000000000000E+00 4.6E+01  6.400000000000000E+01 6.5E-05  3.400000000000000E+01
             1523900   7.400000000000000E+01   2.000000000000000E+00          1.800000000000000E+01          6.000000000000000E+00
             1523900  ( 30936,     8) ( 38260,     1)
          avg abs field values:    1.737113952636719E+01    rms diff: 1.9E-01   avg rel diff(npos):  6.5E-05
                                   1.737069702148438E+01                        avg decimal digits(ndif):  0.6 worst:  0.1
 RMS PCT_CLAY                         1.9311E-01            NORMALIZED  1.1117E-02

 ORGANIC   (gridcell,nlevsoi)
        290  1523900  ( 36565,     5) (     1,     1) ( 42634,     8) ( 30207,     1)
             1523900   2.974772033691406E+02   0.000000000000000E+00 1.7E+02  1.733897705078125E+02 1.6E-04  4.729569244384766E+01
             1523900   2.974772033691406E+02   0.000000000000000E+00          0.000000000000000E+00          0.000000000000000E+00
             1523900  ( 36565,     5) (     1,     1)
          avg abs field values:    1.125364875793457E+01    rms diff: 8.5E-01   avg rel diff(npos):  1.6E-04
                                   1.124512195587158E+01                        avg decimal digits(ndif):  0.1 worst:  0.0
 RMS ORGANIC                          8.4824E-01            NORMALIZED  7.5403E-02
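For readers unfamiliar with these summaries, here is a minimal Python sketch of the kind of statistics involved. It is not cprnc's implementation. It assumes the NORMALIZED value is the RMS difference divided by the average absolute field value, which is consistent with the numbers above (for PCT_SAND, 2.1765E-01 / 4.5077E+01 ≈ 4.83E-03). The function name `compare_fields` and the sample values are hypothetical.

```python
import math

def compare_fields(a, b):
    """Summarize differences between two equally sized flat lists of values."""
    if not a or len(a) != len(b):
        raise ValueError("fields must be non-empty and have the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    n_diff = sum(1 for d in diffs if d > 0.0)
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # Assumption: normalize by the mean absolute value, averaged over both fields.
    avg_abs = 0.5 * (sum(abs(x) for x in a) + sum(abs(y) for y in b)) / len(a)
    normalized = rms / avg_abs if avg_abs > 0.0 else 0.0
    return {
        "n_diff": n_diff,
        "max_abs_diff": max(diffs),
        "rms": rms,
        "normalized": normalized,
    }

# Hypothetical data: one of four values differs, but by a large amount.
a = [95.0, 8.0, 43.0, 50.0]
b = [95.0, 8.0, 88.0, 50.0]
stats = compare_fields(a, b)
print(stats)
```

A small `n_diff` combined with a large `max_abs_diff` is the pattern reported here: few values differ, but where they do, the difference can be large.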

@ekluzek proposed this follow-up: Perform testing with f09 to make the differences easier to visualize (VR grids are unstructured and difficult to view).

slevis-lmwg commented 4 months ago

I have submitted a 4-node job and an 8-node job:

qsub mksurfdata_jobscript_single
qsub mksurfdata_jobscript_single_8nodes.sh

in /glade/work/slevis/git/latest_master/tools/mksurfdata_esmf
git describe: ctsm5.2.003

The jobs point to

surfdata_0.9x1.25_hist_2000_78pfts_c240506.namelist
surfdata_0.9x1.25_hist_2000_78pfts_c240506b.namelist

slevis-lmwg commented 4 months ago

First I compare two files that I expect (hope) to be identical because derecho generated them on the same number of nodes. I'm relieved to find that they are indeed identical:

surfdata_0.9x1.25_hist_2000_78pfts_c240506
surfdata_0.9x1.25_hist_2000_78pfts_c240216

Next I compare the two files that I generated today:

surfdata_0.9x1.25_hist_2000_78pfts_c240506b
surfdata_0.9x1.25_hist_2000_78pfts_c240506

and find diffs as shown in the following sample ncview images.

slevis-lmwg commented 4 months ago

[Image: surfdata_f09_2000_78pfts_c240506b-a_pctsand_nlevsoi0]

slevis-lmwg commented 4 months ago

landmask for reference

[Image: surfdata_f09_2000_78pfts_c240506_pctocn]

slevis-lmwg commented 4 months ago

My assessment of this visual examination: A very small number of grid cells show differences at f09, but differences in those locations can be large.

ekluzek commented 3 weeks ago

This seems unlikely to be something we can work on and resolve by ctsm5.3.0. Since the number of gridcells affected is small, that might be OK, but the fact that the differences are large is concerning.

samsrabin commented 3 weeks ago

I wonder if it's related to my "ambiguous nearest neighbors" issue: ESMF issue #276: For nearest-neighbor remapping, ensure results are independent of processor count if there are equidistant source points

You can test by shifting the input datasets by a tiny amount (I used 1e-6°).
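To illustrate the hypothesis, here is a toy Python sketch of how a nearest-neighbor search could pick different source points depending on how the sources are split across tasks, and how a tiny shift removes the ambiguity. This is not ESMF's algorithm. It assumes planar distances, a round-robin decomposition, and ties resolved in favor of the first candidate seen; all names and values are hypothetical.

```python
import math

def local_nearest(dst, chunk):
    """Nearest source within one task's chunk; ties keep the first one seen."""
    best = (math.inf, None)
    for x, y, value in chunk:
        d = math.hypot(x - dst[0], y - dst[1])
        if d < best[0]:  # strict comparison, so an equidistant later point loses
            best = (d, value)
    return best

def nearest_decomposed(dst, sources, ntasks):
    """Split sources round-robin over ntasks, then reduce in rank order."""
    chunks = [sources[rank::ntasks] for rank in range(ntasks)]
    best = (math.inf, None)
    for candidate in (local_nearest(dst, chunk) for chunk in chunks):
        if candidate[0] < best[0]:
            best = candidate
    return best[1]

dst = (0.0, 0.0)
# Two sources (values 10 and 90) are exactly equidistant from dst.
sources = [(5.0, 5.0, 50), (-1.0, 0.0, 10), (1.0, 0.0, 90)]
# Shift every source longitude by a tiny amount to break the tie.
shifted = [(x + 1e-6, y, v) for x, y, v in sources]

for ntasks in (1, 2):
    print(ntasks, nearest_decomposed(dst, sources, ntasks),
          nearest_decomposed(dst, shifted, ntasks))
```

With the tie in place, 1 task returns 10 and 2 tasks return 90. After the shift, both task counts return 10.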

wwieder commented 2 weeks ago

This would be nice to fix, but is likely related to the ESMF issue @samsrabin noted about nearest neighbor issues with different PE counts. Falls in the quality of life category (for now), but should be addressed by the CESM3 release.

If this is a quick fix, it would let us create more accurate 5.3 surface data. Let's not spend more than (roughly) half a day of active time testing this to see if it works and then implementing it.

slevis-lmwg commented 2 weeks ago

I likely deleted earlier samples of this problem, so I have generated new ones in /glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf

File generated with 512 tasks: surfdata_0.9x1.25_hist_1850_78pfts_c240826.nc
File generated with 256 tasks: surfdata_0.9x1.25_hist_1850_78pfts_c240826b.nc

In the same directory I placed the difference between these two files: b-a.nc

slevis-lmwg commented 1 week ago

My latest test still fails unfortunately. I generated an fsurdat file four times as follows:

/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_oldmesh1
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_oldmesh2
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_newmesh1
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_newmesh2

where suffixes 1 used 512 tasks and suffixes 2 used 256 tasks and where old (default) mesh and new (tweaked) mesh are, respectively:

11c11
<   mksrf_fsoitex_mesh = '/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/mappingdata/grids/UNSTRUCTgrid_5x5min_nomask_cdf5_c200129.nc'
---
>   mksrf_fsoitex_mesh = '/glade/work/samrabin/5x5_meshfile_tweaked/UNSTRUCTgrid_5x5min_nomask_cdf5_c200129.tweaked_latlons.nc'

I used ncdiff and found that the tweaked files differ similarly to the way that the default files differ.

@samsrabin thank you for the time that you put into trying out your hypothesis. I don't know whether this result rules out your hypothesis or whether there is more experimentation that could be done. What are your thoughts? Either way, we will probably need to follow up post ctsm5.3.

samsrabin commented 1 week ago

Thanks for checking, @slevis-lmwg. Let's plan to do another test once the ESMF bug is fixed—I think your latest test shows that's not the issue, but maybe worth a shot.

billsacks commented 1 week ago

As @wwieder pointed out in https://github.com/ESCOMP/CTSM/issues/2744#issuecomment-2334866843, the fix in https://github.com/slevis-lmwg/ctsm/pull/9 is likely to resolve this issue.

ekluzek commented 1 week ago

Thanks @billsacks, I was hoping that might be the case. I'll redo the testing that @slevis-lmwg did and see if that's correct.

ekluzek commented 1 week ago

Hurray! I tried https://github.com/slevis-lmwg/ctsm/pull/9 out for f09-1850 with 256 and 128 processors and am now getting identical results between the two. So this is really good news!
