slevis-lmwg opened 4 months ago
I have submitted a 4-node job and an 8-node job:
qsub mksurfdata_jobscript_single
qsub mksurfdata_jobscript_single_8nodes.sh
in /glade/work/slevis/git/latest_master/tools/mksurfdata_esmf
git describe: ctsm5.2.003
The jobs point to
surfdata_0.9x1.25_hist_2000_78pfts_c240506.namelist
surfdata_0.9x1.25_hist_2000_78pfts_c240506b.namelist
First I compare two files that I expect (hope) to be identical because derecho generated them on the same number of nodes. I'm relieved to find that they are indeed identical:
surfdata_0.9x1.25_hist_2000_78pfts_c240506
surfdata_0.9x1.25_hist_2000_78pfts_c240216
Next I compare the two files that I generated today:
surfdata_0.9x1.25_hist_2000_78pfts_c240506b
surfdata_0.9x1.25_hist_2000_78pfts_c240506
and find diffs as shown in the following sample ncview images.
landmask for reference
My assessment from this visual examination: a very small number of grid cells show differences at f09, but the differences in those locations can be large.
This seems like an unlikely thing to be able to work on and resolve by ctsm5.3.0. Since the number of grid cells affected is small, that might be OK, but the fact that the differences are large is concerning.
I wonder if it's related to my "ambiguous nearest neighbors" issue: ESMF issue #276: For nearest-neighbor remapping, ensure results are independent of processor count if there are equidistant source points
You can test by shifting the input datasets by a tiny amount (I used 1e-6°).
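As a toy illustration of the tie-breaking idea (plain numpy, not ESMF code; the points and the shift here are made up for demonstration):

```python
import numpy as np

# A destination point exactly equidistant from two source points:
# which one "nearest neighbor" selects can depend on traversal order,
# and that order can change with the processor decomposition.
src = np.array([[-1.0, 0.0], [1.0, 0.0]])  # two candidate source points
dst = np.array([0.0, 0.0])                 # destination, equidistant from both

d = np.linalg.norm(src - dst, axis=1)
print(np.isclose(d[0], d[1]))  # True: a tie, so the winner is ambiguous

# Shifting the source coordinates by a tiny amount (like the 1e-6 deg
# shift above) breaks the tie, so the nearest neighbor is unique and
# no longer depends on processor count.
src_shifted = src + 1e-6
d2 = np.linalg.norm(src_shifted - dst, axis=1)
print(np.argmin(d2))  # now a single unambiguous nearest point
```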
This would be nice to fix, but is likely related to the ESMF issue @samsrabin noted about nearest neighbor issues with different PE counts. Falls in the quality of life category (for now), but should be addressed by the CESM3 release.
If this is a quick fix, it would let us create more accurate 5.3 surface data. Let's not spend more than half a day of active time (roughly) testing this to see if it works and then implementing it.
I likely deleted earlier samples of this problem, so I have generated new ones in
/glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf
File generated with 512 tasks: surfdata_0.9x1.25_hist_1850_78pfts_c240826.nc
File generated with 256 tasks: surfdata_0.9x1.25_hist_1850_78pfts_c240826b.nc
In the same directory I placed the difference between these two files: b-a.nc
Unfortunately, my latest test still fails. I generated an fsurdat file four times, as follows:
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_oldmesh1
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_oldmesh2
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_newmesh1
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_newmesh2
where suffix 1 used 512 tasks and suffix 2 used 256 tasks, and where the old (default) mesh and new (tweaked) mesh are, respectively:
11c11
< mksrf_fsoitex_mesh = '/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/mappingdata/grids/UNSTRUCTgrid_5x5min_nomask_cdf5_c200129.nc'
---
> mksrf_fsoitex_mesh = '/glade/work/samrabin/5x5_meshfile_tweaked/UNSTRUCTgrid_5x5min_nomask_cdf5_c200129.tweaked_latlons.nc'
I used ncdiff and found that the tweaked files differ similarly to the way that the default files differ.
@samsrabin thank you for the time that you put into trying out your hypothesis. I don't know whether this result rules out your hypothesis or whether there is more experimentation that could be done. What are your thoughts? Either way, we will probably need to follow up post ctsm5.3.
Thanks for checking, @slevis-lmwg. Let's plan to do another test once the ESMF bug is fixed. I think your latest test shows that's not the issue, but it may be worth a shot.
As @wwieder pointed out in https://github.com/ESCOMP/CTSM/issues/2744#issuecomment-2334866843, the fix in https://github.com/slevis-lmwg/ctsm/pull/9 is likely to resolve this issue.
Thanks @billsacks. I was hoping that might be the case. I'll redo the testing that @slevis-lmwg did and see if that's correct.
Hurray! I tried https://github.com/slevis-lmwg/ctsm/pull/9 for f09-1850 with 256 and 128 processors and am now getting identical results between the two. So this is really good news!
Brief summary of bug
I ran mksurfdata_esmf on derecho to generate fsurdat/landuse files for the VR grids ne0np4CONUS, ne0np4.ARCTIC, and ne0np4.ARCTICGRIS (PR #2490, issue #2487). By accident, I tried two PE layouts:
Possibly related to issue #2430.
General bug information
CTSM version you are using: ctsm5.2.001
Does this bug cause significantly incorrect results in the model's science? Maybe
Configurations affected: All ctsm5.2.0 and newer, as well as hacked simulations that use 5.2 fsurdat files
Details of bug
I used
/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc -m <file1> <file2>
to get info like this:
@ekluzek proposed this follow-up: perform testing with f09 to make the results easier to visualize (the VR grids are unstructured and difficult to view).
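For anyone without cprnc handy, the gist of that comparison can be sketched in a few lines of numpy (the arrays here are synthetic stand-ins; real fsurdat fields would be read with netCDF4 or xarray, and the field name is hypothetical):

```python
import numpy as np

# Synthetic stand-ins for one field from each fsurdat file; in practice
# these would come from, e.g., netCDF4.Dataset(path)["PCT_NATVEG"][:].
rng = np.random.default_rng(0)
field_a = rng.random((4, 5))
field_b = field_a.copy()
field_b[2, 3] += 0.5  # one differing grid cell, mimicking the bug

diff = np.abs(field_b - field_a)
print("cells differing:", int(np.count_nonzero(diff)))  # few cells affected...
print("max abs diff:   ", float(diff.max()))            # ...but the diff is large
```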