ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Mesh file for sparse grid for the NUOPC coupler #1731

Status: Open (opened by ekluzek 2 years ago)

ekluzek commented 2 years ago

We need a mesh file that can be used with the NUOPC coupler for the sparse grid.

Here's a sample case for the MCT coupler:

/glade/work/oleson/PPE.n11_ctsm5.1.dev030/cime/scripts/ctsm51c8BGC_PPEn11ctsm51d030_2deg_GSWP3V1_Sparse400_Control_2000

slevis-lmwg commented 2 years ago

@ekluzek in case it's relevant: I wrote instructions in this google doc for creating a high-res sparse grid mesh file for #1773. Search for "mesh" in the doc to find the relevant section.

Very briefly:

1) To start we need a file containing 1D or 2D variables of latitude and longitude for the grid of interest. If such a file exists, I would be happy to try to generate the mesh file.

2) If the lat/lon file also includes a mask of the sparse grid, we would then run mesh_mask_modifier to get that mask into the mesh file.

ekluzek commented 2 years ago

Using the new mesh_modifier tool, I was able to get a mesh file from the domain file. The mesh file for the atm forcing is different, though, in that it's not a modified 2D grid but a simple list of 400 points. So I need to create a SCRIP grid file that describes that list of points from the domain file, and then convert it into ESMF mesh format.

slevis-lmwg commented 2 years ago

So this is what you need to do:

  1. Generate scrip.nc from <file_with_lat_lon_2d>.nc, which I think is the domain file that you mentioned: ncks --rgr infer --rgr scrip=scrip.nc <file_with_lat_lon_2d>.nc foo.nc (foo.nc contains only metadata and will not be used)
  2. Generate mesh file from scrip.nc
    module load esmf
    ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0

    (This mesh file’s mask = 1 everywhere)

  3. At this point I suspect that you need to update this mesh file's mask using the mesh_modifier tool to distinguish between land and ocean (a quick check of the mesh mask is sketched below).
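
As a quick sanity check (a sketch; assumes the standard netCDF command-line utilities are available), you can confirm that elementMask really is 1 everywhere before you run the modifier, and that it changes where you expect afterwards:

    ncdump -v elementMask lnd_mesh.nc | tail
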
ekluzek commented 2 years ago

Awesome, thanks @slevisconsulting, the above helped me get mesh files created. I got everything set up Friday, but when I run the case it's failing. So I need to debug what's happening and get a working case. The mesh files I created are in...

/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17

Hopefully, the crash I'm seeing is something simple I can figure out.

ekluzek commented 2 years ago

The crash has to do with the connectivity on the forcing grid, which is just a list of the 400 points. The suggestion from ESMF is that I make each forcing grid point's vertices just a tiny bit around its cell center.

Because of the time this project is taking, I also plan to bring this work to master as a user-mod and add a test for it.

ekluzek commented 2 years ago

We talked about this at the standup this morning. An idea I got there was to try it with the new land mesh but without the atm forcing mesh. I tried that and it works. So there's something going on with the new forcing mesh that only has the 400 points, which is something I did suspect.

ekluzek commented 2 years ago

OK, I got a case to work! I couldn't use ncks to make the SCRIP grid file, because it would "correct" my vertices to turn it into a regular grid. I was able to use curvilinear_to_SCRIP inside NCL to write out a SCRIP grid file that I could then convert to a working mesh file. Using unstructured_to_ESMF inside NCL didn't generate a mesh that I could use. One clue in the final mesh file is that the nodeCount was 1600 (4x the number of points, 400), which shows that all of the points are isolated from each other. The mesh files that did NOT work all had fewer total nodes than that, which means cells shared nodes with each other.
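
For anyone checking their own mesh file, that signature is quick to spot with ncdump (a sketch; elementCount and nodeCount are the dimension names in the ESMF mesh files discussed here):

    ncdump -h lnd_mesh.nc | grep -E "elementCount|nodeCount"

For a fully disconnected 400-point mesh you want nodeCount = 4 x elementCount = 1600; a smaller nodeCount means neighboring cells share nodes.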

slevis-lmwg commented 2 years ago

@ekluzek I understand that you used curvilinear_to_SCRIP instead of ncks, but I didn't follow what you ran to go from SCRIP to successful MESH.

wwieder commented 1 year ago

Seems like this connects with #1919 too. In general we need better documentation on how to do this.

slevis-lmwg commented 1 year ago

@ekluzek and I met to compare notes (this issue #1731 vs. discussion #1919):

adrifoster commented 1 year ago

@ekluzek and @slevis-lmwg I tried running a sparse grid simulation using the steps we talked about on the call today. I got a bunch of ESMF errors. It seems like I should be following what @ekluzek did above?

I'm not sure how to do what you did above, Erik. Do you remember the steps you took?

My log files for the case can be found at:

/glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000/run

ekluzek commented 1 year ago

It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for MASK_MESH works. I think it should, so that would be good to try.

Another thing to try would be the datm mesh file I created:

/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/360x720_gswp3.0v1.c170606v4.dense400_ESMFmesh_c20221012.nc

It does differ from your file.

From reading above, I ran into trouble with just using ncks because it would change the vertices on me, so I couldn't use files created with it. I think that might be related to the warnings we saw when we worked on this that showed issues at the south pole.

(By the way, the unclear messages from ESMF are another example of error checking that doesn't help you figure out the problem. This might be a place where some better error checking could be added to help us figure out what's wrong.)

ekluzek commented 1 year ago

You can also try the land mesh file I created...

/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc

NOTE: For this you would use it for ATM_DOMAIN_MESH and LND_DOMAIN_MESH, and leave MASK_MESH as it was before.

OR -- you would reverse the mask and use it for MASK_MESH, and leave LND_DOMAIN_MESH/ATM_DOMAIN_MESH as they were.

I set it up changing ATM_DOMAIN_MESH/LND_DOMAIN_MESH because that is what made sense to me. But, as we saw when talking with @slevis-lmwg, it's more general and simpler to swap out the MASK_MESH file.
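
For reference, the two alternatives can be set with xmlchange from the case directory (a sketch; the path is the land mesh file above, and the reversed-mask file is a placeholder):

    # Option A: point the domain meshes at the sparse-grid mesh, leave MASK_MESH alone
    ./xmlchange ATM_DOMAIN_MESH=/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc
    ./xmlchange LND_DOMAIN_MESH=/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc

    # Option B: reverse the mask and swap out MASK_MESH instead, leaving the domain meshes alone
    ./xmlchange MASK_MESH=<reversed-mask-mesh-file>.nc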

That mesh file is also different from yours, and it's not just the mask. So again maybe there was something going on with the ncks conversion?

slevis-lmwg commented 1 year ago

@adrifoster I do see, four posts up, that I wrote, "We think that his ncks attempt failed because he applied it to a "domain" file" ...followed by a suggestion for how to resolve it. So the "domain" shortcut probably did not work.

If things continue to fail, let's meet again and go through the full process as I recommend above. Let's plan on an hour.

adrifoster commented 1 year ago

It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for MASK_MESH works. I think it should, so that would be good to try.

Okay, I removed the info in user_nl_datm_streams and submitted that (so the only thing that changed was MASK_MESH). That failed for a different reason...

56: Which are NaNs =  F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F T F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F
56: NaN found in field Sl_lfrin at gridcell index           88
56: ERROR:  ERROR: One or more of the CTSM cap export_1D fields are NaN
87: # of NaNs =            3
87: Which are NaNs =  F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F T F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F T
87: T F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F
87: NaN found in field Sl_lfrin at gridcell index           60
87: NaN found in field Sl_lfrin at gridcell index          111
87: NaN found in field Sl_lfrin at gridcell index          112

Will try your next suggestion.

adrifoster commented 1 year ago

That still failed... thanks @slevis-lmwg for joining the meeting with @mvertens this afternoon. @ekluzek do you want to join as well?

slevis-lmwg commented 1 year ago

@adrifoster I will post my process here. Afterwards we can clean it up and post in #1919.

1) Starting with this file /glade/u/home/forrest/ppe_representativeness/output_v4/clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc, generate landmask.nc:

a) In matlab (% are comments)

rcent = ncread('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','rcent');
landmask = round(~isnan(rcent));  % "~" means "not"
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask',landmask);
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props',landmask);

OOPS... Somehow I had write permissions and was able to add the new variables directly to the clusters file. I hope I didn't mess anything up for anyone!

  b) After matlab
    mv clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc landmask.nc
    ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
    /glade/u/apps/ch/opt/esmf-netcdf/8.0.0/intel/19.0.5/bin/bing/Linux.intel.64.mpiuni.default/ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0

@adrifoster with the above, I am still attempting a shortcut: I generated the mesh file directly from the landmask file, in order to skip the mesh_mask_modifier step.

In the run where you point to the default atmosphere drivers (not the sparse version), set the three mesh paths to /glade/scratch/slevis/temp_work/sparse_grid/lnd_mesh.nc
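
For example, from the case directory (a sketch; these are the three mesh XML variables named earlier in this thread):

    ./xmlchange ATM_DOMAIN_MESH=/glade/scratch/slevis/temp_work/sparse_grid/lnd_mesh.nc
    ./xmlchange LND_DOMAIN_MESH=/glade/scratch/slevis/temp_work/sparse_grid/lnd_mesh.nc
    ./xmlchange MASK_MESH=/glade/scratch/slevis/temp_work/sparse_grid/lnd_mesh.nc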

@adrifoster says that the above seems to have worked, so we will next try to apply the sparse grid on the datm data, as well.

slevis-lmwg commented 1 year ago

From discussing with @olyson: The 1D datm domain file came from combining the 2D gswp3 data and the dense400 mask. So... in matlab I will take the 1D mask from the domain file

/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc

and put it as a 2D mask in the 2D gswp3 file

/glade/p/cgd/tss/CTSM_datm_forcing_data/atm_forcing.datm7.GSWP3.0.5d.v1.1.c181207/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc

That will be our starting file for the rest of the steps.

  1. In matlab:

    longxy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LONGXY');
    latixy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LATIXY');
    lon = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
    lat = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');
    mask = zeros(720,360);
    for cell = 1:400   % mark each of the 400 sparse-grid cells on the 0.5-degree grid
      [i,j] = find(latixy==lat(cell) & longxy==lon(cell));
      mask(i,j) = 1;
    end

    In a copy of the atmosphere file, so as to avoid overwriting the original, still in matlab:

    nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask','Dimensions',{'lon',720,'lat',360})
    ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask',mask);
    nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props','Dimensions',{'lon',720,'lat',360})
    ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props',mask);
  2. After matlab

    mv clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc landmask.nc
    ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
    /glade/u/apps/ch/opt/esmf-netcdf/8.0.0/intel/19.0.5/bin/bing/Linux.intel.64.mpiuni.default/ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0

    @adrifoster this mesh, intended for datm, is located here: /glade/scratch/slevis/temp_work/sparse_grid/datm_mesh You already have the other one, but I should mention that I moved it here: /glade/scratch/slevis/temp_work/sparse_grid/land_mesh

slevis-lmwg commented 1 year ago

@adrifoster I hope the run still works when you now point to the dense400 datm files and the datm mesh that I generated (previous post).

adrifoster commented 1 year ago

Unfortunately that did not work.

see logs /glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/run

adrifoster commented 1 year ago

case is /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest

slevis-lmwg commented 1 year ago

Thoughts: 1) I think I see the problem:

slevis-lmwg commented 1 year ago

@adrifoster I will wait to hear whether (2) works and, if so, whether it's an acceptable solution before I start on (3).

adrifoster commented 1 year ago

Okay thanks! I am testing now whether adding just:

CLMGSWP3v1.TPQW:mapalgo = nn
CLMGSWP3v1.TPQW:meshfile =/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc

CLMGSWP3v1.Solar:mapalgo = nn
CLMGSWP3v1.Solar:meshfile =/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc

CLMGSWP3v1.Precip:mapalgo = nn
CLMGSWP3v1.Precip:meshfile =/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc

to user_nl_datm_streams will work

adrifoster commented 1 year ago

This seems to be working! Though it is quite slow... I'm not sure why

slevis-lmwg commented 1 year ago

Hmm, does it seem slower than not using the datm_mesh file?

ekluzek commented 1 year ago

Yes, the intention is for the mesh file to make it faster, because it means less reading and no interpolation. If it's slower, that defeats the purpose of this part.

adrifoster commented 1 year ago

Actually, hold on... I think I forgot to turn debug off.

adrifoster commented 1 year ago

Okay, so it didn't really seem to change the timing much. In fact, the one that used the mesh file for the DATM data was a bit slower:

Timing File for Case WITH DATM Mesh:

cat cesm_timing.ctsm51FATES_SP_OAAT_Control_2000_nuopctest.3834956.chadmin1.ib0.cheyenne.ucar.edu.231012-162352
---------------- TIMING PROFILE ---------------------
  Case        : ctsm51FATES_SP_OAAT_Control_2000_nuopctest
  LID         : 3834956.chadmin1.ib0.cheyenne.ucar.edu.231012-162352
  Machine     : cheyenne
  Caseroot    : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest
  Timeroot    : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/Tools
  User        : afoster
  Curr Date   : Thu Oct 12 17:02:58 2023
  Driver      : CMEPS
  grid        : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
  compset     : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
  run type    : startup, continue_run = TRUE (inittype = FALSE)
  stop option : nyears, stop_n = 10
  run length  : 3650 days (3650.0 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        72          36       72     x 1       1      (1     ) 
  atm = datm       36          0        36     x 1       1      (1     ) 
  lnd = clm        72          36       72     x 1       1      (1     ) 
  ice = sice       72          36       72     x 1       1      (1     ) 
  ocn = socn       72          36       72     x 1       1      (1     ) 
  rof = srof       72          36       72     x 1       1      (1     ) 
  glc = sglc       72          36       72     x 1       1      (1     ) 
  wav = swav       72          36       72     x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 

  total pes active           : 108 
  mpi tasks per node         : 36 
  pe count for cost estimate : 108 

  Overall Metrics: 
    Model Cost:               7.01   pe-hrs/simulated_year 
    Model Throughput:       369.95   simulated_years/day 

    Init Time   :       6.237 seconds 
    Run Time    :    2335.434 seconds        0.640 seconds/day 
    Final Time  :       1.449 seconds 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:    2335.434 seconds        0.640 seconds/mday       369.95 myears/wday 
    CPL Run Time:     282.852 seconds        0.077 seconds/mday      3054.60 myears/wday 
    ATM Run Time:    1108.467 seconds        0.304 seconds/mday       779.45 myears/wday 
    LND Run Time:    1308.653 seconds        0.359 seconds/mday       660.22 myears/wday 
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ROF Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:   1110.305 seconds        0.304 seconds/mday       778.16 myears/wday 
   NOTE: min:max driver timers (seconds/day):   
                            CPL (pes 36 to 107) 
                                                ATM (pes 0 to 35) 
                                                                                      LND (pes 36 to 107) 
                                                                                      ICE (pes 36 to 107) 
                                                                                      OCN (pes 36 to 107) 
                                                                                      ROF (pes 36 to 107) 
                                                                                      GLC (pes 36 to 107) 
                                                                                      WAV (pes 36 to 107) 
                                                ESP (pes 0 to 0) 

Timing file for case WITHOUT DATM mesh:

cat cesm_timing.ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim.3833019.chadmin1.ib0.cheyenne.ucar.edu.231012-153640
---------------- TIMING PROFILE ---------------------
  Case        : ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim
  LID         : 3833019.chadmin1.ib0.cheyenne.ucar.edu.231012-153640
  Machine     : cheyenne
  Caseroot    : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim
  Timeroot    : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim/Tools
  User        : afoster
  Curr Date   : Thu Oct 12 16:14:10 2023
  Driver      : CMEPS
  grid        : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
  compset     : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
  run type    : startup, continue_run = TRUE (inittype = FALSE)
  stop option : nyears, stop_n = 10
  run length  : 3650 days (3650.0 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        72          36       72     x 1       1      (1     ) 
  atm = datm       36          0        36     x 1       1      (1     ) 
  lnd = clm        72          36       72     x 1       1      (1     ) 
  ice = sice       72          36       72     x 1       1      (1     ) 
  ocn = socn       72          36       72     x 1       1      (1     ) 
  rof = srof       72          36       72     x 1       1      (1     ) 
  glc = sglc       72          36       72     x 1       1      (1     ) 
  wav = swav       72          36       72     x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 

  total pes active           : 108 
  mpi tasks per node         : 36 
  pe count for cost estimate : 108 

  Overall Metrics: 
    Model Cost:               6.72   pe-hrs/simulated_year 
    Model Throughput:       385.70   simulated_years/day 

    Init Time   :       5.998 seconds 
    Run Time    :    2240.084 seconds        0.614 seconds/day 
    Final Time  :       1.129 seconds 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:    2240.084 seconds        0.614 seconds/mday       385.70 myears/wday 
    CPL Run Time:     281.403 seconds        0.077 seconds/mday      3070.33 myears/wday 
    ATM Run Time:    1009.835 seconds        0.277 seconds/mday       855.59 myears/wday 
    LND Run Time:    1311.185 seconds        0.359 seconds/mday       658.95 myears/wday 
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ROF Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:   1044.302 seconds        0.286 seconds/mday       827.35 myears/wday 
   NOTE: min:max driver timers (seconds/day):   
                            CPL (pes 36 to 107) 
                                                ATM (pes 0 to 35) 
                                                                                      LND (pes 36 to 107) 
                                                                                      ICE (pes 36 to 107) 
                                                                                      OCN (pes 36 to 107) 
                                                                                      ROF (pes 36 to 107) 
                                                                                      GLC (pes 36 to 107) 
                                                                                      WAV (pes 36 to 107) 
                                                ESP (pes 0 to 0) 
mvertens commented 1 year ago

@adrifoster - what kind of timings were you getting with MCT? @jedwards4b - why are we seeing about 385 ypd when the atm and land are running concurrently? I'm not sure where the cpl comm time is coming in. Can you have a look at this?

adrifoster commented 1 year ago

At @ekluzek's suggestion I am running a set of tests with MCT vs. NUOPC, full vs. sparse grid, and full vs. subset DATM to see the differences in timing. Please hold for those results.

adrifoster commented 1 year ago

In the meantime, here is a timing file I have from an older MCT run:

---------------- TIMING PROFILE ---------------------
  Case        : ctsm51FATES_PPE_Control_2000
  LID         : 2267387.chadmin1.ib0.cheyenne.ucar.edu.230725-150029
  Machine     : cheyenne
  Caseroot    : /glade/work/afoster/CLM_PPE_FATES/ctsm51FATES_PPE_Control_2000
  Timeroot    : /glade/work/afoster/CLM_PPE_FATES/ctsm51FATES_PPE_Control_2000/Tools
  User        : afoster
  Curr Date   : Tue Jul 25 15:30:36 2023
  Driver      : CPL7
  grid        : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
  compset     : 2000_DATM%CRUv7_CLM51%FATES-SP_SICE_SOCN_SROF_SGLC_SWAV_SIAC_SESP
  run type    : startup, continue_run = FALSE (inittype = TRUE)
  stop option : nyears, stop_n = 10
  run length  : 3650 days (3649.9791666666665 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        72          36       72     x 1       1      (1     ) 
  atm = datm       36          0        36     x 1       1      (1     ) 
  lnd = clm        72          36       72     x 1       1      (1     ) 
  ice = sice       72          36       72     x 1       1      (1     ) 
  ocn = socn       72          36       72     x 1       1      (1     ) 
  rof = srof       72          36       72     x 1       1      (1     ) 
  glc = sglc       72          36       72     x 1       1      (1     ) 
  wav = swav       72          36       72     x 1       1      (1     ) 
  iac = siac       1           0        1      x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 

  total pes active           : 108 
  mpi tasks per node         : 36 
  pe count for cost estimate : 108 

  Overall Metrics: 
    Model Cost:               5.29   pe-hrs/simulated_year 
    Model Throughput:       489.81   simulated_years/day 

    Init Time   :      37.219 seconds 
    Run Time    :    1763.942 seconds        0.483 seconds/day 
    Final Time  :       0.001 seconds 

    Actual Ocn Init Wait Time     :       0.000 seconds 
    Estimated Ocn Init Run Time   :       0.000 seconds 
    Estimated Run Time Correction :       0.000 seconds 
      (This correction has been applied to the ocean and total run times) 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:    1763.942 seconds        0.483 seconds/mday       489.81 myears/wday 
    CPL Run Time:      53.851 seconds        0.015 seconds/mday     16044.27 myears/wday 
    ATM Run Time:     511.348 seconds        0.140 seconds/mday      1689.65 myears/wday 
    LND Run Time:    1362.955 seconds        0.373 seconds/mday       633.92 myears/wday 
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ROF Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:   1248.110 seconds        0.342 seconds/mday       692.25 myears/wday 
   NOTE: min:max driver timers (seconds/day):   
                            CPL (pes 36 to 107) 
                                                ATM (pes 0 to 35) 
                                                                                      LND (pes 36 to 107) 
                                                                                      ICE (pes 36 to 107) 
                                                                                      OCN (pes 36 to 107) 
                                                                                      ROF (pes 36 to 107) 
                                                                                      GLC (pes 36 to 107) 
                                                                                      WAV (pes 36 to 107) 
                                                IAC (pes 0 to 0) 
                                                ESP (pes 0 to 0) 

  CPL:CLOCK_ADVANCE           0.005:   0.006 
  CPL:LNDPREP                 0.001:   0.014 
  CPL:C2L                        <---->                                                 0.003:   0.012 
  CPL:LND_RUN                                                                           0.179:   0.373 
  CPL:L2C                                                                               7.073: 559.467 
  CPL:LNDPOST                 0.000:   0.000 
  CPL:FRACSET                 0.000:   0.001 
  CPL:ATM_RUN                                     0.136:   0.140 
  CPL:A2C                        <---->           0.089:   0.342 
  CPL:ATMPOST                 0.000:   0.000 
  CPL:RESTART                 0.000:   0.000 
  CPL:HISTORY                 0.000:   0.000 
  CPL:TSTAMP_WRITE            0.000:   0.000 
  CPL:TPROF_WRITE             0.000:   0.000 
  CPL:RUN_LOOP_BSTOP          0.000:   0.000 

More info on coupler timing:

  CPL:LNDPREP                 0.001:   0.014 
  CPL:lndprep_atm2lnd         0.000:   0.013 
  CPL:lndprep_mrgx2l          0.000:   0.000 

  CPL:LNDPOST                 0.000:   0.000 

  CPL:ATMPOST                 0.000:   0.000 
ekluzek commented 1 year ago

So what I see from the above is that the cost and throughput overall improve by just 4%. That comes from about a 10% improvement in DATM.

Since the difference in the DATM gridpoints should be about a quarter of a million (the half-degree grid is 720x360 = 259,200 points) vs 400, I would expect a MUCH larger difference for DATM. So I think we must NOT really be coupling with the 400-point grid but with a half-degree grid. It's faster because there are fewer points to read and interpolate from. But it must be taking the 400 points and doing nearest-neighbor interpolation to the f19 land grid, which it then gives to CLM. What we want to pass between DATM and CLM in the CPL is just the 400 points. If we can get it set up that way, that should make a marked difference in timing.

adrifoster commented 1 year ago

Since the difference in the DATM gridpoints should be about a quarter of a million vs 400, I would expect a MUCH larger difference for DATM. So I think we must NOT really be coupling with the 400-point grid but with a half-degree grid.

Do you mean the case where we use the full DATM files but the mesh is the 400-point lnd_mesh?

adrifoster commented 1 year ago

It's actually faster for the one where we don't use the updated DATM mesh. Or am I reading that wrong?

ekluzek commented 1 year ago

The top case is the one with the 400-point mesh for DATM and CLM, right? And the second one is 400 points for CLM but the full grid for DATM. And with that, I had them exactly backwards. Thanks for the correction!

The top case has...

Model Cost:               7.01   pe-hrs/simulated_year 
Model Throughput:       369.95   simulated_years/day
ATM Run Time:    1108.467 seconds        0.304 seconds/mday       779.45 myears/wday

And the second is:

Overall Metrics:
Model Cost:               6.72   pe-hrs/simulated_year
Model Throughput:       385.70   simulated_years/day
ATM Run Time:    1009.835 seconds        0.277 seconds/mday       855.59 myears/wday

So the top is both more expensive and at slower throughput, overall and even for DATM alone.

I totally don't get why the top is slower than the bottom. You might run the same thing a few times to see what the machine spread looks like. It could be that the difference is small enough that machine randomness happened to come into play here. Double-check your cases to make sure, and then maybe we should all look at them.

(Moral: always show your work...)

ekluzek commented 1 year ago

On having the coupler only send the 400 points rather than the full f19 grid -- we'd need to set it up with a mesh that only includes the 400 points. That's a bit different from how we were thinking about this here....

adrifoster commented 1 year ago

Okay so now my runs with NUOPC on the sparse grid are failing again...

Is this because I'm using an older CTSM tag?

/glade/scratch/afoster/nuopc_grid_fullDATM/run and /glade/scratch/afoster/nuopc_grid_400DATM/run

jedwards4b commented 1 year ago

@adrifoster I'm surprised that any version works with that mesh file - it's netCDF-4 and needs to be converted to a CESM-compatible format. I've done that in /glade/scratch/jedwards/PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.20231013_131729_eze25v/run/lnd_mesh.nc

adrifoster commented 1 year ago

Oh that's weird, they did work before... trying with your update, thank you!

jedwards4b commented 1 year ago

@adrifoster I am running for 1 year and see the run rate at the beginning of the run at rate = 450.37 ypd, but there is a steady decline over the course of the run and it's only rate = 52.47 ypd at the end - have you had that same experience?

adrifoster commented 1 year ago

I have not noticed that. Are you running with fates_sp = .true. in the user_nl_clm file?

jedwards4b commented 1 year ago

That made a huge difference - thank you!

jedwards4b commented 1 year ago

My two cases are

/glade/scratch/jedwards/PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.20231013_131729_eze25v
/glade/scratch/jedwards/PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.20231013_133543_9gysi0

The first one (eze25v) is using the new lnd_mesh.nc files. The other (9gysi0) is using the default mesh files. Otherwise the two cases are the same. Performance is markedly different:


Model Cost:               7.21   pe-hrs/simulated_year 
Model Throughput:       599.37   simulated_years/day 

vs

Model Cost:              38.53   pe-hrs/simulated_year 
Model Throughput:       156.98   simulated_years/day
adrifoster commented 1 year ago

Also, sorry, my other case (nuopc_grid_400DATM) failed, I think because I need to do the same thing you did to the lnd_mesh.nc file to the DATM mesh I reference in my user_nl_datm_streams file.

What did you do to make the file compatible?

ekluzek commented 1 year ago

@adrifoster a simple way to convert files away from NetCDF-4 is to use nccopy. I usually convert to CDF5 format.

nccopy -k cdf5 <input_file> <output_file>

You can also check a NetCDF file's format with ncdump's "-k" option:

ncdump -k <file>

adrifoster commented 1 year ago

Okay here are the results:

driver   land grid   DATM grid   cost (pe-hrs/simulated_year)   throughput (simulated_years/day)
mct      full        full DATM   53.73                          659.32
mct      grid        full DATM    6.2                           418.35
mct      grid        grid DATM    3.56                          676.56
nuopc    full        full DATM   67.502                         525.9
nuopc    grid        full DATM    6.61                          391.9
nuopc    grid        grid DATM    6.73                          385.34

so the gridded DATM in MCT seems to cut the run cost almost in half

But for NUOPC we aren't seeing that yet, likely because, as @ekluzek said, we are still reading in the full DATM data.

slevis-lmwg commented 1 year ago

I modified the mesh file from a global one (259200 elements, with mask = 1 at the 400 cells of the sparse grid) to a 400-element vector, the same as the domain and datm files that @adrifoster was using in the MCT version of this work.

In matlab, I read variables from the global datm_mesh/lnd_mesh.nc file and the 400-element domain file...

elementMask = ncread('lnd_mesh.nc','elementMask');
elementArea = ncread('lnd_mesh.nc','elementArea');
centerCoords = ncread('lnd_mesh.nc','centerCoords');
numElementConn = ncread('lnd_mesh.nc','numElementConn');
elementConn = ncread('lnd_mesh.nc','elementConn');
nodeCoords = ncread('lnd_mesh.nc','nodeCoords');
xc = ncread('/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
yc = ncread('/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');

for cell = 1:400
  [i] = find(centerCoords(1,:) == xc(cell) & centerCoords(2,:) == yc(cell));
  centerCoords_new(:,cell) = centerCoords(:,i);
  elementArea_new(cell,:) = elementArea(i,:);
  elementConn_new(:,cell) = elementConn(:,i);
  elementMask_new(cell,:) = elementMask(i,:);
  nodeCoords_new(:,cell) = nodeCoords(:,i);
  numElementConn_new(cell) = numElementConn(i);
end

nccreate('lnd_mesh_400.nc','centerCoords','Dimensions',{'coordDim', 2, 'elementCount', 400})          
ncwrite('lnd_mesh_400.nc','centerCoords',centerCoords_new)
nccreate('lnd_mesh_400.nc','elementArea','Dimensions',{'elementCount', 400})        
ncwrite('lnd_mesh_400.nc','elementArea',elementArea_new);    
nccreate('lnd_mesh_400.nc','elementConn','Dimensions',{'maxNodePElement', 4, 'elementCount', 400})        
ncwrite('lnd_mesh_400.nc','elementConn',elementConn_new);    
nccreate('lnd_mesh_400.nc','elementMask','Dimensions',{'elementCount', 400})        
ncwrite('lnd_mesh_400.nc','elementMask',elementMask_new);    
nccreate('lnd_mesh_400.nc','nodeCoords','Dimensions',{'coordDim', 2, 'nodeCount', 400})        
ncwrite('lnd_mesh_400.nc','nodeCoords',nodeCoords_new);    
nccreate('lnd_mesh_400.nc','numElementConn','Dimensions',{'elementCount', 400})          
ncwrite('lnd_mesh_400.nc','numElementConn',numElementConn_new);      

The new file is /glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh_400.nc. @adrifoster please try this new file. This reaches my limit for now; if this one does not work, I will wait until Wednesday's meeting before trying anything else.

I remembered to convert to cdf5 after I read the following comments...
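
For reference, that conversion goes along these lines (a sketch; the output filename is just illustrative):

    nccopy -k cdf5 lnd_mesh_400.nc lnd_mesh_400_cdf5.nc
    ncdump -k lnd_mesh_400_cdf5.nc    # should no longer report "netCDF-4 classic model"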

jedwards4b commented 1 year ago

@adrifoster you will need to convert this before using it in CESM:

ncdump -k /glade/cheyenne/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh_400.nc
netCDF-4 classic model

@slevis-lmwg I'm pretty sure that there is an option in matlab to create cdf5 (NETCDF3_64bit_data) files.

ekluzek commented 1 year ago

@adrifoster just a suggestion. For testing performance there is a PFS test that we use. You might set up one of those and compare it to your case when doing similar comparisons.

The one thing that test does that I know about is turn off history and restart writing (by only running 20 days). This helps reduce machine randomness, which can be high for I/O. The PFS test might do other things as well.

Then of course all the simulations would need to be run over again. I expect the end result to look similar, but good to do...
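
For reference, a PFS performance test like @jedwards4b's cases above can be created with something along these lines (a sketch, run from cime/scripts in a CTSM checkout):

    ./create_test PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel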