E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Cases using 128 vertical levels on frontier not BFB #3053

Open ndkeen opened 3 hours ago

ndkeen commented 3 hours ago

With an Oct 16th checkout, I was trying to run some performance tests on frontier. I noticed that every single case had different bfbhashes from the others. Trying to find a create_test reproducer led to a different issue: the standard tests we use all run with 72 vertical levels, but all cases of interest use 128 levels. It looks like cases using 72 levels are BFB, while those using 128 levels are not.
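For anyone wanting to compare hashes between two runs, here is a minimal sketch. It assumes the hash lines can be picked out of a log by the substring `bfbhash`; that marker, the sample line format in the comment, and the helper names `hash_lines` and `first_divergence` are illustrative assumptions, not something confirmed in this issue.

```python
def hash_lines(log_lines, marker="bfbhash"):
    """Keep only the lines that contain the marker, stripped of whitespace.

    The marker is an assumption about how hash lines appear in the log.
    """
    return [line.strip() for line in log_lines if marker in line]


def first_divergence(lines_a, lines_b, marker="bfbhash"):
    """Return the index of the first differing hash line, or None if all match."""
    a = hash_lines(lines_a, marker)
    b = hash_lines(lines_b, marker)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One run produced more hash lines than the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical log excerpts (made-up format and values, for illustration only).
run_a = ["init done", "bfbhash> 0 aaaa", "bfbhash> 1 bbbb"]
run_b = ["init done", "bfbhash> 0 aaaa", "bfbhash> 1 cccc"]
print(first_divergence(run_a, run_b))
```

Reporting the first index where the two runs differ, rather than just pass/fail, may help narrow down which step first loses BFB.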

These pass:

ERS.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu
ERS_P64.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu.scream-small_kernels
PEM_P64.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu.scream-small_kernels
PET.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu
PET_P32.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu.scream-small_kernels
REP_P16.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu
REP_P16x6.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu

And these fail:

REP_P16.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu.scream-L128
ERS_P1024.ne256pg2_ne256pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu.scream-small_kernels
PEM_P1024.ne256pg2_ne256pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu.scream-small_kernels

Note that @mahf708 found the location where we set the default number of levels to 128 for resolutions 256, 512, and 1024, while leaving it at 72 for the others. Therefore, to get 128 levels with ne30, I'm using the scream-L128 test modifier. This is not quite working with ne4 (which would make an even simpler reproducer). There is a SCREAM-HR, but I ran into runtime issues using it with ne30.
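To make the pattern explicit, here is a rough sketch that infers the number of vertical levels from a test name, using only the rule described above (ne256, ne512, and ne1024 default to 128; everything else stays at 72 unless the scream-L128 modifier is present). The function name `expected_levels` and the parsing of the dot-separated test name are assumptions for illustration, not how CIME or EAMxx actually decides this.

```python
import re

# Rule taken from this thread: these resolutions default to 128 levels.
HIRES_NE = {256, 512, 1024}


def expected_levels(test_name):
    """Guess the vertical level count implied by a dot-separated test name.

    Assumed layout: TEST.GRID.COMPSET.MACHINE_COMPILER[.TESTMOD...]
    """
    parts = test_name.split(".")
    grid = parts[1]
    testmods = parts[4:]  # anything after machine_compiler
    if any("L128" in mod for mod in testmods):
        return 128
    match = re.match(r"ne(\d+)", grid)
    if match and int(match.group(1)) in HIRES_NE:
        return 128
    return 72


SUFFIX = "F2010-SCREAMv1.frontier-scream-gpu_craygnuamdgpu"
passing = [
    f"ERS.ne30pg2_ne30pg2.{SUFFIX}",
    f"PEM_P64.ne30pg2_ne30pg2.{SUFFIX}.scream-small_kernels",
    f"REP_P16.ne30pg2_ne30pg2.{SUFFIX}",
]
failing = [
    f"REP_P16.ne30pg2_ne30pg2.{SUFFIX}.scream-L128",
    f"ERS_P1024.ne256pg2_ne256pg2.{SUFFIX}.scream-small_kernels",
    f"PEM_P1024.ne256pg2_ne256pg2.{SUFFIX}.scream-small_kernels",
]

for name in passing + failing:
    print(expected_levels(name), name)
```

Under this rule, every passing test listed above maps to 72 levels and every failing test maps to 128, which is consistent with the level count being the distinguishing factor.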

whannah1 commented 3 hours ago

The test mod for ne4 is probably not working because the initial condition file is not being updated to reflect the change in vertical levels.

mahf708 commented 2 hours ago

> The test mod for ne4 is probably not working because the initial condition file is not being updated to reflect the change in vertical levels.

@whannah1 there's no IC for ne4, that's why. Would you be kind enough to make one? I could, but I'd rather let someone who knows this stuff handle it. There's a vert profile in there, so you could simply ncremap the ne4 L72 file to ne4 L128.

whannah1 commented 2 hours ago

@mahf708 sure, I can make one.

bartgol commented 2 hours ago

Why do we need to run 128 levels at ne4?

whannah1 commented 1 hour ago

We don't really need this, other than that it might help to figure out the non-BFB test results that Noel mentioned. Also, generating an ne4 L128 IC file is trivial.

mahf708 commented 1 hour ago

Looks like we do have L128 ne4np4 files but somehow I missed them (or maybe Walter added them?)

-rwxr-xr-x 1          20457 E3SM  20M May 18  2022 screami_ne4np4L128_20220512.nc
-rwxr-xr-x 1          20457 E3SM  20M May 24  2022 screami_ne4np4L128_20220524.nc
-rwxr-xr-x 1          20457 E3SM  11M May 24  2022 screami_ne4np4L72_20220524.nc
-rw-rw-r-- 1 ac.brhillman   E3SM 6.5M Jul  1  2022 screami_ne4np4L72_20220701.nc
-rw-rw-r-- 1 ac.brhillman   E3SM 5.1M Aug 18  2022 screami_ne4np4L72_20220818.nc
-rw-rw-r-- 1 ac.brhillman   E3SM 5.1M Aug 23  2022 screami_ne4np4L72_20220823.nc
-rw-rw-r-- 1 ac.jgfouca     E3SM 5.1M Jul 13  2023 screami_ne4np4L72_20230712.nc

The next step is to edit these XML entries so they point to one of those L128 files for ne4 grids:

https://github.com/E3SM-Project/scream/blob/cb432e84ccf6cb9231328f9b26fcbe08b399f696/components/eamxx/cime_config/namelist_defaults_scream.xml#L489-L496

whannah1 commented 48 minutes ago

@mahf708 I don't think I created or added those, but I just created a new one and uploaded it to chrysalis. I was going to run a test on Frontier.