ndkeen opened this issue 3 years ago
pnetcdf on Theta + latest maint-1.0: cray-parallel-netcdf/1.12.0.1
> /opt/cray/pe/parallel-netcdf/1.12.0.1/bin/pnetcdf-config --all
This PnetCDF 1.12.0 was built with the following features:
--has-c++ -> yes
--has-fortran -> yes
--netcdf4 -> disabled
--adios -> disabled
--relax-coord-bound -> enabled
--in-place-swap -> auto
--erange-fill -> enabled
--subfiling -> enabled
--large-single-req -> disabled
--null-byte-header-padding -> disabled
--burst-buffering -> enabled
--profiling -> disabled
--thread-safe -> disabled
--debug -> disabled
This PnetCDF 1.12.0 was built using the following compilers and flags:
--cc -> cc
--cxx -> CC
--f77 -> ftn
--fc -> ftn
--cppflags ->
--cflags ->
--cxxflags ->
--fflags ->
--fcflags ->
--ldflags ->
--libs ->
This PnetCDF 1.12.0 has been installed under the following directories:
--prefix -> /opt/cray/pe/parallel-netcdf/1.12.0.1/INTEL/19.1
--includedir -> /opt/cray/pe/parallel-netcdf/1.12.0.1/include
--libdir -> /opt/cray/pe/parallel-netcdf/1.12.0.1/INTEL/19.1/lib
Additional information:
--version -> PnetCDF 1.12.0
--release-date -> September 30, 2019
--config-date -> Tue May 19 00:29:21 CDT 2020
pnetcdf on Chrysalis + latest maint-1.0: parallel-netcdf/1.11.0-b74wv4m
$ /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/parallel-netcdf-1.11.0-b74wv4m/bin/pnetcdf-config --all
This PnetCDF 1.11.0 was built with the following features:
--has-c++ -> yes
--has-fortran -> yes
--netcdf4 -> disabled
--relax-coord-bound -> enabled
--in-place-swap -> auto
--erange-fill -> enabled
--subfiling -> disabled
--large-single-req -> disabled
--null-byte-header-padding -> disabled
--burst-buffering -> disabled
--profiling -> disabled
--thread-safe -> disabled
--debug -> disabled
This PnetCDF 1.11.0 was built using the following compilers and flags:
--cc -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/intel-mpi-2019.9.304-tkzvizk/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpiicc
--cxx -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/intel-mpi-2019.9.304-tkzvizk/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpiicpc
--f77 -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/intel-mpi-2019.9.304-tkzvizk/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpiifort
--fc -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/intel-mpi-2019.9.304-tkzvizk/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpiifort
--cppflags ->
--cflags -> -fPIC
--cxxflags -> -fPIC
--fflags -> -fPIC
--fcflags -> -fPIC
--ldflags ->
--libs ->
This PnetCDF 1.11.0 has been installed under the following directories:
--prefix -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/parallel-netcdf-1.11.0-b74wv4m
--includedir -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/parallel-netcdf-1.11.0-b74wv4m/include
--libdir -> /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/intel-20.0.4/parallel-netcdf-1.11.0-b74wv4m/lib
Additional information:
--version -> PnetCDF 1.11.0
--release-date -> 19 Dec 2018
--config-date -> Tue Jan 5 05:57:29 CST 2021
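(Since the --all output is long, individual settings from the listing above can also be queried one at a time to compare the two installs quickly; a small sketch, assuming pnetcdf-config accepts the same option names it prints with --all:)
> pnetcdf-config --version
> pnetcdf-config --subfiling
> pnetcdf-config --burst-buffering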
There is a more recent version: module load parallel-netcdf/1.12.1-kstkfoc.
Parallel-netcdf can also be configured with:
--enable-large-single-req   Enable large (> 2 GiB) single request in individual MPI-IO calls. Note some MPI-IO libraries may not support this. [default: disabled]
--enable-subfiling          Enable subfiling support. [default: disabled]
--disable-erange-fill       Disable use of fill value when out-of-range type conversion causes NC_ERANGE error. [default: enabled]
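For reference, a minimal sketch of how these flags would be passed when building PnetCDF from source; the install prefix and the Intel MPI compiler wrappers below are illustrative assumptions, not the recipe used for either install above:
> ./configure --prefix=$HOME/pnetcdf \
      --enable-large-single-req --enable-subfiling \
      MPICC=mpiicc MPICXX=mpiicpc MPIF77=mpiifort MPIF90=mpiifort
> make -j8 && make install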
@ndk I would also recommend trying out pnetcdf 1.12.1 (parallel-netcdf/1.12.1-kstkfoc) to see if it works for the case above.
I tried using parallel-netcdf-1.12.1-kstkfoc instead of the default. It did not fix the issue -- still zeros in the file.
casedir:
/lcrc/group/e3sm/ac.ndkeen/scratch/chrys/maint10-mar24/v1hires.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.n058a.prod-unc06.n058a.pnet12
Thanks @ndk, I am able to recreate the issue using the run script above (slightly modified; the modifications are not relevant to the issue). I am trying out some experiments and will keep the issue updated.
@ndk, meanwhile, was this the smallest PE layout with which you could recreate the issue?
@jayeshkrishna yes, I noted in the original comment that there was also an example of this behavior with a 58-node layout. Since this issue is dependent on PE layout, here are the layouts that have failed and the ones that have worked:
These all show the same issue:
#209 nodes 64x1
MAX_MPITASKS_PER_NODE=64
MAX_TASKS_PER_NODE=128
NTASKS_ATM=10816
ROOTPE_ATM=0
NTASKS_LND=1600
ROOTPE_LND=8192
NTASKS_ICE=9600
ROOTPE_ICE=0
NTASKS_OCN=2560
ROOTPE_OCN=10816
NTASKS_CPL=10816
ROOTPE_CPL=0
NTASKS_ROF=1024
ROOTPE_ROF=9792
NTHREADS=1
#109 nodes 64x1
MAX_MPITASKS_PER_NODE=64
MAX_TASKS_PER_NODE=128
NTASKS_ATM=5440
ROOTPE_ATM=0
NTASKS_LND=4672
ROOTPE_LND=0
NTASKS_ICE=5120
ROOTPE_ICE=0
NTASKS_OCN=1536
ROOTPE_OCN=5440
NTASKS_CPL=5440
ROOTPE_CPL=0
NTASKS_ROF=768
ROOTPE_ROF=4672
NTHREADS=1
#58 nodes 64x1
MAX_MPITASKS_PER_NODE=64
MAX_TASKS_PER_NODE=128
NTASKS_ATM=2752
ROOTPE_ATM=0
NTASKS_LND=1984
ROOTPE_LND=0
NTASKS_ICE=2560
ROOTPE_ICE=0
NTASKS_OCN=960
ROOTPE_OCN=2752
NTASKS_CPL=2752
ROOTPE_CPL=0
NTASKS_ROF=256
ROOTPE_ROF=0
NTHREADS=1
Whereas the following do not. These are all stacked layouts, which my experiments show perform very well. Note that I've tried many different layouts, but only for 5-day speed tests; these are the only ones I've run for at least a month (which unfortunately may be the only way to see the zeros-in-MPAS-file error).
#64 nodes stacked 64x1
MAX_MPITASKS_PER_NODE=64
MAX_TASKS_PER_NODE=128
NTASKS_ATM=4096
ROOTPE_ATM=0
NTASKS_LND=4096
ROOTPE_LND=0
NTASKS_ICE=4096
ROOTPE_ICE=0
NTASKS_OCN=4096
ROOTPE_OCN=0
NTASKS_CPL=4096
ROOTPE_CPL=0
NTASKS_ROF=4096
ROOTPE_ROF=0
NTHREADS=1
#128 nodes stacked 32x2
MAX_MPITASKS_PER_NODE=32
MAX_TASKS_PER_NODE=64
NTASKS_ATM=4096
ROOTPE_ATM=0
NTASKS_LND=4096
ROOTPE_LND=0
NTASKS_ICE=4096
ROOTPE_ICE=0
NTASKS_OCN=4096
ROOTPE_OCN=0
NTASKS_CPL=4096
ROOTPE_CPL=0
NTASKS_ROF=4096
ROOTPE_ROF=0
NTHREADS=2
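(For reference, a stacked layout like the 64-node one above can be applied to an existing case with CIME's xmlchange; a sketch run from the case directory, assuming CIME's usual variable names -- NTHRDS is the XML name for the thread count listed as NTHREADS above:)
> ./xmlchange MAX_MPITASKS_PER_NODE=64,MAX_TASKS_PER_NODE=128
> ./xmlchange NTASKS=4096,ROOTPE=0,NTHRDS=1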
Thanks @ndkeen
@ndkeen: Can you try the latest master of Scorpio and see if the issue persists?
I tried the case with the version of Scorpio on maint-1.0 (maint-1.0 has v1.0.1) and saw the zero values in the output (timeMonthly_avg_activeTracers_temperature and several other variables in mpaso.hist.am.timeSeriesStatsMonthly.*.nc had all zero values). However, the issue is not reproducible with the latest Scorpio master (and most likely v1.2.1 on E3SM master) + maint-1.0 (v1.0.0-266-g092ea1aa3 + scorpio-v1.2.1-11-g4a44ffc4). I tried both the 109-node and 58-node cases above, the ones that failed for you, with maint-1.0 + the latest Scorpio master and did not see any apparent issues (no zero values) with the data in the MPAS monthly output. The latest master of Scorpio (and Scorpio v1.2.1 on E3SM master) has several fixes for ultra-high-resolution simulations that might be related to this case.
To try the latest master of Scorpio:
> cd <E3SM_MAINT-1.0_SOURCE_DIR>
> cd externals/scorpio
> git fetch origin
> git checkout master
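(After switching the submodule, the case needs a clean rebuild so the updated Scorpio is actually picked up; a sketch of the usual CIME steps, run from the case directory:)
> ./case.build --clean-all
> ./case.build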
The successful cases on Chrysalis:
The 109 node case (maint-1.0 + Scorpio master : v1.0.0-266-g092ea1aa3 + scorpio-v1.2.1-11-g4a44ffc4): /lcrc/group/e3sm/jayesh/scratch/chrys/v1hires.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.n109a.prod-unc06g-nodbg-spio-master
The 58 node case (maint-1.0 + Scorpio master : v1.0.0-266-g092ea1aa3 + scorpio-v1.2.1-11-g4a44ffc4): /lcrc/group/e3sm/jayesh/scratch/chrys/v1hires.ne120np4_oRRS18to6v3_ICG.A_WCYCL1950S_CMIP6_HR.n109a.prod-unc06g-nodbg-spio-master-58nodesPE
I should have mentioned this earlier. For testing of standalone MPAS-Ocean, we have a test case where we configure timeSeriesStatsDaily to have the same output as the timeSeriesStatsMonthly that we use for debugging CF-compliant output. This approach would likely make your debugging easier here, too. Set the necessary namelist options to turn on timeSeriesStatsDaily and make an identical stream in streams.ocean with the same output but with Monthly --> Daily. I'm not an expert in how to alter the namelists and streams files in E3SM, so I'm hoping you can figure that part out.
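A rough sketch of what that might look like (the option and stream names here follow the usual MPAS analysis-member naming conventions and have not been verified against this case): in the MPAS-Ocean namelist, enable the daily analysis member,
config_AM_timeSeriesStatsDaily_enable = .true.
and then in streams.ocean copy the timeSeriesStatsMonthlyOutput stream block, rename it for the daily member, and change its output_interval to the daily value (00-00-01_00:00:00 in MPAS's interval notation).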
Thanks for the great idea, @xylar! @ndkeen, if you want to try this for your testing to allow for shorter tests (we could run 1 or 2 days instead of a month), I can help you set this up in E3SM. Let me know.
Jayesh: OK, that's good news that the latest Scorpio seems not to have the issue. We would need to discuss whether replacing this code in the middle of a simulation campaign is the right thing to do or not.
Xylar/Luke: Yes, that could help for future testing. We might need to decide what to do with this simulation.
I just made a PR to bring in a bugfix for the ROF restart names (which apparently only happens in certain situations and is fine otherwise... ?). With this fix and a change to PE layout (which is performing better), it looks like the simulation is OK.
Our v1 highres control production run (as well as the transient run) that we moved from Theta to Chrysalis was found to have zeros in the mpaso.hist.am.timeSeriesStatsMonthly.*.nc file. These runs use the maint-1.0 repo, and I've verified the issue is the same with a checkout from January 2021 as well as one from March 18th, 2021. The files written on Theta do not show the issue. After a fair amount of testing documented here: https://acme-climate.atlassian.net/wiki/spaces/SIM/pages/1025639219/Control+Run+HighRes+MIP+theta.20190910.branch+noCNT.A+WCYCL1950S+CMIP6+HR.ne120+oRRS18v3+ICG
I'm now able to repeat the issue given the script below.
The output for the job I ran with this script is here:
To verify whether an MPAS time-series file has zeros or not, I've found the following command useful (this output shows zeros):
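(The exact command and its output are not reproduced here. As an illustration only -- not necessarily the command used -- dumping one of the affected variables and looking at the tail of the output makes the zeros visible:)
> ncdump -v timeMonthly_avg_activeTracers_temperature mpaso.hist.am.timeSeriesStatsMonthly.<date>.nc | tail -n 20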
It may be a combination of the PE layout and the PIO settings:
I've also verified that a 58-node layout exhibits the same behavior (zeros in the MPAS file). Here is a link to a script similar to the one above, except it uses 58 nodes and is slower:
/lcrc/group/e3sm/ac.ndkeen/wacmy/maint10-mar18/cime/scripts/prod-unc06.n058a.csh
And when I comment out the 2 PIO settings, the same script works:
/lcrc/group/e3sm/ac.ndkeen/wacmy/maint10-mar18/cime/scripts/prod-unc06.n109.piodef.csh
It's possible these settings are involved, or, if the root cause is something like memory corruption, those 2 PIO settings may just be perturbing memory enough to cause different behavior. I'm trying to further narrow down the issue, but since reproducing it requires a full month of high-res simulation, each job takes several hours.