GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

Question: Does MAPL ExtData read all external files on the first node? #1981

Open · 1Dandan opened 1 year ago

1Dandan commented 1 year ago

Description

I am a GCHP user. GCHP relies on MAPL ExtData to read external files, including offline meteorology and emissions. When running GCHP simulations on NASA Pleiades, I have encountered significant slowness (much lower throughput) several times. Because the slowness happens seemingly at random, and I am running a global model of comparable complexity to other runs, my strongest suspicion is I/O. The NASA support team says that writing goes to the buffer cache first and is probably not the limitation, while file reading could be the source of the slowness. They suggested copying some external files to the /tmp directory of the compute node to speed up reading. Since the PBS script runs on the first node before the mpiexec command, if MAPL ExtData reads files only on the first node, we could simply add a line to copy some large HDF5 files to /tmp before running mpiexec to speed things up.
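For concreteness, the staging step the support team suggested could look like the snippet below in the PBS script, before `mpiexec`. All paths and the file pattern here are illustrative assumptions, not actual GCHP paths:

```shell
#!/bin/bash
# Hypothetical PBS prologue sketch: copy large inputs to node-local /tmp
# before launching MPI ranks. Runs on the first node only.
STAGE_DIR="/tmp/extdata_stage"
mkdir -p "$STAGE_DIR"

# Stage the largest HDF5 inputs (the source pattern is an assumption).
for f in "$PBS_O_WORKDIR"/ExtData/*.h5; do
  [ -e "$f" ] && cp "$f" "$STAGE_DIR/"
done

# ... then point the run configuration at $STAGE_DIR and call mpiexec.
```

Note this only helps if the reads actually happen on that node, which is the question below.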

My question is: Does MAPL ExtData read all external files only on the first node?

tclune commented 1 year ago

The answer is a bit more nuanced, but the short answer is "no".

First, in the I/O configuration used in GCHP, each process reads the portion of the global array that is needed on the local process. Here "local" is in terms of how the ExtData component decides to distribute the array across processes. This is generally different from the distribution of the actual arrays in the model, since the grids are different. But in any event, every process reads.
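To make the "local portion" idea concrete, here is a minimal sketch of a block decomposition along one dimension, determining which slice of a global array a given rank would read. The decomposition logic is illustrative only, not MAPL's actual algorithm:

```python
def local_slab(global_len, nranks, rank):
    """Return (start, count) of this rank's block of a 1-D dimension.

    Distributes `global_len` points over `nranks` ranks as evenly as
    possible; the first `global_len % nranks` ranks get one extra point.
    """
    base, extra = divmod(global_len, nranks)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return start, count

# 10 points over 4 ranks -> (0, 3), (3, 3), (6, 2), (8, 2)
print([local_slab(10, 4, r) for r in range(4)])
```

Each rank would then issue a hyperslab read for only its `(start, count)` range of the file variable, so no single node holds or reads the whole array.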

Note also that NetCDF chunking plays a role here. As each process attempts to read the "local" portion of the file data, it must actually read any netcdf chunks that overlap. So if the file just has 1 chunk, every process reads the entire array. For GEOS we try to chunk our input files, but I have no idea how consistently this is done. (@bena-nasa can you summarize?)
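The chunk-overlap effect above can be sketched in a few lines: given a rank's requested slab along one dimension, which chunks must be read? This is a simplified one-dimensional illustration, not NetCDF library code:

```python
def chunks_overlapping(start, count, chunk_size):
    """Indices of the fixed-size chunks that a [start, start+count) slab touches."""
    first = start // chunk_size
    last = (start + count - 1) // chunk_size
    return list(range(first, last + 1))

# With one giant chunk covering the whole 1000-point dimension,
# a rank wanting only points 800..899 still reads chunk 0, i.e. everything:
print(chunks_overlapping(800, 100, 1000))   # -> [0]

# With 100-point chunks, the same request touches exactly one chunk:
print(chunks_overlapping(800, 100, 100))    # -> [8]
```

So choosing chunk shapes that roughly match the per-process slabs keeps each rank's read volume close to its actual need.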

We have no doubt that our I/O strategies are not optimal for the Lustre file system at NAS. In particular, we know output is much slower in GEOS when we run at NAS, but we think what we have is close to optimal at the NCCS. Is it possible to tune things (outside the model) to get better I/O performance at NAS? Maybe? Can we redesign our I/O layer to get more robust performance at NAS? Probably?

If you know any Lustre experts who might be willing to brainstorm with us on the topic, we would be interested. We have some big projects (1.5 km global resolution) in the pipeline that will be running at NAS in the near future.