E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

On PM-CPU, WaveWatchIII text input file takes too long to read in. #6670

Open erinethomas opened 1 week ago

erinethomas commented 1 week ago

WW3 requires two large text files (the unresolved obstacles files) to be read in during the wave model initialization. These files are stored on global CFS with the rest of the E3SM data:

- /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in (size = 348M)
- /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in (size = 626M)

These files take too long to read in; short tests often fail because the wall-clock limit runs out during initialization. For example, I have run the following tests:

1. A fully coupled E3SMv3+WW3 test on 8 nodes: the model initialization time is 2722 seconds (~45 minutes).
2. A fully coupled E3SMv3+WW3 test on 4 nodes: the model initialization time is 1365 seconds.

These initialization times are WAY too long. They also suggest the problem scales with the number of nodes (twice as many nodes take roughly twice as long to initialize the model).

I have found two workarounds, both of which suggest the issue is reading the big files from the global CFS directory:

1. Copying the files to my local scratch directory: running on 8 nodes, this reduces the init time to 180 seconds (about the same time observed on other machines, such as Chrysalis).
2. Changing DIN_LOC_ROOT to "/dvs_ro/cfs/cdirs/e3sm/inputdata": running on 8 nodes, this reduces the init time to 160 seconds (again, similar to the time observed on other machines).

erinethomas commented 1 week ago

@ndkeen @mahf708 - new issue on the long time needed on PM-CPU for WW3 to read its input files - conversation/suggestions on this issue welcome.

rljacob commented 1 week ago

Why aren't these in netCDF format? You can never read a text file that large quickly.

ndkeen commented 1 week ago

First, just as a sanity check of transfer speeds from CFS, CFS via dvs_ro, and scratch, I tried the simple experiment below to show that they are all "about the same" in this scenario. I think dvs_ro is generally only faster with smaller file sizes, but... as to Rob's point, since these are text files, they are likely NOT being read with a parallel method.

If possible, it's better to use a different file format with supported mechanisms for reading in parallel. But if reading text "manually", you definitely don't want every MPI rank reading the entire file at the same time. I don't yet know if that's what is happening here. Yes, it will be slower (how much slower really depends), but more importantly, it's error-prone and can cause filesystem problems (especially with increasing MPI ranks -- including other jobs trying to read the same file). You could put together a quick patch to have rank 0 read the file and use MPI_Bcast() to communicate the data to the other ranks.
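For illustration, a minimal Fortran sketch of that rank-0-read-plus-broadcast pattern (the routine name, array shape, and list-directed parsing here are assumptions for the example, not the actual WW3 reader):

```fortran
! Sketch only: have one rank touch the filesystem, then broadcast.
subroutine read_obstructions_bcast(filename, obstr, n, comm)
  use mpi
  implicit none
  character(len=*), intent(in)  :: filename
  integer,          intent(in)  :: n, comm   ! number of values, MPI communicator
  real,             intent(out) :: obstr(n)  ! obstruction values (illustrative)
  integer :: rank, ierr, iu

  call MPI_Comm_rank(comm, rank, ierr)

  if (rank == 0) then
    ! Only rank 0 reads the large text file from CFS/scratch.
    open(newunit=iu, file=filename, status='old', action='read')
    read(iu, *) obstr
    close(iu)
  end if

  ! All other ranks get the data over the interconnect instead of
  ! hitting the filesystem with duplicate reads.
  call MPI_Bcast(obstr, n, MPI_REAL, 0, comm, ierr)
end subroutine read_obstructions_bcast
```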

Time in seconds to copy each file from the given source location on Perlmutter (see transcript below):

| file | CFS | dvs_ro CFS | scratch |
| --- | --- | --- | --- |
| ob local | 0.15 | 0.16 | 0.12 |
| ob shadow | 0.28 | 0.30 | 0.22 |

```
perlmutter-login18% pwd
/global/homes/n/ndk/tmp

rm ob*in

perlmutter-login18% time cp /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.149s 0:00.25 56.0% 0pf+0w

perlmutter-login18% time cp /global/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.284s 0:00.48 58.3% 0pf+0w

rm ob*in

perlmutter-login18% time cp /dvs_ro/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.158s 0:00.24 62.5% 0pf+0w

perlmutter-login18% time cp /dvs_ro/cfs/cdirs/e3sm/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.298s 0:00.45 64.4% 0pf+0w

rm ob*in

perlmutter-login18% time cp /pscratch/sd/n/ndk/inputdata/wav/ww3/obstructions_local.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.124s 0:00.22 54.5% 0pf+0w

perlmutter-login18% time cp /pscratch/sd/n/ndk/inputdata/wav/ww3/obstructions_shadow.wQU225Icos30E3r5sp36x36.rtd.in .
0.000u 0.215s 0:00.38 55.2% 0pf+0w
```

sarats commented 1 week ago

As reading an ASCII or other small file seems to be a recurring pattern, I would suggest we put a read-and-broadcast operation in Scorpio and have everyone use it, rather than each sub-model implementing it on its own. Of course, it's trivial to do this right, but one reusable routine is better for maintenance. cc @jayeshkrishna

The best path is for any input files that need to be read in parallel to be in netCDF.
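For the read-and-broadcast routine suggested above, a hypothetical sketch of what such a shared helper could look like (the name and interface are invented for illustration and don't exist anywhere yet): rank 0 reads the raw file once and broadcasts the bytes, and each rank then parses the buffer locally.

```fortran
! Hypothetical reusable helper; name and interface are illustrative only.
subroutine read_text_file_bcast(filename, buffer, comm)
  use mpi
  implicit none
  character(len=*),              intent(in)  :: filename
  character(len=:), allocatable, intent(out) :: buffer   ! whole file contents
  integer,                       intent(in)  :: comm
  integer :: rank, ierr, iu, nbytes

  call MPI_Comm_rank(comm, rank, ierr)

  if (rank == 0) then
    inquire(file=filename, size=nbytes)          ! file size in bytes
    allocate(character(len=nbytes) :: buffer)
    open(newunit=iu, file=filename, access='stream', form='unformatted', &
         status='old', action='read')
    read(iu) buffer
    close(iu)
  end if

  ! Share the size first, then the contents; callers parse the buffer locally.
  call MPI_Bcast(nbytes, 1, MPI_INTEGER, 0, comm, ierr)
  if (rank /= 0) allocate(character(len=nbytes) :: buffer)
  call MPI_Bcast(buffer, nbytes, MPI_CHARACTER, 0, comm, ierr)
end subroutine read_text_file_bcast
```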

rljacob commented 1 week ago

The better place for that would be E3SM/share/util. SCORPIO should remain focused on large scale parallel reads/writes.

erinethomas commented 1 week ago

@ndkeen - I'm pretty sure WW3 IS, in fact, reading the entire file on each task. Not good.

philipwjones commented 1 week ago

@erinethomas Just for clarification - is this occurring in source under our control? i.e. Does this occur within the WWIII source? Or are these reads taking place within MPAS for use in WWIII?

erinethomas commented 1 week ago

> @erinethomas Just for clarification - is this occurring in source under our control? i.e. Does this occur within the WWIII source? Or are these reads taking place within MPAS for use in WWIII?

This is happening within the WW3 source (not in MPAS) - we have a fork of WW3 source code for use within E3SM (as a submodule) that we have full control over and can modify to suit our needs.

philipwjones commented 1 week ago

So it seems like the most appropriate solution (besides changing the file location) is to modify the WWIII source. If this is reading a table or set of values shared by all tasks, we should do a read from the master task and broadcast. If the values are meant to be distributed (i.e., each task needs a subset of the values), we should do proper parallel I/O. Let us know if you need help - the broadcast is relatively easy, but the parallel I/O is a bit more involved.
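As a generic illustration of the distributed case (not WW3 code), the master task could read everything and hand each rank only its slice with MPI_Scatterv; production-quality parallel I/O would more likely go through netCDF/SCORPIO as discussed above:

```fortran
! Generic illustration: root reads everything, then scatters per-rank subsets.
subroutine read_and_scatter(filename, nglobal, counts, displs, local, comm)
  use mpi
  implicit none
  character(len=*), intent(in)  :: filename
  integer,          intent(in)  :: nglobal, comm
  integer,          intent(in)  :: counts(:), displs(:)  ! per-rank counts/offsets
  real,             intent(out) :: local(:)              ! this rank's subset
  real, allocatable :: global(:)
  integer :: rank, ierr, iu

  call MPI_Comm_rank(comm, rank, ierr)

  if (rank == 0) then
    allocate(global(nglobal))
    open(newunit=iu, file=filename, status='old', action='read')
    read(iu, *) global
    close(iu)
  else
    allocate(global(1))   ! dummy; send buffer is only significant at root
  end if

  ! Each rank receives only the values it owns.
  call MPI_Scatterv(global, counts, displs, MPI_REAL, &
                    local, size(local), MPI_REAL, 0, comm, ierr)
end subroutine read_and_scatter
```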

ndkeen commented 1 week ago

A few comments:

a) Looks like you are on the right track, and we all agree that, at least for production cases, we don't want each MPI rank reading the same file serially.

b) It's not clear to me that this is actually what is slowing you down. I think you said the init as a whole is faster with scratch (or using dvs_ro), but that can include other things besides reading these 2 files. I can look at your cases and learn more, and/or try to reproduce.

c) Yes, I have been communicating with NERSC about using scratch (Lustre) space to experiment with as a location for inputdata. It would be non-purged, but there are still some other details to work out (like unix groups -- I was hoping to avoid the concept of collab accounts -- ideally we want it to behave exactly the same way as CFS does for us now), and then we can start experimenting. I already have my own personal copy of oft-used inputs in my scratch space (/pscratch/sd/n/ndk/inputdata) that I have been occasionally experimenting with for quite a while. I've found that it sometimes helps and sometimes does not, so it can depend on what we are doing. I was actually trying to steer us toward using /dvs_ro with CFS first, but that may not be worth it as a global solution. I do think the huge files we are starting to add (like those for 3km SCREAM runs) may well be better read from scratch in whatever way we think is best. Could we have a concept of small vs. large files in inputdata? I will explore the option of inputdata on shared scratch more with NERSC, and can make a separate issue for that discussion.

ndkeen commented 2 days ago

Can you please post a way to reproduce this issue?