Open ndkeen opened 1 month ago
xref https://github.com/E3SM-Project/E3SM/issues/5953 --- I hope we can fix that too when we fix this. In general, I find the P3 table business to be quite complicated, and it has caused issues for me and others.
Is this data something that could be placed directly on the heap at init time?
Yes; also my understanding is that the whole table could actually be calculated at runtime ...
Oh, that would be even better. The file in question is, I think, this one. It's 3.2 MB and 31k lines (of text floats).
perlmutter-login37% ls -lh /global/cfs/cdirs/e3sm/inputdata/atm/scream/tables/p3_lookup_table_1.dat-v4.1.1
-rw-rw-r--+ 1 ndk ndk 3.2M Sep 28 11:37 /global/cfs/cdirs/e3sm/inputdata/atm/scream/tables/p3_lookup_table_1.dat-v4.1.1
A very similar issue is occurring with E3SM+WW3 coupled simulations on Perlmutter. It takes over 45 minutes to initialize the wave model with the current settings. The time WW3 spends reading its input file also scales linearly with the number of nodes requested: double the number of nodes = twice the initialization time. This is very problematic. The only solution I have found is moving the file WAVEWATCHIII reads out of the CFS directory.
Erin: For this issue, I'm just trying to find a solution for a specific place in the code where it's clear we have all ranks reading the same file. In general that's a bad idea, but in practice you might only run into problems at large MPI counts. It sounds like what you are seeing is general slowness of reading files from CFS, which could be caused by different things -- though if you do know all ranks are reading the file serially, you certainly want to fix that. Reading from scratch would still have the same problem. It might be better to create a new issue describing the problem you are seeing and how to reproduce it.
hi Noel - Thanks for the feedback - I can make a new issue. Reading the file from scratch does NOT seem to have the same issue, so it seems to be just CFS.
I too would be very curious to see your example, so I hope you can open an issue and point us to the code and a reproducer. I assume you don't see any issue on chrysalis or any other HPC machine, right?
@erinethomas when you have a chance, could you please test with the file on CFS but change the type to cdf5? Long story short, I had a similar issue (but for me, it was getting completely stuck reading the file), and when I moved the file from CFS to SCRATCH, it worked. When I changed the file type from classic to cdf5, it also worked.
@mahf708 - the file is an ASCII text file... I will open a new issue to further discuss the specifics soon.
NERSC has even suggested we move our inputdata from CFS (which uses DVS) to scratch (Lustre). They have said we can have scratch space that is not purged for this purpose.
I think this is good to do as Lustre is better suited for this purpose.
A branch to read the P3 lookup table with one rank and broadcast it to the others is here:
ndk/p3/read-txt-table-with-1rank
For cases that use P3 (note I assumed there were such cases in E3SM, but I'm not currently finding any in the set I've been testing...), we are reading a small text file in a poor parallel manner (every MPI rank reads the same file). I was surprised to find we are doing this; surely it was a mistake, as this is never a good idea. While the file is small, it still causes issues for the filesystems, and NERSC admins are noticing. It could also cause a slowdown (or even a stall/hang).
I have been testing a quick fix to have rank 0 read the file and broadcast the data to the other ranks. It appears to be BFB, but it will need more work to be implemented properly.
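For reference, the pattern being tested is the standard rank-0-read-plus-broadcast idiom. Below is a minimal, hypothetical C++/MPI sketch of that idiom, not the actual branch code (the real fix lives in the P3 initialization on the branch above, and the file path, value count, and data layout here are placeholders):

```cpp
// Sketch: rank 0 reads an ASCII table of floats, all other ranks receive it
// via MPI_Bcast instead of opening the file themselves.
#include <mpi.h>
#include <fstream>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> table;
  int n = 0;

  if (rank == 0) {
    // Only rank 0 touches the filesystem (hypothetical local copy of the table).
    std::ifstream f("p3_lookup_table_1.dat-v4.1.1");
    double v;
    while (f >> v) table.push_back(v);
    n = static_cast<int>(table.size());
  }

  // First tell every rank how many values to expect, then broadcast the data.
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank != 0) table.resize(n);
  MPI_Bcast(table.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) std::printf("broadcast %d table values\n", n);

  MPI_Finalize();
  return 0;
}
```

With this pattern only one rank issues metadata and read requests against CFS, which is what avoids the per-rank filesystem load described above.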
NERSC has even suggested we move our inputdata from CFS (which uses DVS) to scratch (Lustre). They have said we can have scratch space that is not purged for this purpose. In general, I've been testing the performance of reading from scratch and it seems about the same, but if the sole reason for moving is to avoid complications such as this, hopefully we can just fix the read pattern instead.
I also made an issue in scream (will link), as the same problem exists there, but the implementation of the fix may be slightly different.