LSSTDESC / desc-help

DESC Computing Requests

Simulation jobs on Perlmutter running more than x5 slower than before. #98

Closed RickKessler closed 1 year ago

RickKessler commented 1 year ago

Description

Compared to benchmark speed tests in Feb/Mar/Apr, recent sim jobs sent to slurm run about x6 slower. See outputs in /global/cfs/cdirs/lsst/groups/TD/SN/SNANA/SURVEYS/LSST/ROOT/ELASTICC/survey_config/SIMLOGS_ELASTICC_TEST09_LSST_WFD_SLOW.


There is a clue that the file system is part of the problem. Each simulation task starts at a random place in the SIMLIB/cadence file, and reading to near the end takes ~1 minute (sometimes 2 minutes), compared to a few seconds interactively; see, e.g., grep BATCH TMP_kess_ELASTICC_TEST09_LSSTWFD0011LOG | grep seconds
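As a quick check of the file-system contribution, one could time a full sequential read of the cadence file from CFS and from a PSCRATCH copy. A minimal sketch, assuming the SIMLIB path quoted later in this thread and a hypothetical copy under $PSCRATCH:

    # Compare a sequential read of the ~4 GB cadence file from CFS vs. a PSCRATCH copy.
    SIMLIB=$SNANA_LSST_ROOT/simlibs/baseline_v2.0_10yrs_WFD.simlib
    time dd if=$SIMLIB of=/dev/null bs=4M                                # CFS (DVS-mounted) read
    time dd if=$PSCRATCH/SIMLIB/$(basename $SIMLIB) of=/dev/null bs=4M   # hypothetical PSCRATCH copy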

heather999 commented 1 year ago

Hi Rick, those SIMLIB/cadence files are on CFS, right? What we know from NERSC is that CFS read/write performance was expected to be degraded relative to what it was over the past few months, due to changes they announced on June 9th:

We’re mounting the NGF systems using a tool called DVS, which is an I/O forwarder that is expected to be more stable than mounting the systems natively. DVS should eliminate most of the issues where NGF was slow or hanging on Perlmutter. We’ll run under this new DVS configuration for about a week, and if it proves to eliminate the instability issues, we’ll use this configuration for the near future.

The new configuration will introduce some changes:

Global common will now be mounted read-only on the compute nodes. This means any processes that need to write to global common will need to move to the login nodes. (This is the same as it was on Cori.)

Reads and writes to CFS (and your home directory) will become slower.

Now, I cannot explain why the IO performance may be different when using the login nodes versus the compute nodes; that may be worth asking. But as far as a fix goes, I suspect NERSC's immediate response would be that we need to try moving/copying all the IO to PSCRATCH and then copy the output back to CFS for longer-term storage. So my current plan is to ask NERSC whether they are indeed still using DVS for the foreseeable future, ask if there are any known differences in IO performance on CFS on login vs. compute nodes, and then we can go from there.
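For concreteness, here is a minimal sketch of that staged-IO pattern, assuming illustrative directory names and a placeholder launch command rather than the actual SNANA batch setup:

    #!/bin/bash
    #SBATCH -C cpu
    #SBATCH -t 12:00:00
    # Sketch: stage read-heavy inputs from CFS to PSCRATCH, run there,
    # then copy the results back to CFS for long-term storage.
    STAGE=$PSCRATCH/snana_stage                          # illustrative staging area
    mkdir -p $STAGE $PSCRATCH/simlogs
    rsync -a $SNANA_LSST_ROOT/simlibs/ $STAGE/simlibs/   # inputs: CFS -> PSCRATCH
    SIM_CMD="echo run-ELASTICC-sim-here"                 # placeholder for the real SNANA launch
    cd $PSCRATCH/simlogs && $SIM_CMD
    rsync -a $PSCRATCH/simlogs/ /global/cfs/cdirs/lsst/groups/TD/SN/simlogs_backup/   # illustrative CFS destination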

RickKessler commented 1 year ago

All sim output is written to /pscratch, but input libraries are stored on CFS with the assumption that they are backed up. I think the bottleneck is reading through this 4 GB SIMLIB/cadence file, $SNANA_LSST_ROOT/simlibs/baseline_v2.0_10yrs_WFD.simlib, where it reads several hundred rows per event. All other maps & libraries are stored in memory, but the cadence library can be arbitrarily large.

heather999 commented 1 year ago

I edited my first comment to add the specific statement from NERSC about the DVS tool that is now in use: "Reads and writes to CFS (and your home directory) will become slower."

Can we try a test where these SIMLIB cadence files are put on PSCRATCH to see if that helps?
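One simple version of that test, assuming the SIMLIB path and scratch directory that appear later in this thread, is just to copy the file and re-point the sim input at the copy:

    # Copy the cadence file to PSCRATCH ...
    mkdir -p /pscratch/sd/d/desctd/SIMLIB
    cp $SNANA_LSST_ROOT/simlibs/baseline_v2.0_10yrs_WFD.simlib /pscratch/sd/d/desctd/SIMLIB/
    # ... then edit the sim-input so the SIMLIB/cadence file is read from the PSCRATCH copy.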

RickKessler commented 1 year ago

I moved the cadence file to /pscratch/sd/d/desctd/SIMLIB and the sim-generation speed improved significantly, but it is still much slower than in Feb. To compare CPU speed, I generated 10% of the ELASTICC sample and used the sum of CPU times evaluated by the native C-code time() function, so it does not count overhead from slurm and the slurm-control script. Here is a table of results:

    date     CPU_SUM   comment
    Feb 02   18 hr     testing Perlmutter; benchmark CPU time
    Jun 25   105 hr    no changes since Feb
    Jun 26   39 hr     read simlib/cadence from pscratch instead of from CFS
    Jun 27   ??        read simlib from /pscratch and launch jobs from /pscratch; I killed the jobs after 6 hr

For the Jun 27 test, the 5 jobs on nid004105 finished after ~3 hr wall time, slightly better than the Jun 26 wall time, while the remaining 15 jobs on nid004110 ran more than x2 slower. I visually examined the sim-stdout files during generation, and I could see the code hanging for a long time as it read each new SED time-series from CFS.

My hunch is that to get optimal speed we would need to move the SNANA environment ($SNDATA_ROOT, $SNANA_LSST_ROOT) to /pscratch. However, these input maps and libraries might get purged over time, or be lost in a disk failure, since there is no backup there. $SNDATA_ROOT is a mirror of a public repository and thus easy to restore, but it would be quite annoying if files are mysteriously purged after a few months. $SNANA_LSST_ROOT contains our proprietary DESC files that are not backed up. Moving the SNANA env to /pscratch could work if the /pscratch copy were exempt from purging and the CFS copy were kept as the backup.
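One way to get the PSCRATCH speed without giving up the CFS backup is to keep the CFS copies as the master and mirror them to scratch, re-pointing the environment variables for batch jobs. A sketch with illustrative directory names:

    # Keep the CFS copies as the backed-up master; mirror them to PSCRATCH.
    SNDATA_CFS=$SNDATA_ROOT                 # original CFS locations
    SNANA_LSST_CFS=$SNANA_LSST_ROOT
    MIRROR=$PSCRATCH/cfs_mirror             # illustrative mirror area
    mkdir -p $MIRROR
    rsync -a --delete $SNDATA_CFS/     $MIRROR/SNDATA_ROOT/
    rsync -a --delete $SNANA_LSST_CFS/ $MIRROR/SNANA_LSST_ROOT/
    # Point batch jobs at the mirror; rerun the rsync to pick up CFS updates.
    export SNDATA_ROOT=$MIRROR/SNDATA_ROOT
    export SNANA_LSST_ROOT=$MIRROR/SNANA_LSST_ROOT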

RickKessler commented 1 year ago

Generating the full ELASTICC sample used to take about 6 hr of wall time (40 cores) and was well below the 12 hr wall-time limit on Perlmutter. I tried again last night, and everything timed out at 12 hr without coming close to finishing. For the July plan to begin streaming to brokers, I can generate the ELASTICC sample on our U.Chicago/RCC system and postpone the Perlmutter issue. However, this is not a good long-term solution; for example, project 296 will take significant resources that I had not planned to use on RCC.

heather999 commented 1 year ago

Hi Rick, first some questions: in your test last night, was this using PSCRATCH for all IO? We can pursue either or both of the options you listed yesterday.

So I guess I'd like to understand the conditions of the test last night to see if these options could help.

NERSC did get back to us via the ticket and we can follow up there, but it does sound like the use of DVS is consistent with the behavior you see on CFS. In addition to the options you listed above, we could also move some portion of the files being read to that DVS mount - I need to look at that option more carefully.

Just let me know if it is worth spending time on this now or if you've decided to work on the Chicago machines.

Here is the response from NERSC on the ticket:

Thanks for the report, and sorry for the headaches for you and your users. tl;dr: yes, this unfortunately sounds like the result of DVS to me. (I think logins are not using DVS and computes are, which may explain the difference you reported.) You can find some more information here: https://docs.nersc.gov/performance/io/dvs/

If your users only need to read files, they might try our /dvs_ro/cfs mount, which can help speed up reads. If SCRATCH works, please also don't hesitate to fill out our quota request form if you all need more space there.

Re: the future of DVS, I am not sure. DVS was not anybody's first choice, but using it did help achieve some much-needed system stability. It's hard to say how long it will remain on the system. I'll direct you to Lisa, who is leading our N9 integration effort and might be able to give you better information.

Please let us know if you have more questions.
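Regarding the /dvs_ro/cfs suggestion above, switching read-only inputs over would just mean swapping the /global/cfs prefix in the relevant paths. A hedged example (the exact value of $SNANA_LSST_ROOT here is an assumption based on the CFS path in the issue description):

    # /global/cfs/cdirs/...  ->  /dvs_ro/cfs/cdirs/...  (read-only, cached DVS mount)
    export SNANA_LSST_ROOT=/dvs_ro/cfs/cdirs/lsst/groups/TD/SN/SNANA/SURVEYS/LSST/ROOT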

heather999 commented 1 year ago

There are some very interesting points in the NERSC doc about DVS, and these things affect us. Is the code you are running (SNANA?) also on CFS, or is it in a container (shifter)? If it is not in shifter, can we put it in shifter?

If your job reads large volumes of data, the fastest file system will almost always be Perlmutter Scratch. However, if many of the processes in your jobs repeatedly read in the same file (e.g. a configuration file), you may see a large speedup by using a read-only DVS mount. On Perlmutter, both Global Common and CFS have corresponding read-only mounts at /dvs_ro/common and /dvs_ro/cfs, respectively. We recommend using these for data that is being read during a job that is not being actively changed. The DVS mount of these file systems will cache data for 5 minutes by default, so if data is being changed, you may see unexpected results.

Things to Avoid With DVS

Avoid ACLs

DVS is unable to cache extended attributes. Extended attributes are features that enable users to associate computer files with metadata not interpreted by the filesystem. The most common kind of extended attribute is an ACL, which can be used to manage complex access permissions for files. Because DVS is unable to cache these attributes, it must access the file system every time it touches the file, which can be very slow, especially at large scale. It is recommended to not use ACLs on any files or directories you need to access at scale during your batch jobs.
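Whether ACLs are actually set on the SNANA input trees is not established in this thread, but a quick spot-check with standard Linux tools would look like this (paths reused from above, purely illustrative):

    # A trailing '+' in the mode string means an ACL is set on that entry.
    ls -ld $SNANA_LSST_ROOT/simlibs
    # Inspect and, if appropriate, remove the extended ACL entries:
    getfacl $SNANA_LSST_ROOT/simlibs/baseline_v2.0_10yrs_WFD.simlib
    setfacl -b $SNANA_LSST_ROOT/simlibs/baseline_v2.0_10yrs_WFD.simlib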

RickKessler commented 1 year ago

A few more test results on slurm speed. First I just used shifter, and it ran even slower. For the next test, I rsynced $SNDATA_ROOT and $SNANA_LSST_ROOT to /pscratch (still using shifter), and the slurm speed increased to match the Feb tests (CPU_SUM = 17.5 hr). I launched one more test without shifter ... same speed as with shifter, so shifter doesn't matter.

heather999 commented 1 year ago

We've had some more discussion offline, and things are set up now to use $PSCRATCH for these jobs. We've also seen more communication from NERSC indicating that the introduction of the DVS tool has likely impacted IO on CFS, so this issue with compute-job performance is now better understood. There is a new area under /pscratch/sd/d/desctd/cfs_mirror that is currently exempt from purging. There is also the alternative of using /dvs_ro at NERSC, but for now it seems $PSCRATCH is going to solve this issue. @RickKessler is it ok to close this, or should we leave it open pending how the new set of jobs perform?

RickKessler commented 1 year ago

Yes, this issue can be closed.