JeffersonLab / SRO-RTDP


Superfacility real-time at NERSC #17

Closed (faustus123 closed 9 months ago)

faustus123 commented 11 months ago

Investigate the Superfacility near-real-time experiment support at NERSC. Find out what is needed to use it for testing purposes with RTDP.

This may need a new project at NERSC. Learn about the EJFAT+ERSAP (modified CLARA engines) test already being configured (ask Vardan) to understand what is being done there and how we might reproduce it under RTDP.

cissieAB commented 11 months ago

@tsaie79 is looking into SuperFacility API to submit jobs

cissieAB commented 11 months ago

According to the NERSC Superfacility 2022 report (Section 3.6, page 29), real-time scheduling is achieved by setting aside a special high-priority partition ("Reservations + preemption") in Slurm.

tsaie79 commented 11 months ago

It seems that one can use QOS="realtime" to reserve the nodes. As expected, it is not available to everyone.
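For illustration, here is a minimal sketch of what a batch script requesting that QOS might look like, generated from Python. The helper name `render_batch_script`, the node count, the time limit, and the `./rtdp_test` command are all hypothetical; only the `--qos`, `--nodes`, and `--time` options are standard `sbatch` flags. An account without access to the "realtime" QOS would presumably need to substitute another QOS name.

```python
def render_batch_script(command, qos="realtime", nodes=1, minutes=30):
    """Return the text of a Slurm batch script (hypothetical helper).

    Only standard sbatch options are used; the default values here are
    placeholders, not NERSC recommendations.
    """
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --qos={qos}",          # QOS name taken from the comment above
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        "",
        f"srun {command}",               # command to launch on the allocation
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "./rtdp_test" is a placeholder executable name.
    print(render_batch_script("./rtdp_test", minutes=90))
```

Writing the returned text to a file on the NERSC filesystem would give a script that can be submitted with `sbatch` in the usual way.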

cissieAB commented 11 months ago

One talk about the Superfacility API: https://www.youtube.com/watch?v=dmbBJmMUErU. Starting at ~25:50, a Jupyter notebook walks through authenticating and submitting jobs to the Cori supercomputer using the API endpoints.

faustus123 commented 11 months ago

OK. Looking at the video, it seems the API is a Python module we can run locally to interact with what is essentially a Python environment on their internal login node. We can invoke predefined commands in that environment, passing Python dictionaries to specify the arguments. One of these commands takes an executable argument whose value is a shell script that gets run on the login node.

Jobs are configured as regular SLURM scripts and submitted through their API by pointing to the script on the NERSC filesystem.
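As a rough sketch of that submission step, the snippet below builds (but does not send) an HTTP request that points the API at a script already on the NERSC filesystem. Everything specific here is an assumption to be checked against the current API documentation: the base URL, the `/compute/jobs/{machine}` path, the `job` and `isPath` form fields, and the bearer-token header. The helper name `build_submit_request`, the machine name, the script path, and the token are placeholders.

```python
import urllib.parse
import urllib.request

# Assumed base URL and version; verify against the current API docs.
SFAPI_BASE = "https://api.nersc.gov/api/v1.2"


def build_submit_request(machine, script_path, access_token):
    """Build a POST request asking the API to submit an existing script.

    Hypothetical helper: the endpoint layout and field names are
    assumptions, not confirmed by this thread.
    """
    url = f"{SFAPI_BASE}/compute/jobs/{machine}"
    # isPath=true is assumed to mean "job is a path, not script text".
    body = urllib.parse.urlencode(
        {"job": script_path, "isPath": "true"}
    ).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", f"Bearer {access_token}")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_submit_request("perlmutter", "/path/to/job.sh", "TOKEN")
    print(req.get_method(), req.full_url)
```

Actually sending the request would be a call to `urllib.request.urlopen(req)` with a valid access token obtained through the authentication flow shown in the talk.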

Looking at the following link, I agree that it looks like the only difference for realtime jobs is that the queue is "very high" priority. https://docs.nersc.gov/jobs/policy/#selecting-a-queue

I think the next step here is to try one of the "high" priority queues. We can do a lot of testing with this even with them being limited to only a few nodes for a few hours.

We will also try using the JIRIAF interface which, in principle, will give faster turn-around than "medium" priority for large jobs.

I think we should get an initial allocation for RTDP that all of us can charge to. Maybe that should be a separate issue though.

cissieAB commented 11 months ago

Some useful resource:

tsaie79 commented 10 months ago

I just got the Green permission to use the API. For Green, I got it right away when I applied.