[NERSC] Request for very high priority queue for Sprint Week

LSSTDESC / desc-help

DESC Computing Requests

BSD 3-Clause "New" or "Revised" License

[NERSC] Request for very high priority queue for Sprint Week #71

Closed RickKessler closed 3 years ago

RickKessler commented 3 years ago

Description

For Sprint Week, can DESC members have a VERY high priority queue for jobs under 10 minutes? This would really help us to actually "sprint" without getting stalled.

Choose all applicable topics by placing an 'X' between the [ ]:

[ ] jupyter
[ ] jupyter terminal
[ ] Cori interactive command line
[x] Batch jobs
[ ] python
[ ] CSCRATCH
[ ] Community File System
[ ] HPSS (tape)
[ ] Data transfer and Globus
[ ] New User Account or account access problems

heather999 commented 3 years ago

Hi @RickKessler. Before presenting this to NERSC, I think we would have to make a case. Specifically, could the interactive queue serve this purpose? The wait time to get connected should be under 6 minutes, and the queue can handle jobs up to 4 hours.

We also may just want to generally let NERSC know that there will be increased use of Jupyter and batch queues during Sprint week, perhaps with some specific days/time frames noted.

heather999 commented 3 years ago

@RickKessler Do you think the interactive queue might work for your purposes at Sprint Week?

RickKessler commented 3 years ago

It might. I tried using the interactive queue before (following the online NERSC docs) and didn't succeed.

heather999 commented 3 years ago

We should probably try to sort that out. We can try to do that here if you can send a log of what happens and then we can probably just chat for 15 minutes on Zoom to figure it out (hopefully)!

RickKessler commented 3 years ago

I tried interactive and indeed it works. However, the issue I'm concerned about is, let's say, a 1 hr job that we would like to distribute over 12 cores to have a result in 5 minutes.

cwwalter commented 3 years ago

Also remember: "Additionally, each NERSC project is further limited to a total of 64 nodes between all their interactive jobs (KNL or haswell)". That limit counts for all of DESC, so we might see if it can be relaxed next week if the plan is to rely on the interactive queue.

joezuntz commented 3 years ago

@RickKessler I don't quite follow your worry about the interactive queue - you can absolutely just get 12 cores (or 12 nodes) for a 5 minute job, and even if the maximum time requested was 1 hour we'd only be charged for 5 minutes. Would that work for you or did I misunderstand?

RickKessler commented 3 years ago

Hmmm, I don't know how to get multiple cores in the interactive queue. The issue with getting 12 cores in the nominal queue is that we often have to wait a while ... which is fine during normal work, but disruptive during Sprint Week.

joezuntz commented 3 years ago

I don't think it's possible to be allocated fewer than 32 cores on Cori interactive. If you do:

salloc -N 1 -C haswell -t 1:00:00 -q interactive -A m1727

Then you get allocated one node, with 32 cores (and 2 threads per core) on it.
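
One way to confirm what the allocation gives you, once you are inside it, is to ask Python how many CPUs it can see. This is only a minimal sketch, assuming a Linux node; `usable_cpus` is a hypothetical helper name, not part of any NERSC or DESC tool. Note that `os.cpu_count()` counts logical CPUs (hardware threads), so on a node with 32 cores and 2 threads per core it would be expected to report 64 rather than 32.

```python
import os


def usable_cpus():
    """Return the number of logical CPUs this process may run on.

    Hypothetical helper for illustration. On Linux, the affinity mask
    reflects any restriction placed on the process; elsewhere we fall
    back to the total logical CPU count.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


# Logical CPUs on the machine (cores x hardware threads per core).
print("logical CPUs on node:", os.cpu_count())
# Logical CPUs actually available to this process.
print("CPUs usable by this process:", usable_cpus())
```

If the second number is smaller than the first, the process has been restricted to a subset of the node, and the worker count in any parallel code should follow the smaller value.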

If you want to use MPI, launch your program with srun instead of mpirun. Alternatively, you can skip MPI and use e.g. Python multiprocessing with up to 32 processes. You can also use thread-like parallelism (e.g. OpenMP-enabled codes) directly.
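
As a rough illustration of the multiprocessing route, here is a minimal sketch of splitting one long job into 12 pieces on a single interactive node. The names `split_work`, `process_chunk`, and `run_parallel` are hypothetical, and `process_chunk` is just a stand-in for whatever the real per-chunk work is; the sketch also assumes the job divides into independent pieces.

```python
import multiprocessing as mp


def split_work(items, n_chunks):
    """Split items into n_chunks contiguous chunks of nearly equal size."""
    items = list(items)
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def process_chunk(chunk):
    """Hypothetical stand-in for one slice of the real job."""
    return sum(x * x for x in chunk)


def run_parallel(items, n_procs=12):
    """Run process_chunk on each chunk in its own worker process."""
    with mp.Pool(processes=n_procs) as pool:
        return pool.map(process_chunk, split_work(items, n_procs))


if __name__ == "__main__":
    # 12 workers fit comfortably on one 32-core node.
    partial_results = run_parallel(range(1200), n_procs=12)
    print(sum(partial_results))
```

Run it with plain `python` inside the salloc session. Whether a 1 hr job really finishes in about 5 minutes on 12 processes depends on how evenly the work splits and how much of it is serial.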