Closed: RickKessler closed this issue 3 years ago.

Description
For Sprint Week, can DESC members have a VERY high priority queue for <10 minute jobs? This would really help us to actually "sprint" without getting stalled.
Hi @RickKessler. Before presenting this to NERSC, I think we would have to make a case. Specifically, could the interactive queue serve this purpose? The wait time to get connected should be less than 6 minutes, and it can handle jobs of up to 4 hours.
We may also want to let NERSC know more generally that there will be increased use of Jupyter and the batch queues during Sprint Week, perhaps noting some specific days and time frames.
@RickKessler Do you think the interactive queue might work for your purposes at Sprint Week?
It might. I tried using the interactive queue before (following the online NERSC documentation) and didn't succeed.
We should probably sort that out. We can try to do it here: if you send a log of what happens, we can probably just chat for 15 minutes on Zoom to figure it out (hopefully)!
I tried the interactive queue and it does indeed work. However, the case I'm concerned about is, say, a 1-hour job that we would like to distribute over 12 cores to get a result in 5 minutes.
Also remember: "Additionally, each NERSC project is further limited to a total of 64 nodes between all their interactive jobs (KNL or haswell)". That limit counts for all of DESC, so we might see if it can be relaxed next week if the plan is to rely on the interactive queue.
@RickKessler I don't quite follow your worry about the interactive queue: you can absolutely get 12 cores (or 12 nodes) for a 5-minute job, and even if the maximum time requested were 1 hour, we'd only be charged for the 5 minutes used. Would that work for you, or did I misunderstand?
Hmmm, I don't know how to get multiple cores in the interactive queue. The issue with getting 12 cores in the regular queue is that we often have to wait a while, which is fine during normal work but disruptive during Sprint Week.
I don't think it's possible to be allocated fewer than 32 cores on the Cori interactive queue. If you do:
salloc -N 1 -C haswell -t 1:00:00 -q interactive -A m1727
Then you get allocated one node, with 32 cores (and 2 threads per core) on it.
If you want to use MPI, you then use srun instead of mpirun. Alternatively, you can skip MPI entirely and use e.g. Python multiprocessing with up to 32 processes, or use thread-level parallelism directly (e.g. OpenMP-enabled codes).
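For the multiprocessing route, here is a minimal sketch of spreading independent tasks over the 32 cores of a single interactive Haswell node; the task function and task list are made-up placeholders, not part of any DESC code:

```python
# Hedged sketch (placeholder work, not a DESC script): use Python
# multiprocessing to run independent tasks across the 32 cores of one
# interactive Haswell node, with no MPI involved.
from multiprocessing import Pool

def run_one(task_id):
    # Stand-in for the real per-task work (e.g. one fit out of many).
    return task_id ** 2

if __name__ == "__main__":
    tasks = range(120)  # hypothetical list of independent tasks
    with Pool(processes=32) as pool:  # roughly one worker per physical core
        results = pool.map(run_one, tasks)
    print(f"finished {len(results)} tasks")
```

Inside the salloc session above you should be able to run something like this directly with python my_script.py; srun is only needed for the MPI case.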