paciorek opened this issue 1 year ago
Sorry for the delay. I think we need to identify what combinations of (ntasks, nodes, cpus_per_task) exist and what to expect from availableCores() and availableWorkers().

For a start, I've created a script (sbatch-params-all.R.txt) that sbatch-submits a job script with different combinations of (ntasks, nodes, cpus_per_task) and gathers Slurm environment variables as well as availableCores() and availableWorkers() for each such job. Here's the result from some cases:
# A tibble: 25 × 13
ntasks nodes cpus_per_task HOSTNAME NTASKS JOB_NUM_NODES JOB_NODELIST JOB_CPUS_PER_NODE TASKS_PER_NODE CPUS_PER_TASK CPUS_ON_NODE availableCores availableWorkers
<int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 NA NA NA c4-n12 NA 1 c4-n12 2 2 NA 2 2 2*c4-n12
2 1 NA NA c4-n12 1 1 c4-n12 2 1 NA 2 2 2*c4-n12
3 1 1 NA c4-n12 1 1 c4-n12 2 1 NA 2 2 2*c4-n12
4 1 NA 1 c4-n12 1 1 c4-n12 2 1 1 2 1 1*c4-n12
5 1 1 1 c4-n12 1 1 c4-n12 2 1 1 2 1 1*c4-n12
6 2 NA NA c4-n12 2 1 c4-n12 2 2 NA 2 2 2*c4-n12
7 2 NA 1 c4-n12 2 1 c4-n12 2 2 1 2 1 1*c4-n12
8 2 NA 2 c4-n12 2 1 c4-n12 4 2 2 4 2 2*c4-n12
9 2 1 NA c4-n12 2 1 c4-n12 2 2 NA 2 2 2*c4-n12
10 2 1 1 c4-n12 2 1 c4-n12 2 2 1 2 1 1*c4-n12
11 2 1 2 c4-n12 2 1 c4-n12 4 2 2 4 2 2*c4-n12
12 2 2 NA c4-n12 2 2 c4-n[12-13] 2(x2) 1(x2) NA 2 1 2*c4-n12, 2*c4-n13
13 2 2 1 c4-n12 2 2 c4-n[12-13] 2(x2) 1(x2) 1 2 1 1*c4-n12, 1*c4-n13
14 2 2 2 c4-n12 2 2 c4-n[12-13] 2(x2) 1(x2) 2 2 2 2*c4-n12, 2*c4-n13
15 4 NA 2 c4-n12 4 1 c4-n12 8 4 2 8 2 2*c4-n12
16 4 2 2 c4-n12 4 2 c4-n[12-13] 6,2 3,1 2 6 2 2*c4-n12, 2*c4-n13
17 16 NA NA c4-n12 16 1 c4-n12 16 16 NA 16 16 16*c4-n12
18 16 NA 4 c4-n12 16 2 c4-n[12-13] 40,24 10,6 4 40 4 4*c4-n12, 4*c4-n13
19 16 1 NA c4-n13 16 1 c4-n13 16 16 NA 16 16 16*c4-n13
20 16 4 NA c4-n1 16 4 c4-n[1-4] 10,2(x3) 10,2(x3) NA 10 10 10*c4-n1, 2*c4-n2, 2*c4-n3, 2*c4-n4
21 16 4 1 c4-n1 16 4 c4-n[1-4] 10,2(x3) 10,2(x3) 1 10 1 1*c4-n1, 1*c4-n2, 1*c4-n3, 1*c4-n4
22 NA 1-2 8 c4-n12 NA 2 c4-n[12-13] 8(x2) 1(x2) 8 8 8 8*c4-n12, 8*c4-n13
23 NA 2 8 c4-n12 NA 2 c4-n[12-13] 8(x2) 1(x2) 8 8 8 8*c4-n12, 8*c4-n13
24 NA 2 8 c4-n12 NA 2 c4-n[12-13] 8(x2) 1(x2) 8 8 8 8*c4-n12, 8*c4-n13
25 NA 4 8 c4-n1 NA 4 c4-n[1-4] 8(x4) 1(x4) 8 8 8 8*c4-n1, 8*c4-n2, 8*c4-n3, 8*c4-n4
PS. I've dropped the SLURM_ prefix for the env vars in this table.
Here's a cleaned-up version (sbatch-params-all.R.txt). I'm now sorting by SLURM_CPUS_PER_TASK == cpus_per_task:
# A tibble: 25 × 10
ntasks nodes cpus_per_task JOB_NUM_NODES JOB_NODELIST JOB_CPUS_PER_NODE CPUS_ON_NODE TASKS_PER_NODE availableCores availableWorkers
<int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 NA NA NA 1 c4-n12 2 2 2 2 2*c4-n12
2 1 NA NA 1 c4-n12 2 2 1 2 2*c4-n12
3 1 1 NA 1 c4-n12 2 2 1 2 2*c4-n12
4 16 NA NA 1 c4-n12 16 16 16 16 16*c4-n12
5 16 1 NA 1 c4-n13 16 16 16 16 16*c4-n13
6 16 4 NA 4 c4-n[3-5,37] 10,2(x3) 10 10,2(x3) 10 10*c4-n3, 2*c4-n37, 2*c4-n4, 2*c4-n5
7 2 NA NA 1 c4-n12 2 2 2 2 2*c4-n12
8 2 1 NA 1 c4-n12 2 2 2 2 2*c4-n12
9 2 2 NA 2 c4-n[12-13] 2(x2) 2 1(x2) 1 2*c4-n12, 2*c4-n13
10 1 NA 1 1 c4-n12 2 2 1 1 1*c4-n12
11 1 1 1 1 c4-n12 2 2 1 1 1*c4-n12
12 16 4 1 4 c4-n[3-5,37] 10,2(x3) 10 10,2(x3) 1 1*c4-n3, 1*c4-n37, 1*c4-n4, 1*c4-n5
13 2 NA 1 1 c4-n12 2 2 2 1 1*c4-n12
14 2 1 1 1 c4-n12 2 2 2 1 1*c4-n12
15 2 2 1 2 c4-n[12-13] 2(x2) 2 1(x2) 1 1*c4-n12, 1*c4-n13
16 2 NA 2 1 c4-n12 4 4 2 2 2*c4-n12
17 2 1 2 1 c4-n12 4 4 2 2 2*c4-n12
18 2 2 2 2 c4-n[12-13] 2(x2) 2 1(x2) 2 2*c4-n12, 2*c4-n13
19 4 NA 2 1 c4-n12 8 8 4 2 2*c4-n12
20 4 2 2 2 c4-n[12-13] 6,2 6 3,1 2 2*c4-n12, 2*c4-n13
21 16 NA 4 2 c4-n[12-13] 40,24 40 10,6 4 4*c4-n12, 4*c4-n13
22 NA 1-2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
23 NA 2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
24 NA 2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
25 NA 4 8 4 c4-n[3-4,38-39] 8(x4) 8 1(x4) 8 8*c4-n3, 8*c4-n38, 8*c4-n39, 8*c4-n4
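An aside on reading these tables: Slurm compresses repeated per-node values, so TASKS_PER_NODE=10,2(x3) means one node runs 10 tasks and three nodes run 2 each. A minimal POSIX-shell sketch that expands such values (expand_counts is a hypothetical helper, not part of Slurm or parallelly):

```shell
# Expand Slurm's compressed per-node notation, e.g. "10,2(x3)" -> 10 2 2 2,
# printing one value per line.
expand_counts() {
  echo "$1" | tr ',' '\n' | while IFS= read -r item; do
    case "$item" in
      *'(x'*)
        val=${item%%"("*}                 # value before "(x"
        reps=${item#*x}; reps=${reps%)}   # repetition count between "(x" and ")"
        i=0
        while [ "$i" -lt "$reps" ]; do echo "$val"; i=$((i + 1)); done
        ;;
      *) echo "$item" ;;                  # plain value, no repetition suffix
    esac
  done
}

expand_counts "10,2(x3)"   # prints 10, 2, 2, 2 (one value per line)
```

The same helper decodes JOB_CPUS_PER_NODE values such as 8(x4) or 6,2.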
This is a fun little puzzle! Here's one possible approach that tries to have availableCores and availableWorkers follow closely Slurm's conception of "CPUs" and "tasks":

- have availableCores report the total number of cores available across all nodes, i.e., the sum of JOB_CPUS_PER_NODE, so it wouldn't always directly relate to --cpus-per-task (e.g., if a node is scheduled exclusively but a user omitted --cpus-per-task or set it lower than the number of cores on the node)
- the multisession and multicore plans start as many workers as given by availableCores by default (*)
- have availableWorkers report as many workers as Slurm tasks. If --ntasks (or --ntasks-per-node) is missing, availableWorkers reports as many workers as Slurm nodes, since Slurm sets one task per node in that case
- the cluster plan sets workers to the output of availableWorkers by default (**)

(*) This is not good in the case of a multi-node allocation and a user using the multisession or multicore plan, but this would mostly be user error in understanding how distributed computing works. Perhaps issue a warning or do something else in that case. If one instead had availableCores be based only on the primary node, a user would have idle nodes, so this is not good either.

(**) A user might think they are parallelizing across all available cores in this case, but they wouldn't be. One could issue a warning, but it would trigger falsely in the case of using threading (e.g., BLAS) nested within future's workers.

(Disclaimer: I'm not thinking about multithreading and threads per CPU core at all here.)
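To make the availableWorkers proposal concrete, here is a shell sketch that repeats each hostname once per Slurm task on that node, using the (ntasks=4, nodes=2, cpus_per_task=2) case from the table above, where TASKS_PER_NODE is 3,1. The helper name and the hard-coded inputs are illustrative only:

```shell
# Emit one worker entry per Slurm task: hostname i is repeated tasks_i times.
# Arguments: a space-separated hostname list and a matching list of task counts.
workers_from_tasks() {
  nodes=$1
  set -- $2                 # task counts become positional parameters
  for node in $nodes; do
    n=$1; shift             # task count for this node
    i=0
    while [ "$i" -lt "$n" ]; do echo "$node"; i=$((i + 1)); done
  done
}

workers_from_tasks "c4-n12 c4-n13" "3 1"
# prints c4-n12, c4-n12, c4-n12, c4-n13 (one per line): 4 workers for 4 tasks
```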
Hi. Thanks. Some quick comments for clarification/additional constraints:
- have availableCores report the total number of cores available across all nodes
The design and purpose of availableCores() is to "Get Number of Available Cores on The Current Machine". It is meant to take the place of parallel::detectCores(), which is the common go-to for identifying the number of cores available, but which also comes with quite a few potholes (https://www.jottr.org/2022/12/05/avoid-detectcores/).

So, availableCores() should return the number of cores available on the current machine. Looking at the code, which evolved over time, it's quite clear the current strategy is complex and convoluted. I haven't had time to fully digest the idea, but maybe availableCores() should return CPUS_ON_NODE. That should report the number of cores that Slurm has given the job to work with on that machine. To avoid overuse, it's of course important that the user does not spin up sub-jobs via srun and the like.
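As a minimal sketch of that rule (the hard-coded value below stands in for what Slurm would export inside a job; this is not parallelly's actual implementation):

```shell
# Inside a Slurm allocation, SLURM_CPUS_ON_NODE says how many cores the job
# has on this machine; set a stand-in value here since we are not in a job.
SLURM_CPUS_ON_NODE=4

# Trust the Slurm value when present, otherwise fall back to nproc.
cores=${SLURM_CPUS_ON_NODE:-$(nproc)}
echo "$cores"   # prints 4
```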
Regarding availableWorkers() - "Get Set of Available Workers":

- have availableWorkers report as many workers as Slurm tasks.

If I understand you correctly, yes. I think with:

workers <- availableWorkers()
nworkers <- length(workers)

workers should expand to the set defined by JOB_NODELIST, and thereby nworkers should also equal the sum of JOB_CPUS_PER_NODE. That way, for instance,
plan(cluster, workers = availableWorkers())

which sets up a PSOCK cluster like

workers <- availableWorkers()
cl <- parallelly::makeClusterPSOCK(workers)

will result in nworkers parallel workers available to the parent R session. This requires SSH access from the main compute host to the other compute hosts (although with argument rsh other protocols can be used too). Again, as before, if the code does this, it's important that srun and the like don't spin up additional processes.
- the cluster plan sets workers to the output of availableWorkers by default (**)
(**) A user might think they are parallelizing across all available cores in this case, but they wouldn't be. One could issue a warning, but it would trigger falsely in the case of using threading (e.g., BLAS) nested within future's workers.

I'm not sure I follow here. See the above example saying it should indeed set up length(availableWorkers()) workers.
Regarding availableCores, having that report the number of CPUs on the current node (from CPUS_ON_NODE) makes sense to me.

Regarding availableWorkers, I think the question is whether JOB_NODELIST is expanded based on JOB_CPUS_PER_NODE or expanded based on TASKS_PER_NODE. I think the latter makes more sense. Why?

- future 'workers' are akin to Slurm (and MPI) 'tasks'. Users can then make use of multiple cores per worker via CPUS_PER_TASK if they use --cpus-per-task.
- Also, from your table it does not look to me that length(availableWorkers()) currently generally equals the sum of JOB_CPUS_PER_NODE.
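The difference matters in practice. For the (ntasks=4, nodes=2, cpus-per-task=2) row above, JOB_CPUS_PER_NODE is 6,2 while TASKS_PER_NODE is 3,1; a quick shell check of the two expansions (sum is a throwaway helper):

```shell
# Total workers under each expansion rule for ntasks=4, nodes=2, cpus-per-task=2.
sum() { total=0; for n in "$@"; do total=$((total + n)); done; echo "$total"; }

echo "by JOB_CPUS_PER_NODE: $(sum 6 2) workers"   # 8 workers: one per CPU
echo "by TASKS_PER_NODE:    $(sum 3 1) workers"   # 4 workers: one per task
```

Expanding by TASKS_PER_NODE yields the 4 workers that match --ntasks=4, leaving the second CPU of each task free for within-worker threading.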
. Regarding availableWorkers, I think the question is whether JOB_NODELIST is expanded based on JOB_CPUS_PER_NODE or expanded based on TASKS_PER_NODE. ...
Thanks for this. Let me digest this (=find some deep focus time to think more about it).
- Also, from your table it does not look to me that length(availableWorkers()) currently generally equals the sum of JOB_CPUS_PER_NODE.
Correct. availableWorkers() does not work as it should on Slurm; it gives too few hostnames in some cases. The code definitely over-complicates things. I think there's a history for why it ended up being implemented that way. Hopefully, it's in the issue tracker(s) somewhere.

So, before making any changes to availableWorkers(), I think it's good to revisit why it works the way it does right now. Because it's important to get this one correct, and not produce yet another semi-faulty version, I'll punt on this until after the next release, which I'm working on right now.
Just an updated run with more combinations:
# A tibble: 27 × 10
ntasks nodes cpus_per_task SLURM_JOB_NUM_NODES SLURM_JOB_NODELIST SLURM_JOB_CPUS_PER_NODE SLURM_CPUS_ON_NODE SLURM_TASKS_PER_NODE availableCores availableWorkers
<int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 NA NA NA 1 c4-n12 2 2 2 2 2*c4-n12
2 1 NA NA 1 c4-n12 2 2 1 2 2*c4-n12
3 1 1 NA 1 c4-n12 2 2 1 2 2*c4-n12
4 2 NA NA 1 c4-n12 2 2 2 2 2*c4-n12
5 2 1 NA 1 c4-n13 2 2 2 2 2*c4-n13
6 2 2 NA 2 c4-n[12-13] 2(x2) 2 1(x2) 1 2*c4-n12, 2*c4-n13
7 16 NA NA 1 c4-n12 16 16 16 16 16*c4-n12
8 16 1 NA 1 c4-n12 16 16 16 16 16*c4-n12
9 16 4 NA 4 c4-n[1-4] 10,2(x3) 10 10,2(x3) 10 10*c4-n1, 2*c4-n2, 2*c4-n3, 2*c4-n4
10 1 NA 1 1 c4-n12 2 2 1 1 1*c4-n12
11 1 1 1 1 c4-n12 2 2 1 1 1*c4-n12
12 2 NA 1 1 c4-n12 2 2 2 1 1*c4-n12
13 2 1 1 1 c4-n13 2 2 2 1 1*c4-n13
14 2 2 1 2 c4-n[12-13] 2(x2) 2 1(x2) 1 1*c4-n12, 1*c4-n13
15 16 4 1 4 c4-n[1-4] 10,2(x3) 10 10,2(x3) 1 1*c4-n1, 1*c4-n2, 1*c4-n3, 1*c4-n4
16 2 NA 2 1 c4-n13 4 4 2 2 2*c4-n13
17 2 1 2 1 c4-n13 4 4 2 2 2*c4-n13
18 2 2 2 2 c4-n[12-13] 2(x2) 2 1(x2) 2 2*c4-n12, 2*c4-n13
19 4 NA 2 1 c4-n12 8 8 4 2 2*c4-n12
20 4 2 2 2 c4-n[12-13] 6,2 6 3,1 2 2*c4-n12, 2*c4-n13
21 16 NA 3 1 c4-n13 48 48 16 3 3*c4-n13
22 16 NA 4 2 c4-n[12-13] 48,16 48 12,4 4 4*c4-n12, 4*c4-n13
23 16 NA 5 2 c4-n[12-13] 46,36 46 9,7 5 5*c4-n12, 5*c4-n13
24 NA 1-2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
25 NA 2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
26 NA 2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
27 NA 4 8 4 c4-n[1-4] 8(x4) 8 1(x4) 8 8*c4-n1, 8*c4-n2, 8*c4-n3, 8*c4-n4
Note to self: Also list what cgroups and nproc report for each config.
Now with nproc and cgroups allocations too:
> print(data2, n = 100L)
# A tibble: 27 × 14
ntasks cpus_per_task nodes NTASKS CPUS_PER_TASK JOB_NUM_NODES JOB_NODELIST TASKS_PER_NODE JOB_CPUS_PER_NODE CPUS_ON_NODE availableCores availableWorkers nproc cgroups
<int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <dbl>
1 NA NA NA NA NA 1 c4-n13 2 2 2 2 2*c4-n13 2 2
2 1 NA NA 1 NA 1 c4-n13 1 2 2 2 2*c4-n13 2 2
3 1 NA 1 1 NA 1 c4-n13 1 2 2 2 2*c4-n13 2 2
4 2 NA NA 2 NA 1 c4-n13 2 2 2 2 2*c4-n13 2 2
5 2 NA 1 2 NA 1 c4-n13 2 2 2 2 2*c4-n13 2 2
6 2 NA 2 2 NA 2 c4-n[2-3] 1(x2) 2(x2) 2 1 2*c4-n2, 2*c4-n3 2 2
7 16 NA NA 16 NA 1 c4-n13 16 16 16 16 16*c4-n13 16 16
8 16 NA 1 16 NA 1 c4-n4 16 16 16 16 16*c4-n4 16 16
9 16 NA 4 16 NA 4 c4-n[4-5,37,39] 2(x2),10,2 2(x2),10,2 2 2 10*c4-n37, 2*c4-n39, 2*c4-n4, 2*c4-n5 2 2
10 1 1 NA 1 1 1 c4-n13 1 2 2 1 1*c4-n13 2 2
11 1 1 1 1 1 1 c4-n13 1 2 2 1 1*c4-n13 2 2
12 2 1 NA 2 1 1 c4-n13 2 2 2 1 1*c4-n13 2 2
13 2 1 1 2 1 1 c4-n13 2 2 2 1 1*c4-n13 2 2
14 2 1 2 2 1 2 c4-n[3-4] 1(x2) 2(x2) 2 1 1*c4-n3, 1*c4-n4 2 2
15 16 1 4 16 1 4 c4-n[4-5,10,39] 7,4,3,2 8,4,3,2 8 1 1*c4-n10, 1*c4-n39, 1*c4-n4, 1*c4-n5 8 8
16 2 2 NA 2 2 1 c4-n13 2 4 4 2 2*c4-n13 4 4
17 2 2 1 2 2 1 c4-n13 2 4 4 2 2*c4-n13 4 4
18 2 2 2 2 2 2 c4-n[3-4] 1(x2) 2(x2) 2 2 2*c4-n3, 2*c4-n4 2 2
19 4 2 NA 4 2 1 c4-n13 4 8 8 2 2*c4-n13 8 8
20 4 2 2 4 2 2 c4-n[3-4] 3,1 6,2 6 2 2*c4-n3, 2*c4-n4 6 6
21 16 3 NA 16 3 2 c4-n[3-4] 12,4 36,12 36 3 3*c4-n3, 3*c4-n4 36 36
22 16 4 NA 16 4 1 c4-n38 16 64 64 4 4*c4-n38 64 64
23 16 5 NA 16 5 1 c4-n39 16 80 80 5 5*c4-n39 80 80
24 NA 8 1-2 NA 8 1 c4-n13 1 8 8 8 8*c4-n13 8 8
25 NA 8 2 NA 8 2 c4-n[2-3] 1(x2) 8(x2) 8 8 8*c4-n2, 8*c4-n3 8 8
26 NA 8 2 NA 8 2 c4-n[3-4] 1(x2) 8(x2) 8 8 8*c4-n3, 8*c4-n4 8 8
27 NA 8 4 NA 8 4 c4-n[3,37-39] 1(x4) 8(x4) 8 8 8*c4-n3, 8*c4-n37, 8*c4-n38, 8*c4-n39 8 8
This was generated using sbatch-params-all.R.txt
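For reference, one way to inspect those two extra columns directly from within a job step; the cgroup paths below are assumptions (v1 and v2 layouts differ between clusters):

```shell
# cgroup v2 exposes the effective CPU set in cpuset.cpus.effective; cgroup v1
# kept it under the cpuset controller directory. Fall back if neither exists.
cpus=$(cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null \
    || cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null \
    || echo "n/a")
echo "cgroup cpuset: $cpus"
echo "nproc:         $(nproc)"
```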
I noticed that in this case, available workers is based on -c (--cpus-per-task) and not -n (--ntasks). I think in this case it would be more natural to have it report 4 workers, with the user probably wanting to use threading (e.g., linear algebra or Rcpp+OpenMP) within each of the 4 workers. That said, there may be cases where returning 2 available workers makes the most sense.

However, the result above seems inconsistent with this next result (after all, why should it matter how many nodes the 4 tasks are running on?):

That said, I can imagine that handling all the various ways a user might use --nodes (-N), --ntasks (-n), and --cpus-per-task (-c) might be tricky...

EDIT 2022-12-13: Add long option formats for clarification. /Henrik