HenrikBengtsson / parallelly

R package: parallelly - Enhancing the 'parallel' Package
https://parallelly.futureverse.org
128 stars 7 forks source link

behavior of `availableWorkers` when Slurm `--nodes`, `--ntasks` and `--cpus-per-task` are provided #85

Open paciorek opened 1 year ago

paciorek commented 1 year ago

I noticed that in this case, available workers is based on -c (--cpus-per-task) and not -n (--ntasks).

paciorek@smeagol:~> srun --ntasks=4 --cpus-per-task=2 --pty bash
paciorek@scf-sm11:~> Rscript -e "parallelly::availableWorkers()"
[1] "scf-sm11" "scf-sm11"

I think in this case, it would be more natural to have it report 4 workers, with the user probably wanting to use threading (e.g., linear algebra or Rcpp+openMP) within each of the 4 workers. That said, there may be cases where returning 2 available workers makes most sense.

However, the result above seems inconsistent with this next result (after all, why should it matter how many nodes the 4 tasks are running on?):

paciorek@smeagol:~> srun --nodes=2 --ntasks=4 --cpus-per-task=2 --pty bash
paciorek@scf-sm10:~> Rscript -e "parallelly::availableWorkers()"
[1] "scf-sm10" "scf-sm10" "scf-sm11" "scf-sm11"

That said, I can imagine handling all the various ways a user might use --nodes (-N), --ntasks (-n), and --cpus-per-task (-c) might be tricky...

EDIT 2022-12-13: Add long options formats for clarification. /Henrik

HenrikBengtsson commented 1 year ago

Sorry for the delay. I think we need to identify what combinations of (ntasks, nodes, cpus_per_task) exist and what to expect from availableCores() and availableWorkers().

For a starter, I've created a script (sbatch-params-all.R.txt) that sbatch submits a job script with different combinations of (ntasks, nodes, cpus_per_task) and gathers Slurm environment variables as well as availableCores() and availableWorkers() for each such job. Here's the result from some cases:

# A tibble: 25 × 13
   ntasks nodes cpus_per_task HOSTNAME NTASKS JOB_NUM_NODES JOB_NODELIST JOB_CPUS_PER_NODE TASKS_PER_NODE CPUS_PER_TASK CPUS_ON_NODE availableCores availableWorkers                   
    <int> <chr>         <int> <chr>    <chr>  <chr>         <chr>        <chr>             <chr>          <chr>         <chr>                 <dbl> <chr>                              
 1     NA NA               NA c4-n12   NA     1             c4-n12       2                 2              NA            2                         2 2*c4-n12                           
 2      1 NA               NA c4-n12   1      1             c4-n12       2                 1              NA            2                         2 2*c4-n12                           
 3      1 1                NA c4-n12   1      1             c4-n12       2                 1              NA            2                         2 2*c4-n12                           
 4      1 NA                1 c4-n12   1      1             c4-n12       2                 1              1             2                         1 1*c4-n12                           
 5      1 1                 1 c4-n12   1      1             c4-n12       2                 1              1             2                         1 1*c4-n12                           
 6      2 NA               NA c4-n12   2      1             c4-n12       2                 2              NA            2                         2 2*c4-n12                           
 7      2 NA                1 c4-n12   2      1             c4-n12       2                 2              1             2                         1 1*c4-n12                           
 8      2 NA                2 c4-n12   2      1             c4-n12       4                 2              2             4                         2 2*c4-n12                           
 9      2 1                NA c4-n12   2      1             c4-n12       2                 2              NA            2                         2 2*c4-n12                           
10      2 1                 1 c4-n12   2      1             c4-n12       2                 2              1             2                         1 1*c4-n12                           
11      2 1                 2 c4-n12   2      1             c4-n12       4                 2              2             4                         2 2*c4-n12                           
12      2 2                NA c4-n12   2      2             c4-n[12-13]  2(x2)             1(x2)          NA            2                         1 2*c4-n12, 2*c4-n13                 
13      2 2                 1 c4-n12   2      2             c4-n[12-13]  2(x2)             1(x2)          1             2                         1 1*c4-n12, 1*c4-n13                 
14      2 2                 2 c4-n12   2      2             c4-n[12-13]  2(x2)             1(x2)          2             2                         2 2*c4-n12, 2*c4-n13                 
15      4 NA                2 c4-n12   4      1             c4-n12       8                 4              2             8                         2 2*c4-n12                           
16      4 2                 2 c4-n12   4      2             c4-n[12-13]  6,2               3,1            2             6                         2 2*c4-n12, 2*c4-n13                 
17     16 NA               NA c4-n12   16     1             c4-n12       16                16             NA            16                       16 16*c4-n12                          
18     16 NA                4 c4-n12   16     2             c4-n[12-13]  40,24             10,6           4             40                        4 4*c4-n12, 4*c4-n13                 
19     16 1                NA c4-n13   16     1             c4-n13       16                16             NA            16                       16 16*c4-n13                          
20     16 4                NA c4-n1    16     4             c4-n[1-4]    10,2(x3)          10,2(x3)       NA            10                       10 10*c4-n1, 2*c4-n2, 2*c4-n3, 2*c4-n4
21     16 4                 1 c4-n1    16     4             c4-n[1-4]    10,2(x3)          10,2(x3)       1             10                        1 1*c4-n1, 1*c4-n2, 1*c4-n3, 1*c4-n4 
22     NA 1-2               8 c4-n12   NA     2             c4-n[12-13]  8(x2)             1(x2)          8             8                         8 8*c4-n12, 8*c4-n13                 
23     NA 2                 8 c4-n12   NA     2             c4-n[12-13]  8(x2)             1(x2)          8             8                         8 8*c4-n12, 8*c4-n13                 
24     NA 2                 8 c4-n12   NA     2             c4-n[12-13]  8(x2)             1(x2)          8             8                         8 8*c4-n12, 8*c4-n13                 
25     NA 4                 8 c4-n1    NA     4             c4-n[1-4]    8(x4)             1(x4)          8             8                         8 8*c4-n1, 8*c4-n2, 8*c4-n3, 8*c4-n4 

PS. I've dropped the SLURM_ prefix for the env vars in this table.

HenrikBengtsson commented 1 year ago

Here's a cleaned up version (sbatch-params-all.R.txt). I'm now sorting by SLURM_CPUS_PER_TASK == cpus_per_task:

# A tibble: 25 × 10
   ntasks nodes cpus_per_task JOB_NUM_NODES JOB_NODELIST    JOB_CPUS_PER_NODE CPUS_ON_NODE TASKS_PER_NODE availableCores availableWorkers                    
    <int> <chr>         <int> <chr>         <chr>           <chr>             <chr>        <chr>                   <dbl> <chr>                               
 1     NA NA               NA 1             c4-n12          2                 2            2                           2 2*c4-n12                            
 2      1 NA               NA 1             c4-n12          2                 2            1                           2 2*c4-n12                            
 3      1 1                NA 1             c4-n12          2                 2            1                           2 2*c4-n12                            
 4     16 NA               NA 1             c4-n12          16                16           16                         16 16*c4-n12                           
 5     16 1                NA 1             c4-n13          16                16           16                         16 16*c4-n13                           
 6     16 4                NA 4             c4-n[3-5,37]    10,2(x3)          10           10,2(x3)                   10 10*c4-n3, 2*c4-n37, 2*c4-n4, 2*c4-n5
 7      2 NA               NA 1             c4-n12          2                 2            2                           2 2*c4-n12                            
 8      2 1                NA 1             c4-n12          2                 2            2                           2 2*c4-n12                            
 9      2 2                NA 2             c4-n[12-13]     2(x2)             2            1(x2)                       1 2*c4-n12, 2*c4-n13                  
10      1 NA                1 1             c4-n12          2                 2            1                           1 1*c4-n12                            
11      1 1                 1 1             c4-n12          2                 2            1                           1 1*c4-n12                            
12     16 4                 1 4             c4-n[3-5,37]    10,2(x3)          10           10,2(x3)                    1 1*c4-n3, 1*c4-n37, 1*c4-n4, 1*c4-n5 
13      2 NA                1 1             c4-n12          2                 2            2                           1 1*c4-n12                            
14      2 1                 1 1             c4-n12          2                 2            2                           1 1*c4-n12                            
15      2 2                 1 2             c4-n[12-13]     2(x2)             2            1(x2)                       1 1*c4-n12, 1*c4-n13                  
16      2 NA                2 1             c4-n12          4                 4            2                           2 2*c4-n12                            
17      2 1                 2 1             c4-n12          4                 4            2                           2 2*c4-n12                            
18      2 2                 2 2             c4-n[12-13]     2(x2)             2            1(x2)                       2 2*c4-n12, 2*c4-n13                  
19      4 NA                2 1             c4-n12          8                 8            4                           2 2*c4-n12                            
20      4 2                 2 2             c4-n[12-13]     6,2               6            3,1                         2 2*c4-n12, 2*c4-n13                  
21     16 NA                4 2             c4-n[12-13]     40,24             40           10,6                        4 4*c4-n12, 4*c4-n13                  
22     NA 1-2               8 2             c4-n[12-13]     8(x2)             8            1(x2)                       8 8*c4-n12, 8*c4-n13                  
23     NA 2                 8 2             c4-n[12-13]     8(x2)             8            1(x2)                       8 8*c4-n12, 8*c4-n13                  
24     NA 2                 8 2             c4-n[12-13]     8(x2)             8            1(x2)                       8 8*c4-n12, 8*c4-n13                  
25     NA 4                 8 4             c4-n[3-4,38-39] 8(x4)             8            1(x4)                       8 8*c4-n3, 8*c4-n38, 8*c4-n39, 8*c4-n4
paciorek commented 1 year ago

This is a fun little puzzle!

Here's one possible approach that tries to have availableCores and availableWorkers follow closely Slurm's conception of "CPUS" and "tasks".

(*) This is not good in the case of a multi-node allocation and a user using the multisession or multicore plan, but this would mostly be user error in understanding how distributed computing works. Perhaps issue a warning or do something else in that case. If one instead had availableCores be based only on the primary node, a user would have idle nodes, so this is not good either.

(**) A user might think they are parallelizing across all available cores in this case, but they wouldn't be. One could issue a warning, but it would trigger falsely in the case of using threading (e.g., BLAS) nested within future's workers.

HenrikBengtsson commented 1 year ago

(Disclaimer: I'm not thinking about multithreading and threads per CPU core at all here)

Hi. Thanks. Some quick comments for clarification/additional constraints:

  • have availableCores report the total number of cores available across all nodes

The design and purpose of availableCores() is to "Get Number of Available Cores on The Current Machine". It is meant to take the place of parallel::detectCores(), which is the common go-to for identifying the number of cores available, but which also comes with quite a few potholes (https://www.jottr.org/2022/12/05/avoid-detectcores/).

So, availableCores() should return the number of cores available on the current machine. Looking at the code, which evolved over time, it's quite clear the current strategy is complex and convoluted. I haven't had time to fully digest the idea, but maybe availableCores() should return CPUS_ON_NODE. That should report on the number of cores that Slurm has given the job to work with on that machine. Now, to avoid over using, it's of course important that the user does not spin up sub-jobs via srun and alike.

Regarding availableWorkers() - "Get Set of Available Workers":

have availableWorkers report as many workers as Slurm tasks.

If I understand you correctly, yes, I think with:

workers <- availableWorkers()
nworkers <- length(workers)

workers should expand to the set defined by JOB_NODELIST, and thereby, nworkers should also be equal to the sum of JOB_CPUS_PER_NODE. That way, for instance,

plan(cluster, workers = availableWorkers())

which sets up a PSOCK cluster like

workers <- availableWorkers()
cl <- parallelly::makeClusterPSOCK(workers)

will result in length(nworkers) parallel workers available to the parent R session. This requires SSH access from the main compute host to the other compute hosts (although with argument rsh other protocols can be used too). Again, as before, if the code does this, it's important that srun and likes doesn't spin up additional processes.

cluster plan sets workers to the output of availableWorkers by default () () A user might think they are parallelizing across all available cores in this case, but they wouldn't be. One could issue a warning, but it would trigger falsely in the case of using threading (e.g., BLAS) nested within future's workers.

I'm not sure, I follow here. See above example saying it should indeed set up length(availableWorkers()) workers.

paciorek commented 1 year ago

Regarding availableCores, having that report the number of CPUs on the current node (from CPUS_ON_NODE) makes sense to me.

Regarding availableWorkers, I think the question is whether JOB_NODELIST is expanded based on JOB_CPUS_PER_NODE or expanded based on TASKS_PER_NODE. I think the latter makes more sense. Why?

  1. It meshes well with how Slurm works -- future 'workers' are akin to Slurm (and MPI) 'tasks'.
  2. This then allows the user to use threaded code within a future worker and to base the number of threads on CPUS_PER_TASK if they use --cpus-per-task.
  3. Also, from your table it does not look to me that length(availableWorkers()) currently generally equals the sum of JOB_CPUS_PER_NODE.
HenrikBengtsson commented 1 year ago

Regarding availableWorkers, I think the question is whether JOB_NODELIST is expanded based on JOB_CPUS_PER_NODE or expanded based on TASKS_PER_NODE. ...

Thanks for this. Let me digest this (=find some deep focus time to think more about it).

  1. Also, from your table it does not look to me that length(availableWorkers()) currently generally equals the sum of JOB_CPUS_PER_NODE.

Correct. availableWorkers() does not work as it should on Slurm; it gives too few hostnames in some cases. The code definitely over-complicates things. I think there's a history for why it ended up being implemented that way. Hopefully, it's in the issue tracker(s) somewhere.

So, before making any changes to availableWorkers(), I think it's good to revisit why it's working as it does right now. Because it's important to get this one correct, and not yet another semi-faulty version, I'll punt on this one until after the next release, which I'm working on right now.

HenrikBengtsson commented 1 year ago

Just an updated run with more combinations:

# A tibble: 27 × 10
   ntasks nodes cpus_per_task SLURM_JOB_NUM_NODES SLURM_JOB_NODELIST SLURM_JOB_CPUS_PER_NODE SLURM_CPUS_ON_NODE SLURM_TASKS_PER_NODE availableCores availableWorkers                   
    <int> <chr>         <int> <chr>               <chr>              <chr>                   <chr>              <chr>                         <dbl> <chr>                              
 1     NA NA               NA 1                   c4-n12             2                       2                  2                                 2 2*c4-n12                           
 2      1 NA               NA 1                   c4-n12             2                       2                  1                                 2 2*c4-n12                           
 3      1 1                NA 1                   c4-n12             2                       2                  1                                 2 2*c4-n12                           
 4      2 NA               NA 1                   c4-n12             2                       2                  2                                 2 2*c4-n12                           
 5      2 1                NA 1                   c4-n13             2                       2                  2                                 2 2*c4-n13                           
 6      2 2                NA 2                   c4-n[12-13]        2(x2)                   2                  1(x2)                             1 2*c4-n12, 2*c4-n13                 
 7     16 NA               NA 1                   c4-n12             16                      16                 16                               16 16*c4-n12                          
 8     16 1                NA 1                   c4-n12             16                      16                 16                               16 16*c4-n12                          
 9     16 4                NA 4                   c4-n[1-4]          10,2(x3)                10                 10,2(x3)                         10 10*c4-n1, 2*c4-n2, 2*c4-n3, 2*c4-n4
10      1 NA                1 1                   c4-n12             2                       2                  1                                 1 1*c4-n12                           
11      1 1                 1 1                   c4-n12             2                       2                  1                                 1 1*c4-n12                           
12      2 NA                1 1                   c4-n12             2                       2                  2                                 1 1*c4-n12                           
13      2 1                 1 1                   c4-n13             2                       2                  2                                 1 1*c4-n13                           
14      2 2                 1 2                   c4-n[12-13]        2(x2)                   2                  1(x2)                             1 1*c4-n12, 1*c4-n13                 
15     16 4                 1 4                   c4-n[1-4]          10,2(x3)                10                 10,2(x3)                          1 1*c4-n1, 1*c4-n2, 1*c4-n3, 1*c4-n4 
16      2 NA                2 1                   c4-n13             4                       4                  2                                 2 2*c4-n13                           
17      2 1                 2 1                   c4-n13             4                       4                  2                                 2 2*c4-n13                           
18      2 2                 2 2                   c4-n[12-13]        2(x2)                   2                  1(x2)                             2 2*c4-n12, 2*c4-n13                 
19      4 NA                2 1                   c4-n12             8                       8                  4                                 2 2*c4-n12                           
20      4 2                 2 2                   c4-n[12-13]        6,2                     6                  3,1                               2 2*c4-n12, 2*c4-n13                 
21     16 NA                3 1                   c4-n13             48                      48                 16                                3 3*c4-n13                           
22     16 NA                4 2                   c4-n[12-13]        48,16                   48                 12,4                              4 4*c4-n12, 4*c4-n13                 
23     16 NA                5 2                   c4-n[12-13]        46,36                   46                 9,7                               5 5*c4-n12, 5*c4-n13                 
24     NA 1-2               8 2                   c4-n[12-13]        8(x2)                   8                  1(x2)                             8 8*c4-n12, 8*c4-n13                 
25     NA 2                 8 2                   c4-n[12-13]        8(x2)                   8                  1(x2)                             8 8*c4-n12, 8*c4-n13                 
26     NA 2                 8 2                   c4-n[12-13]        8(x2)                   8                  1(x2)                             8 8*c4-n12, 8*c4-n13                 
27     NA 4                 8 4                   c4-n[1-4]          8(x4)                   8                  1(x4)                             8 8*c4-n1, 8*c4-n2, 8*c4-n3, 8*c4-n4 
HenrikBengtsson commented 1 month ago

Note to self: List also what CGroups and nproc report for each config.

HenrikBengtsson commented 1 month ago

Now with nproc and cgroups allocations too:

> print(data2, n = 100L)
# A tibble: 27 × 14
   ntasks cpus_per_task nodes NTASKS CPUS_PER_TASK JOB_NUM_NODES JOB_NODELIST    TASKS_PER_NODE JOB_CPUS_PER_NODE CPUS_ON_NODE availableCores availableWorkers                      nproc cgroups
    <int>         <int> <chr> <chr>  <chr>         <chr>         <chr>           <chr>          <chr>             <chr>                 <dbl> <chr>                                 <int>   <dbl>
 1     NA            NA NA    NA     NA            1             c4-n13          2              2                 2                         2 2*c4-n13                                  2       2
 2      1            NA NA    1      NA            1             c4-n13          1              2                 2                         2 2*c4-n13                                  2       2
 3      1            NA 1     1      NA            1             c4-n13          1              2                 2                         2 2*c4-n13                                  2       2
 4      2            NA NA    2      NA            1             c4-n13          2              2                 2                         2 2*c4-n13                                  2       2
 5      2            NA 1     2      NA            1             c4-n13          2              2                 2                         2 2*c4-n13                                  2       2
 6      2            NA 2     2      NA            2             c4-n[2-3]       1(x2)          2(x2)             2                         1 2*c4-n2, 2*c4-n3                          2       2
 7     16            NA NA    16     NA            1             c4-n13          16             16                16                       16 16*c4-n13                                16      16
 8     16            NA 1     16     NA            1             c4-n4           16             16                16                       16 16*c4-n4                                 16      16
 9     16            NA 4     16     NA            4             c4-n[4-5,37,39] 2(x2),10,2     2(x2),10,2        2                         2 10*c4-n37, 2*c4-n39, 2*c4-n4, 2*c4-n5     2       2
10      1             1 NA    1      1             1             c4-n13          1              2                 2                         1 1*c4-n13                                  2       2
11      1             1 1     1      1             1             c4-n13          1              2                 2                         1 1*c4-n13                                  2       2
12      2             1 NA    2      1             1             c4-n13          2              2                 2                         1 1*c4-n13                                  2       2
13      2             1 1     2      1             1             c4-n13          2              2                 2                         1 1*c4-n13                                  2       2
14      2             1 2     2      1             2             c4-n[3-4]       1(x2)          2(x2)             2                         1 1*c4-n3, 1*c4-n4                          2       2
15     16             1 4     16     1             4             c4-n[4-5,10,39] 7,4,3,2        8,4,3,2           8                         1 1*c4-n10, 1*c4-n39, 1*c4-n4, 1*c4-n5      8       8
16      2             2 NA    2      2             1             c4-n13          2              4                 4                         2 2*c4-n13                                  4       4
17      2             2 1     2      2             1             c4-n13          2              4                 4                         2 2*c4-n13                                  4       4
18      2             2 2     2      2             2             c4-n[3-4]       1(x2)          2(x2)             2                         2 2*c4-n3, 2*c4-n4                          2       2
19      4             2 NA    4      2             1             c4-n13          4              8                 8                         2 2*c4-n13                                  8       8
20      4             2 2     4      2             2             c4-n[3-4]       3,1            6,2               6                         2 2*c4-n3, 2*c4-n4                          6       6
21     16             3 NA    16     3             2             c4-n[3-4]       12,4           36,12             36                        3 3*c4-n3, 3*c4-n4                         36      36
22     16             4 NA    16     4             1             c4-n38          16             64                64                        4 4*c4-n38                                 64      64
23     16             5 NA    16     5             1             c4-n39          16             80                80                        5 5*c4-n39                                 80      80
24     NA             8 1-2   NA     8             1             c4-n13          1              8                 8                         8 8*c4-n13                                  8       8
25     NA             8 2     NA     8             2             c4-n[2-3]       1(x2)          8(x2)             8                         8 8*c4-n2, 8*c4-n3                          8       8
26     NA             8 2     NA     8             2             c4-n[3-4]       1(x2)          8(x2)             8                         8 8*c4-n3, 8*c4-n4                          8       8
27     NA             8 4     NA     8             4             c4-n[3,37-39]   1(x4)          8(x4)             8                         8 8*c4-n3, 8*c4-n37, 8*c4-n38, 8*c4-n39     8       8

This was generated using sbatch-params-all.R.txt