cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Slurm - makeflow 7.2 failing when 8 jobs are launched simultaneously, succeeding when 1 is launched #2546

Closed stemangiola closed 3 years ago

stemangiola commented 3 years ago

Hello,

I have recently run into some strange behaviour. When I use a large makeflow file (which used to work) and many jobs are launched, I get this error:

Rscript dev/TCGA_makeflow_pipeline/infer_lv.R dev/armet_OV_input.rds dev/armet_OV_lv_1.rds 1 0 failed with exit code 1
deleted makeflow.failed.50

(how can I know what is going wrong)

If I just pick a couple of commands from this long file

CATEGORY=lv4
MEMORY=60024
CORES=4
dev/armet_ACC_lv_4_gender_regression.rds: dev/armet_ACC_lv_4_gender.rds
    Rscript dev/TCGA_makeflow_pipeline/infer_censored.R dev/armet_ACC_lv_4_gender.rds dev/armet_ACC_lv_4_gender_regression.rds
dev/armet_BLCA_lv_4_gender_regression.rds: dev/armet_BLCA_lv_4_gender.rds
    Rscript dev/TCGA_makeflow_pipeline/infer_censored.R dev/armet_BLCA_lv_4_gender.rds dev/armet_BLCA_lv_4_gender_regression.rds

Then I don't have that error

How can I know what is going on?

btovar commented 3 years ago

Stefano,

I would like to know more about that exit code 1 from the R script. Do you know what would cause such an error from armet_{ACC,BLCA}_lv_4_gender.rds?

One possible, shot-in-the-dark scenario is that the rules producing dev/armet_{ACC,BLCA}_lv_4_gender_regression.rds use some intermediate file that has not been declared. If the file is not there, the R scripts may try to generate it. Running a single rule at a time for each script, and many thereafter, then works because all the needed files are present by that point. Running many rules at the same time from the start makes them all try to generate the file, and they step on each other. Such intermediate files may be automatically generated temporary files, which are harder to track.
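To illustrate the kind of collision described above, here is a minimal shell sketch, not taken from the actual R scripts. It assumes a hypothetical shared file (shared_model.bin) that every task builds if it is missing. Each task writes to a private temporary file and then renames it into place, so concurrent tasks never see a half-written file. All names here are made up for the example.

```shell
#!/bin/sh
# Hypothetical shared intermediate file that several tasks might try to build.
cache_dir=$(mktemp -d)
target="$cache_dir/shared_model.bin"

build_shared_file() {
    # Write to a private temporary file, then rename it into place.
    # A rename within one filesystem is atomic, so other tasks see either
    # no file or the complete file, never a partial one.
    tmp=$(mktemp "$cache_dir/shared_model.XXXXXX")
    echo "compiled model" > "$tmp"
    mv "$tmp" "$target"
}

# Launch eight "tasks" at once; each builds the file only if it is missing.
for i in 1 2 3 4 5 6 7 8; do
    ( [ -f "$target" ] || build_shared_file ) &
done
wait

cat "$target"
```

Several tasks may still do the build redundantly, but none of them corrupts the result. Within Makeflow itself, the more direct remedy is probably to declare the intermediate file as the output of its own rule so that it is generated once.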

Note that in the example with two rules that works you are using two different R scripts. What happens when the rules use the same script?

stemangiola commented 3 years ago

Thanks for your reply,

I found out that the simple R script

library(rstan)

would cause failure if called ~30 times, and all jobs start simultaneously. Would you be able to replicate the error?

I am inquiring with them, but my question to you is: how can I see the error message? What option/pipe should I use?

btovar commented 3 years ago

Stefano,

Something like this may help:

dev/armet_ACC_lv_4_gender_regression.rds  dev/armet_ACC_lv_4_gender_regression.stderr: dev/armet_ACC_lv_4_gender.rds
    Rscript dev/TCGA_makeflow_pipeline/infer_censored.R dev/armet_ACC_lv_4_gender.rds dev/armet_ACC_lv_4_gender_regression.rds > dev/armet_ACC_lv_4_gender_regression.stderr  2>&1

If the R script prints anything to the console, either normal output (stdout) or error output (stderr), then both outputs should appear in dev/armet_ACC_lv_4_gender_regression.stderr
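As a small illustration of the redirection used in the rule above, the following shell sketch uses a hypothetical function, noisy_task, in place of the Rscript call. It writes one line to stdout and one to stderr, then fails with exit code 1. With `> file 2>&1`, both lines end up in the same log file.

```shell
#!/bin/sh
# Hypothetical stand-in for the Rscript invocation: prints to both streams,
# then fails with exit code 1.
noisy_task() {
    echo "normal output"
    echo "error output" >&2
    return 1
}

log=$(mktemp)

# Send stdout to the log first, then point stderr at the same destination.
noisy_task > "$log" 2>&1
status=$?

echo "exit status: $status"
cat "$log"
```

The order matters: writing `2>&1 > file` instead would leave stderr attached to the console, because stderr would be duplicated from stdout before stdout is redirected.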

btovar commented 3 years ago

Stefano,

We can work on eliminating possible issues where tasks are stepping on each other. I see that rstan needs to be told the number of cores. Is that something you are doing inside your R scripts?

I see that you need something like:

options(mc.cores = 4)

Or if you want to get the value set from CORES in makeflow:

options(mc.cores = as.numeric(Sys.getenv("CORES", unset = 4)))

Simply calling library(rstan) on my local machine >30 times does not produce an error. Do you see the error only in the cluster with slurm, or also somewhere else?
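One way to try reproducing this outside makeflow is to start many copies of the command at the same time and count the failures. The sketch below is only an assumption about how such a test could look. The variable cmd is a placeholder set to `true` so that the sketch runs anywhere; to test the real case it would be replaced by something like the Rscript call that loads rstan.

```shell
#!/bin/sh
# Placeholder for the real command (for example an Rscript call).
# "true" is used here only so that the sketch runs on any machine.
cmd="true"
jobs=30
logdir=$(mktemp -d)

pids=""
i=1
while [ "$i" -le "$jobs" ]; do
    # Start each job in the background with its own log for stdout and stderr.
    sh -c "$cmd" > "$logdir/job_$i.log" 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
done

failed=0
for pid in $pids; do
    # wait returns the exit status of that particular background job.
    wait "$pid" || failed=$((failed + 1))
done

echo "failed jobs: $failed of $jobs (logs in $logdir)"
```

If failures show up only above a certain number of simultaneous jobs, the per-job logs should show which component is breaking.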

stemangiola commented 3 years ago

Hello, thanks to your error piping I found out that the callr library was failing. I have no idea why and could not fix it, so I am reinstalling the whole R library and will let you know asap.

Thanks!

btovar commented 3 years ago

Stefano, glad to hear you are making progress. Please keep us posted if you find a solution, or other issues.

stemangiola commented 3 years ago

Hello, as far as I can tell, the R package future.batchtools, which I was trying out at the same time, destroys the temp directory. After that all sorts of problems start, instantly killing makeflow.

Thanks for your assistance!