cooperative-computing-lab / makeflow-examples

Example workflows for the Makeflow workflow system.

sbatch error : unable to open file M #44

Closed Itachi-505 closed 2 years ago

Itachi-505 commented 2 years ago

Hi,

I ran into a problem running makeflow through slurm. My workflow runs fine with the default local batch system (makeflow example.makeflow), but it fails when submitted through slurm (makeflow -T slurm -J 200 example.makeflow).

The error was:

sbatch: error: Unable to open file M.
2022/01/27 16:03:11.75 makeflow[27470] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...

Here is my makeflow script:

CATEGORY=test1_for_input_reading
MEMORY=10024
CORES=4
WALL_TIME=1500

/stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt:
	Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt

I am not sure what the file M is, and I have no idea how to fix this problem. Could you help me figure out the cause?

btovar commented 2 years ago

@Itachi-505 could you post the output of

makeflow -Tslurm -J1 -dbatch example.makeflow

Also, are you able to submit slurm jobs without makeflow? If so, do you need to specify any special parameters, like a preferred queue, maximum execution time, etc.? If you need to specify such things, the sbatch command may fail.

Itachi-505 commented 2 years ago

Hi, here are the first few lines of the output of makeflow -Tslurm -J1 -dbatch example.makeflow

parsing test1.makeflow...
local resources: 32.000 cores, 193277 MB memory, 29350296 MB disk
max running remote jobs: 1
max running local jobs: 32
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `local_job_queue' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `absolute_path' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `output_directories' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `batch_log_name' to `%s.batchlog'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `gc_size' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: created queue 0x557577cacde0 (slurm)
2022/01/28 00:01:26.42 makeflow[5274] batch: set logfile to `test1.makeflow.batchlog'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `batch-options'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `password'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `manager-mode' to `standalone'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `name'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `debug'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `priority'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `keepalive-interval'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `keepalive-timeout'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `caching' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `wait-queue-size'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `amazon-config'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `lambda-config'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `working-dir'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `manager-preferred-connection'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `amazon-batch-config'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `amazon-batch-img'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `safe-submit-mode' to `no'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `ignore-mem-spec' to `no'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `mem-type'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `keep-wrapper-stdout' to `no'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `tlq-port' to `0'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `fast-abort' to `-1.000000'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `local_job_queue' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `absolute_path' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `output_directories' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `batch_log_name' to `%s.batchlog'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `gc_size' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared feature `local_job_queue'
2022/01/28 00:01:26.42 makeflow[5274] batch: created queue 0x557577caf960 (local)
checking test1.makeflow for consistency...
test1.makeflow has 1 rules.
creating new log file test1.makeflow.makeflowlog...
checking files for unexpected changes...  (use --skip-file-check to skip this step)
starting workflow....
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `task-id' to `0'
submitting job: Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
2022/01/28 00:01:26.42 makeflow[5274] batch: sbatch --mem=10024 M --time=25000001 -N 1 -n 1 -c 4 -D . -e /dev/null --export=ALL -o /dev/null -J makeflow0 ./slurm.wrapper
sbatch: error: Unable to open file M
2022/01/28 00:01:26.43 makeflow[5274] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...

Itachi-505 commented 2 years ago

Hi, I submitted the slurm jobs with makeflow, and I set MEMORY, CORES, and WALL_TIME in my makeflow script.

btovar commented 2 years ago

Yes, I see a bug in makeflow. We are adding a space between the memory value and the units. I'll add a fix soon. If you need to run the jobs before that, I think the following should work:

Comment out the MEMORY=10024 line, and instead use:

makeflow -Tslurm -J1 -dbatch -B"--mem 10024M"  example.makeflow
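
The effect of the stray space can be reproduced without slurm. The following is a minimal sketch, assuming the logged command line is split on whitespace before sbatch sees it. show_args is a hypothetical helper, not part of makeflow or slurm, that prints each argument it receives. With the space, M arrives as its own argument, and since sbatch takes the first non-option argument as the batch script, that would explain "Unable to open file M".

```shell
# Hypothetical helper: print each argument on its own line,
# the way a program such as sbatch would receive them.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Form seen in the debug log: the space turns "M" into a separate argument.
show_args --mem=10024 M ./slurm.wrapper

# Form without the space: value and unit stay in a single argument.
show_args --mem=10024M ./slurm.wrapper
```

The first call prints three bracketed arguments, with [M] on its own line; the second prints two.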

Itachi-505 commented 2 years ago

No worries, I will wait for the fix. Thanks a lot for that. It would be great if you could post a new comment here once it is fixed. 👍

btovar commented 2 years ago

@Itachi-505 when you have a chance, could you give release/7.4.3rc2 a try?

https://github.com/cooperative-computing-lab/cctools/releases/tag/release%2F7.4.3rc2

Itachi-505 commented 2 years ago

Sure, but which one should I try? I use a Mac Pro. Is cctools-7.4.3rc2-x86_64-osx-10.15.tar.gz the correct one?

Itachi-505 commented 2 years ago

Also, I previously installed makeflow with conda.

btovar commented 2 years ago

Got it! Let me prepare a conda release for you.

btovar commented 2 years ago

@Itachi-505, when you have a chance could you give:

conda install -c conda-forge/label/ndcctools_rc -c conda-forge  ndcctools

a try?

Itachi-505 commented 2 years ago

Hi, installing with conda did not work either. It showed me this error:

Verifying transaction: failed

EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /stornext/System/data/apps/anaconda3/anaconda3-2019.03
  uid: 4609
  gid: 10908

Itachi-505 commented 2 years ago

How about I just download the new version 7.4.3rc2? Could you help me figure out which tar.gz is the correct one?

Here is my sessionInfo:

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-4.1.0/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

btovar commented 2 years ago

Let's give this a try:

curl -O http://ccl.cse.nd.edu/software/files/cctools-7.4.3rc3-x86_64-centos7.tar.gz
tar xf cctools-7.4.3rc3-x86_64-centos7.tar.gz

export PATH=$(pwd)/cctools-7.4.3rc3-x86_64-centos7.tar.gz-dir/bin:$PATH

makeflow -v

# expected output:
makeflow version 7.4.3 rc3 (released 2022-01-31 15:23:03 +0000)
    Built by root@bdbf72bcc55c on 2022-01-31 15:23:03 +0000

Itachi-505 commented 2 years ago

Hi, I think I got the expected output. I tried again and ran into another error:

/stornext/Home/data/allstaff/m/ma.m/cctools-7.4.3rc3-x86_64-centos7.tar.gz-dir/bin/makeflow -T slurm -J 200 /stornext/HPCScratch/home/ma.m/example/save_ppcseq.makeflow

starting workflow....
submitting job: Rscript /stornext/HPCScratch/home/ma.m/example/create_input.R /stornext/HPCScratch/home/ma.m/example/Check_Squamous_ppcseq.rds
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
2022/02/01 14:15:59.97 makeflow[25265] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...

btovar commented 2 years ago

I wonder if the WALL_TIME=1500 that you specified is too large for that particular queue. Could you try something like WALL_TIME=300, or drop it altogether, to see if that fixes it? If so, I would ask your sysadmin about the maximum times allowed per queue (called a partition in slurm). You can specify which partition to use with:

 makeflow -B '-p NAME_OF_PARTITION' ... rest of your arguments ...
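
A requested time can be compared with a partition's limit before submitting. On the cluster, sinfo -o '%P %l' should list each partition with its time limit. The sketch below assumes that WALL_TIME is given in seconds and that the limit is printed as D-HH:MM:SS, HH:MM:SS or MM:SS. limit_to_seconds is a hypothetical helper, and the 1:00:00 limit is a made-up example value, not one taken from this cluster.

```shell
# Hypothetical helper: convert a time limit in one of the forms
# D-HH:MM:SS, HH:MM:SS or MM:SS into seconds. Any other form
# (for example "infinite") prints nothing and returns non-zero.
limit_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print (($1 * 24 + $2) * 60 + $3) * 60 + $4; exit 0 }
        NF == 3 { print ($1 * 60 + $2) * 60 + $3; exit 0 }
        NF == 2 { print $1 * 60 + $2; exit 0 }
        { exit 1 }'
}

wall_time=1500                        # seconds (assumed unit of WALL_TIME)
limit=$(limit_to_seconds "1:00:00")   # made-up example limit of one hour

if [ "$wall_time" -le "$limit" ]; then
    echo "WALL_TIME fits within the limit"
else
    echo "WALL_TIME exceeds the limit"
fi
```

If the request fits and sbatch still rejects it, the value that actually reaches sbatch (visible with -dbatch) is the next thing to inspect.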

Itachi-505 commented 2 years ago

Hi, I changed it to WALL_TIME=300 and tried again, which gave me the same error:

starting workflow....
submitting job: Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
2022/02/02 22:20:45.06 makeflow[13587] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...

When I deleted the first 4 lines of my makeflow script (CATEGORY=test1_for_input_reading, MEMORY=10024, CORES=4, WALL_TIME=300), I got the correct output.

I am not sure which part is wrong. Could you help me double-check it?

btovar commented 2 years ago

Could you include the lines:

CATEGORY=test1_for_input_reading
MEMORY=10024
CORES=4

That is, do not include WALL_TIME, and then try running:

makeflow -Tslurm -B'--time=0:300' -J1 -dbatch example.makeflow
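
The value 0:300 uses sbatch's minutes:seconds form, so it asks for 0 minutes and 300 seconds. As a rough sketch, and assuming WALL_TIME is meant in seconds, a duration can be rewritten into that form with a small helper. to_slurm_time is a hypothetical name, not a makeflow or slurm command.

```shell
# Hypothetical helper: format a duration given in seconds
# as "minutes:seconds" for sbatch --time.
to_slurm_time() {
    printf '%d:%02d\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

to_slurm_time 300     # prints 5:00, the same duration as --time=0:300
to_slurm_time 1500    # prints 25:00
```

The result could then be passed along as, for example, -B"--time=25:00" until a fixed makeflow handles WALL_TIME itself.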

Itachi-505 commented 2 years ago

Hi, I got the correct results this time.

It showed some lines like this.

(base) [ma.m@slurm-login02 ~]$ /stornext/Home/data/allstaff/m/ma.m/cctools-7.4.3rc3-x86_64-centos7.tar.gz-dir/bin/makeflow -Tslurm -B'--time=0:300' -J1 -dbatch /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow
parsing /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow...
local resources: 32.000 cores, 193277 MB memory, 123482101 MB disk
max running remote jobs: 1
max running local jobs: 32
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `local_job_queue' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `absolute_path' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `output_directories' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `batch_log_name' to `%s.batchlog'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `gc_size' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: created queue 0x120bd60 (slurm)
2022/02/02 22:47:15.57 makeflow[29991] batch: set logfile to `/stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow.batchlog'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `batch-options' to `--time=0:300'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `password'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `manager-mode' to `standalone'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `name'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `debug'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `priority'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `keepalive-interval'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `keepalive-timeout'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `caching' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `wait-queue-size'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `amazon-config'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `lambda-config'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `working-dir'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `manager-preferred-connection'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `amazon-batch-config'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `amazon-batch-img'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `safe-submit-mode' to `no'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `ignore-mem-spec' to `no'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `mem-type'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `keep-wrapper-stdout' to `no'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `tlq-port' to `0'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `fast-abort' to `-1.000000'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `local_job_queue' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `absolute_path' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `output_directories' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `batch_log_name' to `%s.batchlog'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `gc_size' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared feature `local_job_queue'
2022/02/02 22:47:15.57 makeflow[29991] batch: created queue 0x120e9b0 (local)
checking /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow for consistency...
/stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow has 1 rules.
creating new log file /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow.makeflowlog...
checking files for unexpected changes...  (use --skip-file-check to skip this step)
starting workflow....
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `task-id' to `0'
submitting job: Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
2022/02/02 22:47:15.57 makeflow[29991] batch: sbatch --mem=10024M -N 1 -n 1 -c 4 -D . -e /dev/null --export=ALL -o /dev/null -J makeflow0 --time=0:300 ./slurm.wrapper
2022/02/02 22:47:15.59 makeflow[29991] batch: job 6477611 submitted
submitted job 6477611
2022/02/02 22:47:15.59 makeflow[29991] batch: set option `batch-options' to `--time=0:300'
2022/02/02 22:47:15.59 makeflow[29991] batch: could not open status file "slurm.status.6477611"
2022/02/02 22:47:16.59 makeflow[29991] batch: could not open status file "slurm.status.6477611"
2022/02/02 22:47:20.59 makeflow[29991] batch: job 6477611 complete
job 6477611 completed
nothing left to do.
2022/02/02 22:47:20.59 makeflow[29991] batch: deleting queue 0x120bd60
2022/02/02 22:47:20.59 makeflow[29991] batch: deleting queue 0x120e9b0

Is this correct this time?

btovar commented 2 years ago

Great! Thanks for your patience. I'll submit a fix for you to try very soon.

Itachi-505 commented 2 years ago

Sure, no worries! Thanks for your patience :)

btovar commented 2 years ago

The fix should be available here:

curl -O http://ccl.cse.nd.edu/software/files/cctools-7.4.3rc4-x86_64-centos7.tar.gz
tar xf cctools-7.4.3rc4-x86_64-centos7.tar.gz

export PATH=$(pwd)/cctools-7.4.3rc4-x86_64-centos7.tar.gz-dir/bin:$PATH

makeflow -Tslurm ...etc...

This should correctly process the WALL_TIME specifications.

Itachi-505 commented 2 years ago

Thanks for that! It works well this time. May I ask about another type of error (failed with exit code 137)? I googled it, and someone said the DATABASE_URL environment variable should be set on the docker run command line. Is this true? If so, how do I set it up? Should I add it in slurm or in my makeflow?

starting workflow....
submitting job: Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/All_ppcseq.rds
submitted job 6482560
job 6482560 completed
Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/All_ppcseq.rds failed with exit code 137
2022/02/03 18:57:44.97 makeflow[9710] error: rule 0 failed, cannot move outputs
2022/02/03 18:57:44.97 makeflow[9710] error: hook Fail Dir:node_fail returned 1
deleted /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/All_ppcseq.rds
nothing left to do.

btovar commented 2 years ago

@Itachi-505 nice, that sounds like progress!

It does not look like you are using docker, but R? The exit code 137 usually means that some watchdog program terminated your process by sending a signal.

If it does not fail right away, a first thing to try is to increase the value of WALL_TIME (did you set it back to the original 1500?).
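
For reference, shells report a process killed by signal N with exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL on Linux). The sketch below only does that arithmetic. signal_from_exit is a hypothetical helper, and the sacct line in the comment is a suggestion for checking on the cluster whether the job hit a time or memory limit, not something that was run in this thread.

```shell
# Hypothetical helper: for an exit status above 128, print the number of
# the signal that ended the process; otherwise print 0.
signal_from_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0
    fi
}

sig=$(signal_from_exit 137)
echo "exit code 137 -> signal $sig"
kill -l "$sig"    # prints the name of that signal

# On the cluster, the job's accounting record may show which limit was hit:
#   sacct -j 6482560 --format=JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem
```

If the accounting record shows the elapsed time near the time limit, raising WALL_TIME is the natural next step; if the memory usage is near the request, raising MEMORY may be worth trying instead.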

Itachi-505 commented 2 years ago

Yes, I am using R, not docker, and I changed my WALL_TIME to 1500. Is that still too small for my job to run?

btovar commented 2 years ago

I'm not sure... How long do you expect your tasks to run? Does it fail right away? Could you try again and send me the output of:

makeflow -Tslurm -J1 -dall example.makeflow

btovar commented 2 years ago

Fixed in 7.4.3