Closed Itachi-505 closed 2 years ago
@Itachi-505 could you post the output of
makeflow -Tslurm -J1 -dbatch example.makeflow
Also, are you able to submit slurm jobs without makeflow? If so, do you need to specify any special parameters, like a preferred queue, maximum execution time, etc.? If you need to specify such things, the sbatch command may fail.
Hi, here are the first few lines of the output of makeflow -Tslurm -J1 -dbatch example.makeflow
parsing test1.makeflow...
local resources: 32.000 cores, 193277 MB memory, 29350296 MB disk
max running remote jobs: 1
max running local jobs: 32
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `local_job_queue' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `absolute_path' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `output_directories' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `batch_log_name' to `%s.batchlog'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `gc_size' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: created queue 0x557577cacde0 (slurm)
2022/01/28 00:01:26.42 makeflow[5274] batch: set logfile to `test1.makeflow.batchlog'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `batch-options'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `password'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `manager-mode' to `standalone'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `name'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `debug'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `priority'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `keepalive-interval'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `keepalive-timeout'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `caching' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `wait-queue-size'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `amazon-config'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `lambda-config'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `working-dir'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `manager-preferred-connection'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `amazon-batch-config'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `amazon-batch-img'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `safe-submit-mode' to `no'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `ignore-mem-spec' to `no'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared option `mem-type'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `keep-wrapper-stdout' to `no'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `tlq-port' to `0'
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `fast-abort' to `-1.000000'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `local_job_queue' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `absolute_path' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `output_directories' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `batch_log_name' to `%s.batchlog'
2022/01/28 00:01:26.42 makeflow[5274] batch: set feature `gc_size' to `yes'
2022/01/28 00:01:26.42 makeflow[5274] batch: cleared feature `local_job_queue'
2022/01/28 00:01:26.42 makeflow[5274] batch: created queue 0x557577caf960 (local)
checking test1.makeflow for consistency...
test1.makeflow has 1 rules.
creating new log file test1.makeflow.makeflowlog...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
2022/01/28 00:01:26.42 makeflow[5274] batch: set option `task-id' to `0'
submitting job: Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
2022/01/28 00:01:26.42 makeflow[5274] batch: sbatch --mem=10024 M --time=25000001 -N 1 -n 1 -c 4 -D . -e /dev/null --export=ALL -o /dev/null -J makeflow0 ./slurm.wrapper
sbatch: error: Unable to open file M
2022/01/28 00:01:26.43 makeflow[5274] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...
Hi, I submitted the Slurm job through makeflow, and my makeflow script specifies the memory, cores, and WALL_TIME.
Yes, I see a bug in makeflow. We are adding a space between the memory value and the units, so sbatch treats the stray "M" as the name of the script to open. I'll add a fix soon. If you need to run the jobs before that, I think the following should work:
Comment out the MEMORY=10024 line, and instead use:
makeflow -Tslurm -J1 -dbatch -B"--mem 10024M" example.makeflow
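To see why the extra space matters, here is a minimal sketch of how the shell splits the two forms into arguments. The helper first_operand is hypothetical and only roughly imitates how a tool picks its script operand; it is not how sbatch actually parses its command line.

```shell
# Hypothetical helper: print the first argument that does not start with "-".
# Rough stand-in for "which argument gets taken as the script to run".
first_operand() {
    for arg in "$@"; do
        case "$arg" in
            -*) ;;                                  # looks like an option, skip it
            *)  printf '%s\n' "$arg"; return 0 ;;   # first plain word wins
        esac
    done
    return 1
}

# Buggy form: the space turns "M" into its own argument.
first_operand --mem=10024 M ./slurm.wrapper     # prints: M

# Fixed form: value and unit stay in a single token.
first_operand --mem=10024M ./slurm.wrapper      # prints: ./slurm.wrapper
```

In the buggy form the first plain word is M, which matches the "Unable to open file M" message in the log above.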
No worries, I will wait for the fix. Thanks a lot for that. It would be great if you could post a new comment here once it is fixed. 👍
@Itachi-505 when you have a chance, could you give release/7.4.3rc2 a try?
https://github.com/cooperative-computing-lab/cctools/releases/tag/release%2F7.4.3rc2
Sure. But which one should I try? I use a Mac Pro. Is cctools-7.4.3rc2-x86_64-osx-10.15.tar.gz the correct one?
Also, I previously installed makeflow with conda.
Got it! Let me prepare a conda release for you.
@Itachi-505, when you have a chance could you give:
conda install -c conda-forge/label/ndcctools_rc -c conda-forge ndcctools
a try?
Hi, installing with conda did not work either. It showed me this error:
Verifying transaction: failed
EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /stornext/System/data/apps/anaconda3/anaconda3-2019.03
  uid: 4609
  gid: 10908
How about I just download the new version 7.4.3rc2? Could you help me figure out which tar.gz is the correct one?
Here is my sessionInfo:
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /stornext/System/data/apps/R/R-4.1.0/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-4.1.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
Let's give this a try:
curl -O http://ccl.cse.nd.edu/software/files/cctools-7.4.3rc3-x86_64-centos7.tar.gz
tar xf cctools-7.4.3rc3-x86_64-centos7.tar.gz
export PATH=$(pwd)/cctools-7.4.3rc3-x86_64-centos7.tar.gz-dir/bin:$PATH
makeflow -v
# expected output:
makeflow version 7.4.3 rc3 (released 2022-01-31 15:23:03 +0000)
Built by root@bdbf72bcc55c on 2022-01-31 15:23:03 +0000
Hi, I think I got the expected output. I tried again and ran into another error:
/stornext/Home/data/allstaff/m/ma.m/cctools-7.4.3rc3-x86_64-centos7.tar.gz-dir/bin/makeflow -T slurm -J 200 /stornext/HPCScratch/home/ma.m/example/save_ppcseq.makeflow
starting workflow....
submitting job: Rscript /stornext/HPCScratch/home/ma.m/example/create_input.R /stornext/HPCScratch/home/ma.m/example/Check_Squamous_ppcseq.rds
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
2022/02/01 14:15:59.97 makeflow[25265] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...
I wonder if the WALL_TIME=1500 that you specified is too large for that particular queue. Could you try something like WALL_TIME=300, or drop it altogether, to see if that fixes it? If so, I would ask your sysadmin about the maximum times allowed per queue (called a partition in Slurm). You can specify which partition to use with:
makeflow -B '-p NAME_OF_PARTITION' ... rest of your arguments ...
Hi, I changed it to WALL_TIME=300 and tried again, which gave me the same error:
starting workflow....
submitting job: Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
2022/02/02 22:20:45.06 makeflow[13587] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...
When I deleted the four lines at the top of my makeflow file (CATEGORY=test1_for_input_reading, MEMORY=10024, CORES=4, WALL_TIME=300), I got the correct output.
I am not sure which part is wrong. Could you help me double-check it?
Could you include the lines:
CATEGORY=test1_for_input_reading
MEMORY=10024
CORES=4
That is, do not include WALL_TIME, and then try:
makeflow -Tslurm -B'--time=0:300' -J1 -dbatch example.makeflow
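For reference, Slurm reads a bare number given to --time as minutes, and a value of the form M:S as minutes and seconds, so --time=0:300 asks for 0 minutes and 300 seconds. The sketch below converts a duration into the unambiguous days-hours:minutes:seconds form that Slurm also accepts. It assumes the WALL_TIME value is meant as seconds, and the helper name seconds_to_slurm_time is made up for illustration; it is not part of makeflow or Slurm.

```shell
# Hypothetical helper: format a duration in seconds as D-HH:MM:SS,
# one of the time formats accepted by sbatch --time.
# Assumption: the input is a non-negative whole number of seconds.
seconds_to_slurm_time() {
    total=$1
    days=$(( total / 86400 ))
    hours=$(( (total % 86400) / 3600 ))
    minutes=$(( (total % 3600) / 60 ))
    seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

seconds_to_slurm_time 300     # prints: 0-00:05:00
seconds_to_slurm_time 1500    # prints: 0-00:25:00
```

The result could then be passed by hand, for example -B"--time=$(seconds_to_slurm_time 300)", while WALL_TIME is left out of the makeflow file.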
Hi, I got the correct results this time.
It showed some lines like this.
(base) [ma.m@slurm-login02 ~]$ /stornext/Home/data/allstaff/m/ma.m/cctools-7.4.3rc3-x86_64-centos7.tar.gz-dir/bin/makeflow -Tslurm -B'--time=0:300' -J1 -dbatch /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow
parsing /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow...
local resources: 32.000 cores, 193277 MB memory, 123482101 MB disk
max running remote jobs: 1
max running local jobs: 32
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `local_job_queue' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `absolute_path' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `output_directories' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `batch_log_name' to `%s.batchlog'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `gc_size' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: created queue 0x120bd60 (slurm)
2022/02/02 22:47:15.57 makeflow[29991] batch: set logfile to `/stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow.batchlog'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `batch-options' to `--time=0:300'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `password'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `manager-mode' to `standalone'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `name'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `debug'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `priority'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `keepalive-interval'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `keepalive-timeout'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `caching' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `wait-queue-size'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `amazon-config'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `lambda-config'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `working-dir'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `manager-preferred-connection'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `amazon-batch-config'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `amazon-batch-img'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `safe-submit-mode' to `no'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `ignore-mem-spec' to `no'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared option `mem-type'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `keep-wrapper-stdout' to `no'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `tlq-port' to `0'
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `fast-abort' to `-1.000000'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `local_job_queue' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `absolute_path' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `output_directories' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `batch_log_name' to `%s.batchlog'
2022/02/02 22:47:15.57 makeflow[29991] batch: set feature `gc_size' to `yes'
2022/02/02 22:47:15.57 makeflow[29991] batch: cleared feature `local_job_queue'
2022/02/02 22:47:15.57 makeflow[29991] batch: created queue 0x120e9b0 (local)
checking /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow for consistency...
/stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow has 1 rules.
creating new log file /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.makeflow.makeflowlog...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
2022/02/02 22:47:15.57 makeflow[29991] batch: set option `task-id' to `0'
submitting job: Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
2022/02/02 22:47:15.57 makeflow[29991] batch: sbatch --mem=10024M -N 1 -n 1 -c 4 -D . -e /dev/null --export=ALL -o /dev/null -J makeflow0 --time=0:300 ./slurm.wrapper
2022/02/02 22:47:15.59 makeflow[29991] batch: job 6477611 submitted
submitted job 6477611
2022/02/02 22:47:15.59 makeflow[29991] batch: set option `batch-options' to `--time=0:300'
2022/02/02 22:47:15.59 makeflow[29991] batch: could not open status file "slurm.status.6477611"
2022/02/02 22:47:16.59 makeflow[29991] batch: could not open status file "slurm.status.6477611"
2022/02/02 22:47:20.59 makeflow[29991] batch: job 6477611 complete
job 6477611 completed
nothing left to do.
2022/02/02 22:47:20.59 makeflow[29991] batch: deleting queue 0x120bd60
2022/02/02 22:47:20.59 makeflow[29991] batch: deleting queue 0x120e9b0
Is this correct this time?
Great! Thanks for your patience. I'll submit a fix for you to try very soon.
Sure, no worries! Thanks for your patience :)
The fix should be available here:
curl -O http://ccl.cse.nd.edu/software/files/cctools-7.4.3rc4-x86_64-centos7.tar.gz
tar xf cctools-7.4.3rc4-x86_64-centos7.tar.gz
export PATH=$(pwd)/cctools-7.4.3rc4-x86_64-centos7.tar.gz-dir/bin:$PATH
makeflow -Tslurm ...etc...
This should correctly process the WALL_TIME specifications.
Thanks for that! It works well this time. May I ask about another error (failed with exit code 137)? I googled it, and someone said that the DATABASE_URL environment variable should be set on the docker run command line. Is this true? If so, how do I set it up? Should I add it in Slurm or in my makeflow file?
starting workflow....
submitting job: Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/All_ppcseq.rds
submitted job 6482560
job 6482560 completed
Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/All_ppcseq.rds failed with exit code 137
2022/02/03 18:57:44.97 makeflow[9710] error: rule 0 failed, cannot move outputs
2022/02/03 18:57:44.97 makeflow[9710] error: hook Fail Dir:node_fail returned 1
deleted /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/All_ppcseq.rds
nothing left to do.
@Itachi-505 nice, that sounds like progress!
It does not look like you are using docker, but R? The exit code 137 usually means that some watchdog program terminated your process by sending a signal.
If it does not fail right away, a first thing to try is to increase the value of WALL_TIME (did you set it back to the original 1500?).
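As background on reading that number: by shell convention, an exit status above 128 means the process was terminated by a signal, and the signal number is the status minus 128. So 137 corresponds to signal 9, SIGKILL. A minimal sketch, using a hypothetical helper name describe_exit:

```shell
# Hypothetical helper: describe a shell-style exit status.
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # 128 + N conventionally means "terminated by signal N".
        signum=$(( status - 128 ))
        # kill -l N prints the signal name without the SIG prefix.
        printf 'killed by signal %d (SIG%s)\n' "$signum" "$(kill -l "$signum")"
    else
        printf 'exited with status %d\n' "$status"
    fi
}

describe_exit 137    # prints: killed by signal 9 (SIGKILL)
describe_exit 1      # prints: exited with status 1
```

The exit status alone does not say who sent the signal. On a batch system, exceeding the job's time or memory limit is a common cause, but that is an assumption to check against the scheduler's own records for the job.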
Yes, I am using R, not docker, and I changed my WALL_TIME back to 1500. Is it still too small for my job to run?
I'm not sure... How long do you expect your tasks to run? Does it fail right away? Could you try again and send me the output of:
makeflow -Tslurm -J1 -dall example.makeflow
Fixed in 7.4.3
Hi,
I ran into a problem running makeflow through Slurm. My makeflow works well with the default local system (makeflow example.makeflow), but it does not work through Slurm (makeflow -T slurm -J 200 example.makeflow).
The error was:
sbatch: error: Unable to open file M.
2022/01/27 16:03:11.75 makeflow[27470] notice: job submission failed: no output from slurm
couldn't submit batch job, still trying...
Here is my makeflow script:
CATEGORY=test1_for_input_reading
MEMORY=10024
CORES=4
WALL_TIME=1500
/stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt:
	Rscript /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1.R /stornext/HPCScratch/home/ma.m/test_makeflow_example/test1_output.txt
I am not sure what the file M is, and I have no idea how to fix this problem. Could you help me figure out the reason?