cooperative-computing-lab / makeflow-examples

Example workflows for the Makeflow workflow system.

makeflow failed #45

Closed Kirito-Ma closed 2 years ago

Kirito-Ma commented 2 years ago

Hi, I have a problem: my makeflow does not work. I tested the script in an R terminal, where it works well. However, when I tried to run it with makeflow on Slurm, I got this error:

parsing /stornext/HPCScratch/home/ma.m/test.makeflow...
2022/02/09 22:12:38.02 makeflow[11201] fatal: Found end of file while completing command. line: 6 column: 293
Terminated

Here is the code of my makeflow: CATEGORY=test_slurm MEMORY=10024 CORES=4 WALL_TIME=1500 /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Secretory_ppcseq.rds: Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Secretory_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Secretory_ppcseq.rds

I think the makeflow did not run at all. How can I fix it?

btovar commented 2 years ago

@Kirito-Ma, in makeflow, spaces and newlines are significant characters. I think you want:

CATEGORY=test_slurm
MEMORY=10024
CORES=4
WALL_TIME=1500
/stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Secretory_ppcseq.rds:
        Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Secretory_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Secretory_ppcseq.rds

btovar commented 2 years ago

I think you were missing the tab before Rscript and the newline at the end of the file.
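
Both whitespace requirements can be checked mechanically. The following is a minimal sketch, not an official Makeflow tool: it writes a small makeflow with printf so that the tab and the final newline are explicit, then checks for both. The file names (example.makeflow, run.R, input.rds, output.rds) are hypothetical placeholders.

```shell
# Hypothetical file names, for illustration only.
workdir=$(mktemp -d)
mf="$workdir/example.makeflow"

# printf makes the leading tab (\t) and the final newline (\n) explicit.
printf 'CATEGORY=test_slurm\nMEMORY=10024\nCORES=4\nWALL_TIME=1500\n' > "$mf"
printf 'output.rds: run.R input.rds\n' >> "$mf"
printf '\tRscript run.R input.rds output.rds\n' >> "$mf"

# Number of lines that start with a tab (the command lines).
tab=$(printf '\t')
tab_lines=$(grep -c "^$tab" "$mf")

# Command substitution strips a trailing newline, so an empty result
# means the last byte of the file is a newline.
last_byte=$(tail -c 1 "$mf")
if [ -z "$last_byte" ]; then ends_with_newline=yes; else ends_with_newline=no; fi

echo "tab-indented lines: $tab_lines, ends with newline: $ends_with_newline"
```

Running the same two checks against your own makeflow file should show at least one tab-indented line per rule and a trailing newline.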

Itachi-505 commented 2 years ago

Hi, thanks a lot. May I ask about another problem? The situation is similar: my Rscript runs well in the terminal but fails under Slurm. The error looks like this:

Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Basal_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Basal_ppcseq.rds failed with exit code 137
deleted makeflow.failed.1
rule 1 failed, moving any outputs to makeflow.failed.1
deleted /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Basal_ppcseq.rds

btovar commented 2 years ago

Does it fail right away?

When running on a terminal, could you do a:

resource_monitor  -Omon  -- Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Basal_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Basal_ppcseq.rds

and post the contents of the file mon.summary generated?

Itachi-505 commented 2 years ago

Here is the error I get in the terminal:

bash: resource_monitor: command not found

btovar commented 2 years ago

It should be in the same place as the makeflow executable. From where are you executing makeflow? The following should work:

# needed only once:
curl -O http://ccl.cse.nd.edu/software/files/cctools-7.4.3-x86_64-centos7.tar.gz
tar xf cctools-7.4.3-x86_64-centos7.tar.gz

# every time:
export PATH=$(pwd)/cctools-7.4.3-x86_64-centos7.tar.gz-dir/bin:$PATH

resource_monitor  -Omon  -- Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Basal_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Basal_ppcseq.rds
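
To illustrate why the export line matters, here is a small sketch of how prepending a directory to PATH makes a command findable. It uses a throwaway directory and a stand-in script named fake_tool (both hypothetical) instead of the real cctools bin directory.

```shell
# Throwaway directory standing in for the cctools bin directory (hypothetical).
bindir=$(mktemp -d)

# Stand-in executable; the real case would be resource_monitor.
printf '#!/bin/sh\necho "fake_tool ran"\n' > "$bindir/fake_tool"
chmod +x "$bindir/fake_tool"

# Before the export, the shell cannot find the command.
if command -v fake_tool >/dev/null 2>&1; then before=found; else before=missing; fi

# Same pattern as the export line above: prepend the directory to PATH.
export PATH="$bindir:$PATH"

# After the export, the command resolves and can be run by name.
if command -v fake_tool >/dev/null 2>&1; then after=found; else after=missing; fi
output=$(fake_tool)

echo "before: $before, after: $after, output: $output"
```

With the real tools, running `command -v resource_monitor` after the export is a quick way to confirm that the shell can find it.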

btovar commented 2 years ago

Could you open the file mon.summary, select everything, and copy the contents here? (The file should have been generated by the above command.)

Itachi-505 commented 2 years ago

Hi, thanks.

{
    "executable_type":"dynamic",
    "monitor_version":"7.4.3.",
    "host":"milton-login02.hpc.wehi.edu.au",
    "command":"Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Basal_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Basal_ppcseq.rds",
    "exit_status":0,
    "exit_type":"normal",
    "start": [ 1644415133.555023, "s" ],
    "end": [ 1644415588.830632, "s" ],
    "wall_time": [ 455.275609, "s" ],
    "cpu_time": [ 458.49, "s" ],
    "memory": [ 17246, "MB" ],
    "virtual_memory": [ 19713, "MB" ],
    "swap_memory": [ 0, "MB" ],
    "disk": [ 2627, "MB" ],
    "bytes_read": [ 80, "MB" ],
    "bytes_written": [ 73, "MB" ],
    "bytes_received": [ 0, "MB" ],
    "bytes_sent": [ 0, "MB" ],
    "bandwidth": [ 0, "Mbps" ],
    "gpus": [ 0, "gpus" ],
    "cores": [ 1.195, "cores" ],
    "cores_avg": [ 1.007, "cores" ],
    "machine_cpus": [ 32, "cores" ],
    "machine_load": [ 1, "procs" ],
    "context_switches": [ 12984, "switches" ],
    "max_concurrent_processes": [ 4, "procs" ],
    "total_processes": [ 9, "procs" ],
    "total_files": [ 40676, "files" ],
    "fs_nodes": [ 0, "nodes" ],
    "workers": [ 0, "workers" ],
    "peak_times": {
        "total_files": [ 12.001, "s" ],
        "max_concurrent_processes": [ 20.65, "s" ],
        "context_switches": [ 455.276, "s" ],
        "machine_load": [ 123.017, "s" ],
        "machine_cpus": [ 4.447, "s" ],
        "cores_avg": [ 424.745, "s" ],
        "cores": [ 432.562, "s" ],
        "bytes_sent": [ 36.596, "s" ],
        "bytes_received": [ 36.596, "s" ],
        "bytes_written": [ 385.025, "s" ],
        "bytes_read": [ 385.025, "s" ],
        "disk": [ 12.001, "s" ],
        "virtual_memory": [ 385.025, "s" ],
        "memory": [ 385.025, "s" ],
        "cpu_time": [ 450.317, "s" ],
        "wall_time": [ 455.276, "s" ],
        "end": [ 455.276, "s" ],
        "start": [ 0, "s" ]
    }
}

btovar commented 2 years ago

Great! I think we are getting somewhere. It seems that your program uses more memory than you requested (MEMORY=10024), and is therefore eventually killed by Slurm. If you look above, you'll see:

"memory": [ 17246, "MB" ],

I would try changing the memory line in your makeflow to MEMORY=20000 and see if that works.
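
The reasoning can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the sample file below is a cut-down stand-in for mon.summary that keeps a few fields with the values from the summary above, and the grep-based extraction is a quick hack rather than a proper JSON parser. An exit code of 137 is consistent with the process receiving SIGKILL, which is what one would expect if the memory limit were enforced, but the sketch does not prove that Slurm was the one that sent it.

```shell
# Cut-down stand-in for mon.summary; values copied from the summary above.
summary=$(mktemp)
cat > "$summary" <<'EOF'
{ "exit_status":0, "memory": [ 17246, "MB" ], "virtual_memory": [ 19713, "MB" ],
  "peak_times": { "memory": [ 385.025, "s" ] } }
EOF

requested_mb=10024   # MEMORY line of the original makeflow
exit_code=137        # exit code reported by makeflow

# Drop spaces and newlines, then take the first "memory" entry
# (the later one under peak_times is a time, not a size).
peak_mb=$(tr -d ' \n' < "$summary" \
    | grep -o '"memory":\[[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$')

# By shell convention, an exit code above 128 means "killed by signal (code - 128)".
signal_name=$(kill -l $((exit_code - 128)))

if [ "$peak_mb" -gt "$requested_mb" ]; then
    verdict="exceeds request by $((peak_mb - requested_mb)) MB"
else
    verdict="within request"
fi
echo "peak: $peak_mb MB, requested: $requested_mb MB ($verdict), signal: $signal_name"
```

Choosing a MEMORY value somewhat above the measured peak, as with MEMORY=20000 here, leaves headroom for run-to-run variation.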

Itachi-505 commented 2 years ago

Hi btovar,

Thanks a lot. It works this time!

Itachi-505 commented 2 years ago

Hi btovar,

I have one more question. In my Rscript, I get a dataframe. I would like the script to do nothing with it and not save it to rds, so I create an empty tibble but do not save it to rds. However, this always causes the Slurm job to fail. Why is this the case, and how can I fix it?

btovar commented 2 years ago

Nice, good news!

For your error, we can capture the error output as follows. Modify your makeflow rule to be something like:

CATEGORY=test_slurm
MEMORY=20000
CORES=4
WALL_TIME=1500
/stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Secretory_ppcseq.rds ERROR-1.output:
        Rscript /stornext/HPCScratch/home/ma.m/mengyao_data_scripts/COVID_19/run_ppcseq/run_ppcseq.R /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_de_data/Secretory_DE.rds /stornext/HPCScratch/home/ma.m/single_cell_database/COVID_19/data/all_ppcseq_data/Secretory_ppcseq.rds  > ERROR-1.output 2>&1

That is, we add the ERROR-1.output file as an output, and append > ERROR-1.output 2>&1 at the end of the command line. Once the workflow fails, you can open ERROR-1.output and see the exact error you are getting from R.
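
To see what the redirection does in isolation, here is a minimal sketch. The function failing_command is a hypothetical stand-in for the Rscript call; it writes one line to standard output, one line to standard error, and then fails.

```shell
workdir=$(mktemp -d)
log="$workdir/ERROR-1.output"

# Hypothetical stand-in for the failing Rscript command.
failing_command() {
    echo "normal output"
    echo "error message" >&2
    return 1
}

# Same redirection as in the rule: standard output goes to the file,
# then standard error is pointed at the same place.
status=0
failing_command > "$log" 2>&1 || status=$?

echo "exit status: $status"
cat "$log"
```

The order of the two redirections matters: writing `2>&1 > ERROR-1.output` instead would point standard error at wherever standard output was going before it was redirected (for example the terminal), so the error message would not end up in the file.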

Itachi-505 commented 2 years ago

Hi, thanks! I solved the problem by following your instructions.

btovar commented 2 years ago

Thanks for letting us know!