franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

error in rule megahit when running two or more jobs in parallel #26

Closed · matrs closed this issue 3 years ago

matrs commented 3 years ago

Hello, thanks for this pipeline, it's been very useful. I found an error when running the rule megahit with two or more jobs in parallel. At line 232 of the Snakefile:

    ....
    -2 $(basename {input.R2}) \
    -o tmp;
    echo "done. "

That `-o tmp` makes megahit complain and stop, because that file/folder already exists. I solved the problem by defining an output name that depends on the sample name:

    # This is inside the shell command
    out_dir=$(basename {input.R1} _R1.fastq.gz)
    # then I use that as the output name
    ....
    -2 $(basename {input.R2}) \
    -o $out_dir;
    echo "done. "
franciscozorrilla commented 3 years ago

Hi Jose,

Indeed, that should fix the problem in your situation! I suspect that part of the issue also stems from the fact that your scratch/ path in the config.yaml file is likely pointing to a single directory; is this correct?

On the clusters I have used to develop metaGEM there is generally a variable called something like $TMPDIR or $SCRATCH, which points to a job-specific directory created for each submitted job. This means you can use the same variable in the Snakefile, and each job will be given a unique storage location by the scheduler/cluster.

Does your cluster have such a variable? If so, then you can set your scratch/ path in the config.yaml file as shown below to avoid having to modify other rules that make use of the scratch/ directory.

    scratch: $YOUR_CLUSTER_TMPDIR
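
For example, on clusters where the scheduler exports a per-job temporary directory (commonly $TMPDIR on SLURM or PBS systems, though this varies by site), the setting might look like:

    # Hypothetical example: assumes the scheduler sets $TMPDIR per job
    scratch: $TMPDIR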

Thanks for reporting, I will update the documentation to elaborate on the usage of the scratch path.

Best wishes, Francisco

matrs commented 3 years ago

Hello Francisco, I didn't know that a unique directory is created for each job when submitting to a $SCRATCH partition; that explains why nobody has complained about this before. On this particular cluster there is no scratch partition, so no $SCRATCH is defined. The /tmp directory works as on any Linux system and is also rather small, so I defined tmp to be a directory in my $HOME in the JSON config (on this cluster, /home is a local file system).

Thank you for your help.

Jose Luis

franciscozorrilla commented 3 years ago

Yes, unfortunately it can be a bit difficult to build readily usable/deployable pipelines when clusters tend to be quite idiosyncratic.

I am slightly concerned about your situation: when you submit jobs in parallel further downstream in the analysis (e.g. see the Snakefile rule crossMap), you will have multiple jobs trying to use the same directory, and this will cause errors. At the moment I see three potential solutions:

  1. The cleanest and easiest solution for you is probably to create a job-specific subdirectory at the start of each job (within the scratch/ directory); see the sketch after this list.
  2. Alternatively, you could simply remove/comment out the lines of code within the shell section of the Snakefile rules that move files into the scratch dir.
  3. The most annoying option would be to leave everything as is and submit jobs in series, but of course this defeats the purpose of using the cluster.
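
For reference, here is a minimal sketch of what solution 1 could look like at the top of a rule's shell block. The `{config[path][scratch]}` placeholder is hypothetical and should match however the Snakefile actually expands the scratch path from config.yaml:

    # Sketch: create a job-specific subdirectory inside scratch/ at job start.
    # NOTE: {config[path][scratch]} is a hypothetical placeholder for the
    # scratch path from config.yaml; adjust to the actual config key.
    job_dir=$(mktemp -d {config[path][scratch]}/job_XXXXXX)
    cd $job_dir
    # ... run the rule's actual commands here ...
    # optionally clean up when the job finishes:
    # rm -rf $job_dir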

I will implement solution 1 in the Snakefile as soon as I get the chance. This would fix the problem for users that don't have a job-specific $SCRATCH or $TMPDIR variable, while also not causing problems for users that do have that job-specific variable.

matrs commented 3 years ago

Thank you very much, I'll check the next steps in the pipeline over the following days and implement one of the solutions you suggested.

Thanks!