EBI-Metagenomics / emg-viral-pipeline

VIRify: detection of phages and eukaryotic viruses from metagenomic and metatranscriptomic assemblies
Apache License 2.0

processing multiple inputs in parallel? #73

Closed shenwei356 closed 2 years ago

shenwei356 commented 2 years ago

Hi team, it's thrilling to have successfully run VIRify for the first time!

I found that CPU usage is low while hmmsearch is running, so I'd like to run multiple VIRify processes in parallel. However, an error occurred:

Unknown error accessing project `EBI-Metagenomics/emg-viral-pipeline` -- Repository may be corrupted: 
/home/shenwei/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline

So I have to process the samples one by one, which has produced no errors so far.

Does VIRify (Nextflow) support processing multiple inputs in parallel? I think it does, because there's a --cores option:

--cores             max cores per process for local use [default: 40]
--max_cores         max cores per machine for local use [default: 160]

Wei


Command:

j=1     # number of VIRify processes to run at once
J=40    # number of CPU cores for each VIRify process
mem=100 # max memory

db=~/ws/db/virify
sg=~/app/singularity
reads=assembly
ete3tax=ete3_ncbi_tax.sqlite
time fd ^contigs.fasta$ $reads/ \
    | grep -v work \
    | rush -j $j -v j=$J -v mem=$mem -v db=$db -v sg=$sg -v ete3tax=$ete3tax \
        'nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.4.0 \
            --databases {db} --cachedir {sg} --ncbi {ete3tax} \
            --fasta {} --workdir {/}/work --output {/}/virfy \
            -profile local,singularity --memory {mem} --cores {j} ' \
        -c -C virify.rush --verbose

mberacochea commented 2 years ago

Hi @shenwei356,

The pipeline was designed to process only one input FASTA at a time, but you should be able to run as many instances of VIRify as you want.

The error you are getting is related to Nextflow, not VIRify (as far as I can tell). Things you could try:

Cheers

shenwei356 commented 2 years ago

Hi Martín, thanks for your rapid reply.

I've tried cleaning and re-pulling the pipeline, but the same error occurred. Moreover, after an attempt with multiple instances, even a single instance fails to run, so I have to re-pull the pipeline before I can run one instance again.

> Not sure if related, but check that each instance of VIRify is using a different work directory

Confirmed.

BTW, the server has an old Java version:

NOTE: Nextflow is not tested with Java 1.8.0_332 -- It's recommended the use of version 11 up to 18

hoelzer commented 2 years ago

Hey @shenwei356 thanks for your interest in the pipeline! I'm not sure if I fully get what you are trying to do. But just in general and as also mentioned by @mberacochea:

I'm not sure whether the old Java version might also cause issues.

shenwei356 commented 2 years ago

> you can only start one nextflow pipeline from the same directory simultaneously. Nextflow is writing a hidden .nextflow.log file and therefore starting multiple nextflow commands from the same directory will cause errors.

I see, but I specify --workdir, which points to different places (where the inputs are). If that's true, the right way would be to change into the directory containing the input file before running the pipeline.

hoelzer commented 2 years ago

> > you can only start one nextflow pipeline from the same directory simultaneously. Nextflow is writing a hidden .nextflow.log file and therefore starting multiple nextflow commands from the same directory will cause errors.
>
> I see, but I specify --workdir, which points to different places (where the inputs are). If that's true, the right way would be to change into the directory containing the input file before running the pipeline.

Yeah, --workdir (or -w for short) will not help you. It only changes where the working directories are written when the pipeline's processes run. The hidden .nextflow.log file is still written to the directory from which you start the pipeline. No other pipeline can then be started from the same folder while the first one is still running and locking the .nextflow.log file, at least as far as I know. So yes, what you need to do is create a directory for each run, cd into that directory, and then execute nextflow.
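
The "one directory per run" advice above can be sketched in shell roughly as follows. This is only a minimal sketch: `run_in_own_dir` is a hypothetical helper name (it is not part of Nextflow or VIRify), and the commented usage reuses the options from the command earlier in this thread with placeholder paths. Whether this alone resolves the "Repository may be corrupted" error is not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nextflow or VIRify): run a command from
# its own launch directory, so files written to the launch directory
# (such as the hidden .nextflow.log) are never shared between runs.
run_in_own_dir() {
    local run_dir=$1
    shift
    # Create the per-run launch directory if it does not exist yet.
    mkdir -p "$run_dir" || return 1
    # The subshell keeps the cd from changing the caller's directory.
    ( cd "$run_dir" && "$@" )
}

# Assumed usage with the command from this thread. All paths are
# placeholders and should be absolute, because each run starts in a
# different directory:
#
# for fasta in /path/to/assembly/*/contigs.fasta; do
#     run_in_own_dir "$(dirname "$fasta")/virify_run" \
#         nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.4.0 \
#             --databases /path/to/db --cachedir /path/to/singularity \
#             --ncbi /path/to/ete3_ncbi_tax.sqlite \
#             --fasta "$fasta" --workdir work --output virify \
#             -profile local,singularity &
# done
# wait
```

Because every instance is launched from a different directory, relative values such as `work` and `virify` resolve inside that run's own directory, while inputs and databases need absolute paths.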

shenwei356 commented 2 years ago

Thank you all!