Parallel Run - Githubissues

USDA-VS / GenoFLU

Influenza data pipeline to automate genotyping assignment

GNU General Public License v3.0

18 stars, 2 forks

Parallel Run #3

Open mwylerCH opened 3 months ago

mwylerCH commented 3 months ago

Many thanks for developing the tool. I'm getting an error that is hard to reproduce. If I run GenoFLU as a single instance or sequentially, it runs through without problems. However, if I run several instances in parallel (for example on an HPC), I get the following issues:

creation of ${SEQNAME}_blast_hpia_genotyping_dir folders
presence of ${SEQNAME}.temp

where SEQNAME is the file name.

As a Perl monk I'm not too familiar with Python. Is it possible that GenoFLU uses some non-unique variable or file names? Many thanks

mwylerCH commented 3 months ago

Could the hpai_geno_db.n* files be a source of the issue? They appear in the HOME directory during each run. Maybe packing them into a temporary folder, or checking whether they already exist, would improve parallelization.

stuber commented 3 months ago

Yes, the BLAST files are likely the issue. As currently written, GenoFLU must have only one FASTA file per directory. An update may make it possible to run multiple files per directory in the future, but for now I recommend placing each FASTA file in its own directory, then looping over the directories and using the ampersand to run each one in the background.

# Package each FASTA file into its own directory (named after the file)
for i in *.fasta; do
    mkdir -p "${i%.fasta}"
    mv "$i" "${i%.fasta}/"
done

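# Run all directories at one time, each in its own background subshell
for d in */; do
    d=${d%/}                      # strip the trailing slash from the glob match
    (
        cd "$d" || exit 1         # the subshell keeps this cd local to one run
        echo "$d"
        genoflu.py -f *.fasta -n "$d"
    ) &
done
wait                              # return only after every background run ends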
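
The loop above starts every directory at the same time, which may oversubscribe a node when there are many FASTA files. A minimal sketch of a capped variant follows, assuming an `xargs` that supports `-0` and `-P` (GNU and BSD versions do). `MAX_JOBS`, `GENOFLU`, and `run_genoflu_dirs` are illustrative names introduced here, not GenoFLU options; only the `-f` and `-n` flags come from the command shown above.

```shell
# Sketch only: cap the number of simultaneous GenoFLU runs.
# MAX_JOBS and GENOFLU are illustrative variables, not part of GenoFLU.
MAX_JOBS=${MAX_JOBS:-4}           # how many runs may be active at once
GENOFLU=${GENOFLU:-genoflu.py}    # command to run in each directory
export GENOFLU                    # make it visible to the sh started by xargs

# Run GenoFLU once per subdirectory of the current directory,
# with at most MAX_JOBS runs active at any time.
run_genoflu_dirs() {
    for d in */; do
        # Emit NUL-terminated directory names; skip an unmatched glob
        [ -d "$d" ] && printf '%s\0' "${d%/}"
    done | xargs -0 -n 1 -P "$MAX_JOBS" sh -c '
        cd "$1" || exit 1
        "$GENOFLU" -f *.fasta -n "$1"
    ' _
}

# Usage, from the directory that holds the per-sample folders:
#   MAX_JOBS=8 run_genoflu_dirs
```

Each run still happens inside its own directory, so the one-FASTA-per-directory requirement is respected; the only change is that new runs start as earlier ones finish.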