Implementation of epa-ng doesn't allow for multiple threads

bowmanjeffs / paprica

paprica - PAthway PRediction by phylogenetIC plAcement


Implementation of epa-ng doesn't allow for multiple threads #73

Closed elijthomas closed 4 years ago

elijthomas commented 4 years ago

When running paprica > 4.1.0

If you run (hypothetically) 10 sequences in parallel from the same directory, each process writes its own epa_result.jplace file. Because this file name isn't namespaced in any way, there is no guarantee which process generated the file that gets renamed to query + '.' + ref + '.clean.unique.align.jplace'.

There is also the chance of these processes overwriting the previous file.

Although the chance is slim, this race condition can cause one process's epa-ng results to land in another sample's `clean.unique.align.jplace` file.

One suggestion would be to use the --outdir option of epa-ng to write the output temporarily to a folder named $query, mv the file to its final name, and then delete the temporary directory.
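A minimal sketch of that suggestion, not paprica's actual code. The function name `place_sample`, the `EPA_NG` override variable, the `.epa_tmp` suffix, and all of the input file names are hypothetical; only the epa-ng options and the epa_result.jplace output name are taken as given.

```shell
#!/bin/bash

# Sketch only: the input file names below are hypothetical placeholders,
# not the names paprica really uses.
place_sample() {
  local query="$1" ref="$2"
  local tmp_dir="${query}.epa_tmp"   # private output directory for this sample

  mkdir -p "$tmp_dir" || return 1

  # EPA_NG is a hypothetical override hook; it defaults to the epa-ng binary.
  "${EPA_NG:-epa-ng}" \
    --tree "${ref}.tree" \
    --ref-msa "${ref}.align.fasta" \
    --query "${query}.align.fasta" \
    --model "${ref}.model" \
    --outdir "$tmp_dir" || return 1

  # The result always has the same fixed name, so rename it per sample
  # only after it has been written to the private directory.
  mv "$tmp_dir/epa_result.jplace" \
     "${query}.${ref}.clean.unique.align.jplace" || return 1

  rm -rf "$tmp_dir"
}
```

Because each process only ever touches `$query.epa_tmp`, two samples running from the same directory never see each other's epa_result.jplace, provided the query names themselves are unique.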

bowmanjeffs commented 4 years ago

Eli, seems reasonable and I'll implement that solution in the very near future. To be clear, you experienced this problem when running paprica concurrently on multiple samples? Note that there isn't much of a performance gain for this. The recommendation is to run samples sequentially and let Infernal and epa-ng handle parallelization.

elijthomas commented 4 years ago

That's correct @bowmanjeffs, I ran my samples in parallel. You're correct that there is almost no benefit to running multiple samples in parallel before the epa-ng step, but what I found is that epa-ng uses very little CPU. Another sample can utilise this free CPU time, which reduces the overall processing time quite significantly for large sample sets.

I've made a 0.5.2 AMI on AWS using all Homebrew-installed libraries that I'll share with you once my samples are complete.

bowmanjeffs commented 4 years ago

Thanks Eli, look forward to it.

elijthomas commented 4 years ago

I worked around the concurrency issue by placing each sample in its own directory and running paprica in parallel from there. This generates both epa_info.log and epa_result.jplace independently in each sample's directory and avoids file contention problems.

#!/bin/bash

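call_paprica(){
  # Give each sample its own working directory so that epa-ng's fixed output
  # names (epa_info.log, epa_result.jplace) cannot collide between samples.
  mkdir -p ~/results/"$1".dir || return 1
  cp ~/experiment/"$1".fasta ~/results/"$1".dir/ || return 1
  cd ~/results/"$1".dir || return 1
  ~/paprica/paprica-run.sh "$1" bacteria # bacteria is the database to run against
}

mkdir -p ~/results

while IFS= read -r file_name
do
   call_paprica "$file_name" &
done < paprica.list.txt # list of the fasta files to process, one per line, with the .fasta extension removed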

wait
echo "All samples processed"

bowmanjeffs commented 4 years ago

Okay, epa-ng output is now created in a temporary folder. When the run is complete it's moved to the working directory and given a unique name.