clemgoub / TE-Aid

Annotation helper tool for the manual curation of transposable element consensus sequences
https://doi.org/10.1186/s13100-021-00259-7

Issue when running TE-Aid in parallel #15

Open manighanipoor opened 2 months ago

manighanipoor commented 2 months ago

Hi,

I need to run TE-Aid in parallel, but doing so causes errors because the processes share resources. I tried the command below on an HPC cluster (it copies TE-Aid to a temporary directory for each process so that they don't use the same database), but it does not work for all processes:

GENOME="../aipysurus_laevis.polished.fa" TEAID="/hpcfs/users/a1177955/local/TE-Aid/" parallel --bar --jobs 3 -a fasta_list.txt "mkdir -p ./tmp/{#}/TE-Aid && mkdir -p ./tmp/{#}/output && cp -ar $TEAID/ ./tmp/{#}/TE-Aid/ && ln -sf $(realpath $GENOME) ./tmp/{#}/genome_file && ./tmp/{#}/TE-Aid/TE-Aid -q {} -g ./tmp/{#}/genome_file -o ./tmp/{#}/output && mv ./tmp/{#}/output/ ./" && rm -r ./tmp/

This is what I got (it only worked for process 1 and gave an error for processes 2 and 3):

0% 0:3=0s fasta_3.fa
query: fasta_2.fa
ref genome: ./tmp/2/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 360 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
          1
Done! The graph (.pdf) can be found in the output folder: ./tmp/2/output
Warning message:
In file(file, "rt") :
  cannot open file './tmp/2/output/orftetable': No such file or directory
33% 1:2=31s fasta_3.fa
query: fasta_1.fa
ref genome: ./tmp/1/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
RepeatPeps is downloaded and formatted, blastp-ing...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 1582 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
null device
          1
Done! The graph (.pdf) can be found in the output folder: ./tmp/1/output
66% 2:1=11s fasta_3.fa
query: fasta_3.fa
ref genome: ./tmp/3/genome_file
TE -> genome blastn e-value: 10e-8
full length min ratio: 0.9
hits transparency: 0.3
full length hits transparency: 0.9
no ORF detected, skipping blastp...
[1] "R: ploting genome blastn results and computing coverage..."
[1] "consensus length: 541 bp"
[1] "R: ploting self dot-plot and orf/protein hits..."
[1] "no orf to plot..."
null device
          1
Done! The graph (.pdf) can be found in the output folder: ./tmp/3/output
Warning message:
In file(file, "rt") :
  cannot open file './tmp/3/output/orftetable': No such file or directory
100% 3:0=0s fasta_3.fa

Would you please let me know what the solution is?

Cheers, Mani

foriin commented 2 months ago

Hi Mani,

First of all, as far as I know, TE-Aid wasn't designed to run in parallel. Its basic output is a PDF plot that you have to inspect manually, which is not feasible for a multitude of TEs. In other words, TE-Aid was designed to take one specific consensus and give an overview of its structure and its representation in the genome.

Second, to maximize speed without running TE-Aid in parallel, and to avoid potential collisions, you could just loop over your fastas with a bash script while using the same output folder. If your files and the corresponding fasta headers all have different names, that should work fine, and you won't download/generate BLAST databases for each fasta. I haven't worked with X. laevis, but for Danio, whose genome is half the size, TE-Aid takes ~15 seconds per run once the databases are prepared, so it shouldn't be too bad for your clawed friend either. Anyhoo, I would just submit a bash script to your cluster that loops over your fastas:

#!/usr/bin/env bash
#SBATCH parameters or whatever HPC control system you have
GENOME=/path/to/genome

# Run TE-Aid on one fasta at a time; every run shares the same output folder
for fa in ./*.fasta
do
    TE-Aid -q "${fa}" -g "${GENOME}" -o output_folder
done
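For what it's worth, here is a slightly more defensive sketch of the same sequential idea. It first checks that no fasta header appears twice across the input files (the condition mentioned above for sharing one output folder), then runs the files one after another and notes which ones failed. The names TEAID_BIN, OUTDIR, duplicate_headers, run_all and failed.txt are made up for this illustration and are not part of TE-Aid; only the -q, -g and -o options come from the commands in this thread. It also assumes that TE-Aid exits with a non-zero status when a run fails, which I have not verified.

```shell
#!/usr/bin/env bash
# Sketch only: sequential TE-Aid runs into one shared output folder.
# TEAID_BIN, OUTDIR and failed.txt are hypothetical names, not part of TE-Aid.

TEAID_BIN="${TEAID_BIN:-TE-Aid}"      # TE-Aid launcher, assumed to be on PATH
GENOME="${GENOME:-/path/to/genome}"   # reference genome, as in the loop above
OUTDIR="${OUTDIR:-output_folder}"     # single output folder shared by all runs

# Print every fasta header line that occurs more than once across the files.
# No output means all headers are unique.
duplicate_headers() {
    grep -h '^>' "$@" | sort | uniq -d
}

# Run TE-Aid on each file in turn. Files whose run returns a non-zero exit
# status (an assumption about TE-Aid's behaviour) are listed in failed.txt.
run_all() {
    local fa
    mkdir -p "$OUTDIR"
    : > "$OUTDIR/failed.txt"
    for fa in "$@"; do
        "$TEAID_BIN" -q "$fa" -g "$GENOME" -o "$OUTDIR" \
            || echo "$fa" >> "$OUTDIR/failed.txt"
    done
}
```

After sourcing the file, you would call duplicate_headers ./*.fasta first and, if it prints nothing, run_all ./*.fasta. Because everything still runs one file at a time, the BLAST databases are only prepared once and nothing competes for them.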

And thirdly, the formatting of the parallel command in your question is broken, which makes it harder to read and understand.

Cheers, Artem

manighanipoor commented 2 months ago

Hi, thanks, I was able to resolve the issue.