bioinformatics-ufsc / AnnotaPipeline

Genome annotation pipeline
Apache License 2.0

Checkpoint system? #6

Open Artifice120 opened 2 months ago

Artifice120 commented 2 months ago

Afternoon,

I have attempted to use AnnotaPipeline on a SLURM cluster. It is running fine so far, except that I can only run a node continuously for 6 days at a time. In that time frame the BLAST searches cannot all finish, and the pipeline ends up starting again from the very beginning at AUGUSTUS.

Is there a way to have AnnotaPipeline skip to the last step that it left off on based on the raw output folders or just have it skip to a specified point in the process?

GuiMaia commented 1 month ago

Greetings,

Thanks for this suggestion!

We discussed two potential ways to address this issue: (1) adding an option to run DIAMOND, instead of BLAST, to perform the Similarity Analysis step; (2) adding a checkpoint system that checks output folders and files, and skips the respective step if the results are already there.

These should be implemented in a future release.
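
The second idea, a checkpoint that skips a step whose results already exist, could be sketched roughly as follows. This is only an illustration and not AnnotaPipeline code: `run_step`, `checkpoint_dir`, and the `.done` marker files are hypothetical names, and it assumes each pipeline step can be invoked as a single command.

```shell
#!/bin/bash
# Hypothetical checkpoint helper; NOT part of AnnotaPipeline.
# Usage: run_step <step_name> <command> [args...]

# Directory for marker files (assumed name; override with CHECKPOINT_DIR).
checkpoint_dir="${CHECKPOINT_DIR:-./checkpoints}"

run_step() {
    local name="$1"
    shift
    local marker="$checkpoint_dir/$name.done"

    mkdir -p "$checkpoint_dir"

    # A marker from an earlier run means the step already finished.
    if [ -f "$marker" ]; then
        echo "skipping $name (already completed)"
        return 0
    fi

    echo "running $name"
    # Create the marker only if the command succeeds, so a failed or
    # interrupted step is run again on the next invocation.
    if "$@"; then
        touch "$marker"
    else
        echo "step $name failed" >&2
        return 1
    fi
}
```

For instance, `run_step augustus ./run_augustus.sh` (a hypothetical wrapper script) would run once and then be skipped on later invocations until `checkpoints/augustus.done` is removed.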

Artifice120 commented 1 month ago

I have a script that runs the BLAST search for each contig in an array. Not sure if it is helpful...

#!/bin/bash

# Read the contig names from the file final-tigs into an array
tigs=( $(cat final-tigs) )

# Directory that holds the per-contig query (.fa) and result (.out) files
tmp_dir="/lustre/isaac/scratch/jtorre28/foxgloves/purged/purged2/tmp"

# Loop once for every element (@) of the array. The value of $tig is different
# on each iteration; all other variables are constant, except the timestamp,
# which is whatever the server clock currently says.
for tig in "${tigs[@]}" ; do
        echo "$tig"

        ## Remove empty placeholder files in the active directory
        find . -type f -empty -delete

        output="$tmp_dir/$tig.out"
        query="$tmp_dir/$tig.fa"

        ## If the output file for this contig exists, skip to the next contig
        if test -f "$output" ; then
                echo "$(date +%Y-%m-%d_%H:%M:%S) |     skipped $tig"
                continue
        fi

        ## Otherwise extract the single contig sequence, BLAST that contig, and echo the time of the search
        echo "$(date +%Y-%m-%d_%H:%M:%S) |     extracting $tig"
        # Print from the header matching $tig through the next FASTA header,
        # then drop that trailing header line
        sed -n "/$tig/,/^>/p" pilon-bubble-filter.fasta | head -n -1 > "$query"

        blastn -db nt \
               -query "$query" \
               -outfmt '6 qseqid qgi qacc sseqid sallseqid sgi sallgi sacc sallacc qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos frames qframe sframe btop staxids sscinames scomnames sblastnames sskingdoms stitle salltitles sstrand qcovs qcovhsp' \
               -max_target_seqs 10 \
               -max_hsps 1 \
               -evalue 1e-25 \
               -num_threads 48 \
               -out "$output"
done
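
One caveat with skipping on `test -f`: if the job is killed in the middle of a search, a partially written, non-empty `.out` file may be left behind and then treated as finished on the next run. A possible way around this, sketched here with a hypothetical helper named `run_to_file` and an assumed `.part` suffix, is to write to a temporary file and rename it only after the command succeeds.

```shell
#!/bin/bash
# Hypothetical helper (not from the script above): run a command that writes
# its results to stdout and publish them at <final_path> only on success.
# Usage: run_to_file <final_path> <command> [args...]
run_to_file() {
    local final="$1"
    shift
    local partial="$final.part"   # assumed suffix for unfinished output

    # A file at the final path is only ever created after a successful run.
    if [ -f "$final" ]; then
        echo "skipped $final"
        return 0
    fi

    if "$@" > "$partial"; then
        # A rename within the same directory replaces the name in one step,
        # so the final path never holds a half-written file.
        mv "$partial" "$final"
    else
        rm -f "$partial"
        return 1
    fi
}
```

In the loop this might be called as `run_to_file "$output" blastn -db nt -query "$query" ...` with the `-out` option left off, since blastn writes to standard output when `-out` is not given.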