NBISweden / pipelines-nextflow

A set of workflows written in Nextflow for Genome Annotation.
GNU General Public License v3.0

AugustusTraining add extra steps #2

Closed by Juke34 4 years ago

Juke34 commented 4 years ago

It would be nice to add the Augustus training steps at the end of the workflow (below, asecodes_parviclava stands for the species parameter set in the workflow):

new_species.pl --species=asecodes_parviclava
etraining --species=asecodes_parviclava outdir/TrainingData/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.train
augustus --species=asecodes_parviclava output.gbk.test | tee run.log
augustus --species=asecodes_parviclava TestingData/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.test | tee run.log

This requires Augustus and the path to the shared profile folder.
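
The requirement above could be checked up front with something like the following. This is only a minimal sketch: `check_augustus_env` is a hypothetical helper, and it assumes the profile folder is the one Augustus locates through its `AUGUSTUS_CONFIG_PATH` environment variable, with the species profiles stored under `species/`.

```shell
#!/bin/bash
# check_augustus_env is a hypothetical helper, not part of Augustus or this pipeline.
check_augustus_env() {
    local tool
    # The three commands used in the training steps must be on PATH.
    for tool in augustus etraining new_species.pl; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing required tool: $tool" >&2
            return 1
        fi
    done

    # Assumption: profiles live in $AUGUSTUS_CONFIG_PATH/species.
    if [ -z "${AUGUSTUS_CONFIG_PATH:-}" ] || [ ! -d "$AUGUSTUS_CONFIG_PATH/species" ]; then
        echo 'AUGUSTUS_CONFIG_PATH is unset or has no species/ directory' >&2
        return 1
    fi

    # new_species.pl creates a new profile there, so the folder must be writable.
    if [ ! -w "$AUGUSTUS_CONFIG_PATH/species" ]; then
        echo "Profile folder is not writable: $AUGUSTUS_CONFIG_PATH/species" >&2
        return 1
    fi
}
```

Calling `check_augustus_env || exit 1` before `new_species.pl` would make a missing tool or an unwritable profile folder fail early, with a readable message.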

mahesh-panchal commented 4 years ago

Where is the output.gbk.test from?

Line 2 uses the training set output from gbk2augustus.
Line 3 uses the test set output from where?
Line 4 uses the test set output from gbk2augustus.

process augustus_training {

    tag "$species"
    label 'Augustus'
    publishDir "${params.outdir}/Augustus_training", mode: 'copy'

    input:
    path training_file
    path test_file
    each species

    output:
    path "${species}_run.log"

    script:
    """
    new_species.pl --species=$species
    etraining --species=$species $training_file
    augustus --species=$species $test_file | tee ${species}_run.log
    augustus --species=$species $test_file | tee -a ${species}_run.log
    """

}

Juke34 commented 4 years ago

Sorry, I made a mistake:

new_species.pl --species=asecodes_parviclava
etraining --species=asecodes_parviclava outdir/TrainingData/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.train
augustus --species=asecodes_parviclava TestingData/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.test | tee run.log

So you can remove line 3.
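
The corrected three-step sequence could be wrapped roughly as follows. This is a sketch, not the pipeline's implementation: `train_augustus` and the `DRY_RUN` switch are hypothetical names introduced here so the commands can be inspected without running Augustus.

```shell
#!/bin/bash
# train_augustus and DRY_RUN are hypothetical, for illustration only.
train_augustus() {
    local species=$1 train_gbk=$2 test_gbk=$3
    if [ -z "$species" ] || [ -z "$train_gbk" ] || [ -z "$test_gbk" ]; then
        echo 'Usage: train_augustus <species> <train.gbk> <test.gbk>' >&2
        return 1
    fi

    # With DRY_RUN=1 the commands are printed instead of executed.
    local run=
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run=echo
    fi

    # 1. Create the new species profile.
    $run new_species.pl --species="$species" || return 1
    # 2. Train on the training set.
    $run etraining --species="$species" "$train_gbk" || return 1
    # 3. Predict on the held-out test set and keep the log.
    if [ -n "$run" ]; then
        echo "augustus --species=$species $test_gbk | tee ${species}_run.log"
    else
        augustus --species="$species" "$test_gbk" | tee "${species}_run.log"
    fi
}
```

Running `DRY_RUN=1 train_augustus asecodes_parviclava <train.gbk> <test.gbk>` prints the three commands in order, which makes it easy to confirm that the extra test-set line is gone.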

Juke34 commented 4 years ago

I'm currently testing the Augustus training at the end. The next step would be to add extra steps to train SNAP too. See here. But that needs two utility scripts, one from MAKER and the other from GAAS. I could put the MAKER one in GAAS.

If we do so, we should also rename the pipeline to AbinitioTraining, I guess.

mahesh-panchal commented 4 years ago

Did you add the Maker script to GAAS?

Juke34 commented 4 years ago

I forgot one. gaas_snap_train.sh is there now, but maker2zff is still missing. And gaas_snap_train.sh is not really needed: we can just install SNAP with conda and run:

#!/bin/bash

NAME=$1

if [ -z "$NAME" ]
then
    echo "Must provide a name!"
else
    fathom -categorize 1000 genome.ann genome.dna
    fathom -export 1000 -plus uni.ann uni.dna
    forge export.ann export.dna
    hmm-assembler.pl $NAME . > $NAME.hmm
fi

kusalananda commented 4 years ago

Suggestion: quote the variable expansions and separate the error handling from the main processing of the script:

#!/bin/bash

NAME=$1

if [ -z "$NAME" ]; then
        echo 'Must provide a name!' >&2
        exit 1
fi

fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl "$NAME" . >"$NAME.hmm"
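
To illustrate why the quoting matters, here is a small self-contained sketch. `count_args` is a hypothetical helper that only reports how many arguments it received, and the name containing a space is made up for the example.

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints the number of arguments it got.
count_args() {
    echo "$#"
}

NAME='asecodes parviclava'   # made-up name containing a space

# Unquoted: the shell splits the value on whitespace into two words.
unquoted=$(count_args $NAME)

# Quoted: the value is passed through as a single word.
quoted=$(count_args "$NAME")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

With such a name, an unquoted `$NAME` would hand hmm-assembler.pl two separate arguments, and bash would reject the redirection `> $NAME.hmm` as ambiguous. Quoting keeps the name intact in both places.
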
Juke34 commented 4 years ago

I have re-implemented maker2zff in AGAT 0.2.3; it is called agat_converter_sp_gff2zff.pl. So we can skip this step:

echo "##FASTA" >> annotation.gff
cat genome.fa >> annotation.gff

Instead, it will be something like this:

cp augustus_training_result/BlastFilteredGFF/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gff3 annotation.gff
ln -s genome/genome.fa genome.fa

# This should produce two files – genome.ann and genome.dna
agat_converter_sp_gff2zff.pl --gff annotation.gff --fasta genome.fa -o genome

# snap_train.sh is in GAAS but you can do it step by step as described previously
snap_train.sh <species_name>

The resulting file will be species_name.hmm.
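
Since each step depends on files written by the previous one, a small guard between steps may help. The following is a sketch under that assumption; `require_nonempty` is a hypothetical helper, not part of AGAT, GAAS or SNAP.

```shell
#!/bin/bash
# require_nonempty is a hypothetical helper: it fails if any listed file
# is missing or has zero size.
require_nonempty() {
    local f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Expected a non-empty file: $f" >&2
            return 1
        fi
    done
}
```

For example, `require_nonempty genome.ann genome.dna` could run right after agat_converter_sp_gff2zff.pl, and the same check on the .hmm file after the training step, so that an empty intermediate file stops the run instead of silently producing an unusable model.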