NBISweden / pipelines-nextflow

A set of workflows written in Nextflow for Genome Annotation.
GNU General Public License v3.0
43 stars 18 forks source link

Add Tiny test datasets #1

Closed Juke34 closed 4 years ago

Juke34 commented 4 years ago

From @mahesh-panchal: The nextflow pipelines need small data sets that take less than 5 minutes to run.

Juke34 commented 4 years ago

potential dataset from nf-core preprocessing pipeline: https://github.com/nf-core/test-datasets/blob/rnaseq/reference/genome.fa

assembly pipeline: https://github.com/nf-core/test-datasets/tree/rnaseq/testdata + https://github.com/nf-core/test-datasets/blob/rnaseq/reference/genome.fa

AbinitioTraining; probably https://github.com/nf-core/test-datasets/blob/rnaseq/reference/genes.gff + https://github.com/nf-core/test-datasets/blob/rnaseq/reference/genome.fa

mahesh-panchal commented 4 years ago

Annotation preprocessing workflow: Data is suitable. Augustus training workflow: Data is unsuitable. Functional annotation workflow: Data is unsuitable. Transcript assembly workflow: Data is single end only.

Juke34 commented 4 years ago

Why the data is unsuitable for the Augustus training workflow? What is wrong with data? Too large?

mahesh-panchal commented 4 years ago

It's not the correct format. It throws an error.

mahesh-panchal commented 4 years ago

Maybe use some sequence data simulation tools and take a 2kb stretch of genome and make the necessary data files.

Juke34 commented 4 years ago

Using AGAT v0.1.1 now we can use the weird gff file stored by nf-core.

Juke34 commented 4 years ago

What about Functional annotation pipeline? what is the problem with the data?

mahesh-panchal commented 4 years ago

On branch add_test_profile :

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/TranscriptAssembly 👍

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/AnnotationPreprocessing 👍

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/AugustusTraining 👎 Error (shortened):

Error executing process > 'augustus_training_dataset:split_maker_evidence (genes)'                                                 
Caused by:                                                                                                   
                       Missing output file(s) `maker_results_noAbinitio_clean/mrna.gff` expected by process `augustus_training_dataset:split_maker_evidence (genes)`

Command executed:

  agat_sp_split_by_level2_feature.pl -g genes.gff -o maker_results_noAbinitio_clean

Command exit status:
  0

Command output:
  GFF3 file parsed
  Job done in 1 seconds

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/FunctionalAnnotationPreparation 👎 Error (shortened):

21/02/2020 16:31:49:652 Welcome to InterProScan-5.38-76.0
21/02/2020 16:31:49:654 Running InterProScan v5 in STANDALONE mode... on Linux
21/02/2020 16:32:06:418 Loading file /scratch/nxf.AP8XJ4hzfI/genes_proteins.1.fasta
21/02/2020 16:32:06:421 Running the following analyses:
[CDD-3.17,Coils-2.2.1,Gene3D-4.2.0,Hamap-2019_01,MobiDBLite-2.0,PANTHER-14.1,Pfam-32.0,PIRSF-3.02,PRINTS-42.0,ProSitePatterns-2019_01,ProSiteProfiles-2019_01,SFLD-4,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0]
Pre-calculated match lookup service DISABLED.  Please wait for match calculations to complete...
ERROR: uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener - Execution thrown when attempting to executeInTransaction the StepExecution.  All database activity rolled back.
java.lang.IllegalArgumentException: 'sequence' is not an amino acid sequence []

Erroneous file: /home/mahpa906/Nextflow/NBIS_annotation_pipeline_test/work/88/6cf398d41520b862b2022f0b38c0f0/genes_proteins.1.fasta There appears to be empty sequence present.

mahesh-panchal commented 4 years ago

The GFF file has 119 CDS entries, but the fasta only has 116 sequences with the last sequence being empty. Is it a bug in agat_sp_extract_sequences.pl?

Juke34 commented 4 years ago

So the problem is that we have transcript features and the pipeline expects mrna one. So the file out from the first step is called then transcript.gff while the pipeline expects mrna.gff. As both are really common I suggest to add a piece of code in the pipeline: if both exist, I guess the best would be to concatenate them, but if only one of them exists use this one indifferently. Is it possible?

mahesh-panchal commented 4 years ago

So do you mean add this to the script?

if test -f maker_results_noAbinitio_clean/mrna.gff && test -f maker_results_noAbinitio_clean/transcripts.gff; then
    cat maker_results_noAbinitio_clean/{mrna,transcripts}.gff > mrna_transcripts.gff
elif test -f maker_results_noAbinitio_clean/mrna.gff; then
    cp maker_results_noAbinitio_clean/mrna.gff mrna_transcripts.gff
elif maker_results_noAbinitio_clean/transcripts.gff; then
    cp maker_results_noAbinitio_clean/transcripts.gff mrna_transcripts.gff
fi 
Juke34 commented 4 years ago

The GFF file has 119 CDS entries, but the fasta only has 116 sequences with the last sequence being empty. Is it a bug in agat_sp_extract_sequences.pl?

Not a bug, CDS can span several locations in eukaryote (splicing), so we call that multi-line features, each location is a CDS feature, and when we concatenate all the CDS pieces together it represents the CDS as it is within the mature RNA, i.e:

I       ensembl transcript      147594  151166  .       -       .       ID=YAL001C;Parent=YAL001C;geneID=YAL001C;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    147594  151006  .       -       .       Parent=YAL001C;exon_id=YAL001C.2;exon_number=2;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    151097  151166  .       -       .       Parent=YAL001C;exon_id=YAL001C.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     147594  151006  .       -       2       Parent=YAL001C;exon_number=2;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     151097  151166  .       -       0       Parent=YAL001C;exon_number=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988

To make it short number of CDS not necessarily = to number of coding mRNA/transcript in GFF/GTF

Juke34 commented 4 years ago

If think should be enough:

if test -f maker_results_noAbinitio_clean/mrna.gff && test -f maker_results_noAbinitio_clean/transcript.gff; then
    cat maker_results_noAbinitio_clean/transcript.gff > mrna.gff
elif test -f maker_results_noAbinitio_clean/transcript.gff; then
    cp maker_results_noAbinitio_clean/transcript.gff mrna.gff
fi 
mahesh-panchal commented 4 years ago

If think should be enough:

if test -f maker_results_noAbinitio_clean/mrna.gff && test -f maker_results_noAbinitio_clean/transcript.gff; then
    cat maker_results_noAbinitio_clean/transcript.gff > mrna.gff
elif test -f maker_results_noAbinitio_clean/transcript.gff; then
    cp maker_results_noAbinitio_clean/transcript.gff mrna.gff
fi 

So I can just overwrite the mrna.gff with transcript.gff, regardless if mrna.gff exists?

mahesh-panchal commented 4 years ago

The GFF file has 119 CDS entries, but the fasta only has 116 sequences with the last sequence being empty. Is it a bug in agat_sp_extract_sequences.pl?

Not a bug, CDS can span several locations in eukaryote (splicing), so we call that multi-line features, each location is a CDS feature, and when we concatenate all the CDS pieces together it represents the CDS as it is within the mature RNA, i.e:

I       ensembl transcript      147594  151166  .       -       .       ID=YAL001C;Parent=YAL001C;geneID=YAL001C;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    147594  151006  .       -       .       Parent=YAL001C;exon_id=YAL001C.2;exon_number=2;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    151097  151166  .       -       .       Parent=YAL001C;exon_id=YAL001C.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     147594  151006  .       -       2       Parent=YAL001C;exon_number=2;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     151097  151166  .       -       0       Parent=YAL001C;exon_number=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988

To make it short number of CDS not necessarily = to number of coding mRNA/transcript in GFF/GTF

There is still the issue of the empty sequence being written. What should we do to solve it?

Juke34 commented 4 years ago

Sorry no, my mistake it should be

cat maker_results_noAbinitio_clean/transcript.gff >> mrna.gff

not

cat maker_results_noAbinitio_clean/transcript.gff > mrna.gff
mahesh-panchal commented 4 years ago

Sorry no, my mistake it should be

cat maker_results_noAbinitio_clean/transcript.gff >> mrna.gff

not

cat maker_results_noAbinitio_clean/transcript.gff > mrna.gff

Do we need to worry about gff headers?

Juke34 commented 4 years ago

No we can discard it if you prefer. But I think AGAT, will at least add ##gff-version 3

Juke34 commented 4 years ago

ArF I understand what you mean now, yes we should worry about the header when we append the file. We should skip all lines starting by # This would be wrong

##gff-version 3
I       ensembl transcript      147594  151166  .       -       .       ID=YAL001C;Parent=YAL001C;geneID=YAL001C;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    147594  151006  .       -       .       Parent=YAL001C;exon_id=YAL001C.2;exon_number=2;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    151097  151166  .       -       .       Parent=YAL001C;exon_id=YAL001C.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     147594  151006  .       -       2       Parent=YAL001C;exon_number=2;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     151097  151166  .       -       0       Parent=YAL001C;exon_number=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
##gff-version 3
I       ensembl transcript      147594  151166  .       -       .       ID=YAL001C;Parent=YAL001C;geneID=YAL001C;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    147594  151006  .       -       .       Parent=YAL001C;exon_id=YAL001C.2;exon_number=2;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl exon    151097  151166  .       -       .       Parent=YAL001C;exon_id=YAL001C.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     147594  151006  .       -       2       Parent=YAL001C;exon_number=2;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
I       ensembl CDS     151097  151166  .       -       0       Parent=YAL001C;exon_number=1;gene_biotype=protein_coding;gene_name=TFC3;gene_source=ensembl;gene_version=1;p_id=P6179;protein_id=YAL001C;protein_version=1;transcript_biotype=protein_coding;transcript_name=TFC3;transcript_source=ensembl;transcript_version=1;tss_id=TSS1988
mahesh-panchal commented 4 years ago

Does your AGAT toolkit have a merge tool?

Juke34 commented 4 years ago

Yes actually the easier is probably in this case to use agat_sp_merge_annotations.pl

mahesh-panchal commented 4 years ago

OK. split_maker_evidence process patched. New error:

Oops .. something went wrong
Error executing process > 'augustus_training_dataset:blast_makeblastdb (codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins type: null)'

Caused by:
  Process `augustus_training_dataset:blast_makeblastdb (codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins type: null)` terminated with an error exit status (1)

Command executed:

  makeblastdb -in codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta -dbtype prot

Command exit status:
  1

Command output:

  Building a new DB, current time: 02/24/2020 13:03:18
  New DB name:   /scratch/nxf.PgrGTwO6Wc/codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta
  New DB title:  codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta
  Sequence type: Protein
  Keep MBits: T
  Maximum file size: 1000000000B

Command error:
  BLAST options error: File codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta is empty

Work dir:
  /home/mahpa906/Nextflow/NBIS_annotation_pipeline_test/work/c0/511a8362016d5813c409a54047c99b

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
mahesh-panchal commented 4 years ago

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/AugustusTraining 👍

Juke34 commented 4 years ago

Excellent!

Juke34 commented 4 years ago

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/FunctionalAnnotationPreparation still does not work but at least no more empty sequence in the protein file

Invalid input specified for -appl/--applications parameter:
Analysis all does not exist or is deactivated.
Juke34 commented 4 years ago

Ok it comes form -appl all that does not exist. in Bpipe I was doing

    var db_list : ""
    if (INTERPRO_DB_LIST.toLowerCase() != "all"){
        db_list = "-appl $INTERPRO_DB_LIST"
    } 

    produce(input+".gff3",input+".tsv",input+".xml") {
        exec "$INTERPROSCAN $db_list -i $input -d ${output.dir} -iprlookup -goterms -pa -dp > /dev/null 2> /dev/null ","interpro"
        }

So in order to use all DB no parameter should be set, and if we specify Databases then we need to set -appl with the list of DB e.g -appl PfamA-26.0,TIGRFAM-12.0,ProDom-2006.1,PIRSF-2.81

mahesh-panchal commented 4 years ago

nextflow run -profile nbis,conda,test $NBIS_ANNOTATION_REPO/FunctionalAnnotationPreparation 👍