NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0
457 stars 56 forks source link

agat_sp_extract_sequences.pl produces lots of warnings #53

Closed soungalo closed 1 year ago

soungalo commented 4 years ago

Hello, I am running the command: agat_sp_extract_sequences.pl -g genome.gff3 -f genome.fasta -p -o proteins_AGAT.fasta and keep getting warnings like this one:

WARNING: Problem ! The size of the sequence extracted 156 is different than the specified span: 155.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=2_RaGOO ; seq_id_correct=2_RaGOO ; start=17685737 ; end=17685891 ; 2_RaGOO sequence length: 17716423 )

Can you please explain what this means and why I'm getting it? I'm pretty sure there is no problem with the fasta-gff correspondence or the indexing. Should I be worried or just ignore? Thanks!

Juke34 commented 4 years ago

The sequence extracted is different than the length expected. It can occur when you try to extract location outside the séquence boundaries or when we mess up the fasta index file. But I think it is not the case here.
Can you show me the whole CDS features related to the record mentioned by the warning?

soungalo commented 4 years ago
2_RaGOO indexed_genome  gene    17685116        17687318        .       -       .       ID=AT2G48080.1.path1;Name=AT2G4
8080.1;Dir=sense
2_RaGOO indexed_genome  mRNA    17685116        17687318        .       -       .       ID=AT2G48080.1.mrna1;Name=AT2G4
8080.1;Parent=AT2G48080.1.path1;Dir=sensecoverage=100.0;identity=100.0;matches=1619;mismatches=0;indels=0;unknowns=0
2_RaGOO indexed_genome  exon    17686736        17687318        100     -       .       ID=AT2G48080.1.mrna1.exon1;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 1 583 +
2_RaGOO indexed_genome  exon    17686361        17686478        100     -       .       ID=AT2G48080.1.mrna1.exon2;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 584 701 +
2_RaGOO indexed_genome  exon    17686172        17686273        100     -       .       ID=AT2G48080.1.mrna1.exon3;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 702 803 +
2_RaGOO indexed_genome  exon    17685976        17686085        100     -       .       ID=AT2G48080.1.mrna1.exon4;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 804 913 +
2_RaGOO indexed_genome  exon    17685737        17685891        100     -       .       ID=AT2G48080.1.mrna1.exon5;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 914 1068 +
2_RaGOO indexed_genome  exon    17685116        17685666        100     -       .       ID=AT2G48080.1.mrna1.exon6;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 1069 1619 +
2_RaGOO indexed_genome  CDS     17686736        17687218        100     -       0       ID=AT2G48080.1.mrna1.cds1;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 101 583 +
2_RaGOO indexed_genome  CDS     17686361        17686478        100     -       0       ID=AT2G48080.1.mrna1.cds2;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 584 701 +
2_RaGOO indexed_genome  CDS     17686172        17686273        100     -       1       ID=AT2G48080.1.mrna1.cds3;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 702 803 +
2_RaGOO indexed_genome  CDS     17685976        17686085        100     -       1       ID=AT2G48080.1.mrna1.cds4;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 804 913 +
2_RaGOO indexed_genome  CDS     17685737        17685891        100     -       0       ID=AT2G48080.1.mrna1.cds5;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 914 1068 +
2_RaGOO indexed_genome  CDS     17685318        17685666        100     -       2       ID=AT2G48080.1.mrna1.cds6;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 1069 1417 +
Juke34 commented 4 years ago

I have tested this case and I don't get any WARNING. Are you sure this case is throwing a Warning? If yes, test by removing the .index file and re-running the command.

Juke34 commented 4 years ago

You didn't give any news for a while. I close the issue. Feel free to reopen it if you still need help.

EmilyPhelps commented 1 year ago

Hello- I am having an issue with this warning also. I am trying to extracting some protein sequences from a maker annotation gff to train augustus and getting this error for many genes e.g.

WARNING: Problem ! The size of the extracted sequence 2329 is different than the specified span: 2328.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=scaffold21 ; seq_id_correct=scaffold21 ; start=15333817 ; end=15336144 ; scaffold21 sequence length: 17624548 )

It states that this is an known issue with maker- is there a known solution? Here is the region in my gff:

scaffold7   maker   exon    5881887 5882034 .   +   .   ID=maker-scaffold7-augustus-gene-59.41-mRNA-1:4;Parent=maker-scaffold7-augustus-gene-59.41-mRNA-1
scaffold7   maker   CDS 5881887 5882034 .   +   1   ID=maker-scaffold7-augustus-gene-59.41-mRNA-1:cds;Parent=maker-scaffold7-augustus-gene-59.41-mRNA-1
Juke34 commented 1 year ago

In your case the problem is only one bp, it could also be to a problem when converting/handling annotation in different format: e.g. BED vs GFF, the first is base-0 while the later is base-1 (see https://agat.readthedocs.io/en/latest/gff_to_bed.html).

The region you show is not the one it complains (scaffold7 vs scaffold21) Remove you index file and retry. Also think to update AGAT to a recent version.

alexvasilikop commented 1 year ago

Hello,

I am having exactly the same problem using version 1.0.0 (anaconda installation). I tried to extract the cds.

You see below that all the checks passed but some fixes are made:


Using standard /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/config.yaml file
We will extract the cds sequences.
Reading file Neoechinorhynchus_agilis.gff3

 ------------------------------------------------------------------------------
|   Another GFF Analysis Toolkit (AGAT) - Version: v1.0.0                      |
|   https://github.com/NBISweden/AGAT                                          |
|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
 ------------------------------------------------------------------------------

                          ------ Start parsing ------                           
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
    * locus_tag
    * gene_id
=> merge_loci option deactivated
=> Machine information:
    This script is being run by perl v5.32.1
    Bioperl location being used: /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/
    Operating system being used: linux 
=> Accessing Ontology
    No ontology accessible from the gff file header!
    We use the SOFA ontology distributed with AGAT:
        /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
    Read ontology /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
        4 root terms, and 2596 total terms, and 1516 leaf terms
    Filtering ontology:
        We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 96133
=> Number of comment lines: 1
=> Fasta included: No
=> Number of features lines: 96132
=> Number of feature type (3rd column): 7
    * Level1: 1 => gene
    * level2: 2 => tRNA mRNA
    * level3: 4 => CDS three_prime_UTR five_prime_UTR exon
    * unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 3
                  ------ End parsing (done in 16 second) ------                 

                           ------ Start checks ------                           
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 2 seconds -------------------------------

------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
158 exons locations modified that were wrong
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 1 seconds -------------------------------

----------------------------- Check10: check utrs ------------------------------
9 UTRs created that were missing
152 UTRs locations modified that were wrong
No supernumerary UTRs removed
------------------------------ done in 1 seconds -------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 2 seconds -------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
None found
------------------------------ done in 0 seconds -------------------------------
                  ------ End checks (done in 6 second) ------                   

However, afterwards it throws many warnings and exception:

WARNING: Problem ! The size of the extracted sequence 1018 is different than the specified span: 1020.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=scaffold_2 ; seq_id_correct=scaffold_2 ; start=16374858 ; end=16375877 ; scaffold_2 sequence length: 16376219 )

------------- EXCEPTION -------------
MSG: Failed validation of sequence '[unidentified sequence]'. Invalid characters were: >_3
STACK Bio::PrimarySeq::validate_seq /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:338
STACK Bio::PrimarySeq::_set_seq_by_ref /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:287
STACK Bio::PrimarySeq::seq /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:272
STACK Bio::PrimarySeq::new /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:229
STACK Bio::Seq::new /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/Seq.pm:499
STACK main::create_seqObj /home/lege/anaconda3/envs/agat/bin/agat_sp_extract_sequences.pl:711
STACK main::extract_sequences /home/lege/anaconda3/envs/agat/bin/agat_sp_extract_sequences.pl:576
STACK toplevel /home/lege/anaconda3/envs/agat/bin/agat_sp_extract_sequences.pl:216
-------------------------------------

This is the region from the gff:

scaffold_2  funannotate gene    16374654    16375877    .   -   .   ID=FUN_008535;
scaffold_2  funannotate mRNA    16374654    16375877    .   -   .   ID=FUN_008535-T1;Parent=FUN_008535;product=hypothetical protein;
scaffold_2  funannotate exon    16374837    16375877    .   -   .   ID=FUN_008535-T1.exon1;Parent=FUN_008535-T1;
scaffold_2  funannotate exon    16374654    16374790    .   -   .   ID=FUN_008535-T1.exon2;Parent=FUN_008535-T1;
scaffold_2  funannotate three_prime_UTR 16374837    16374857    .   -   .   ID=FUN_008535-T1.utr3p1;Parent=FUN_008535-T1;
scaffold_2  funannotate three_prime_UTR 16374654    16374790    .   -   .   ID=FUN_008535-T1.utr3p2;Parent=FUN_008535-T1;
scaffold_2  funannotate CDS 16374858    16375877    .   -   0   ID=FUN_008535-T1.cds;Parent=FUN_008535-T1;

As you mentioned the difference is always 1bp or 2 bp which is strange because I did not make any file conversions (the gff file. I have difficulty understanding where the problem comes from for example:

WARNING: Problem ! The size of the extracted sequence 1018 is different than the specified span: 1020.

Why is there a difference between the extracted sequence and the specified span? Shouldn't the specified span in the gff used to extract the sequence of the same length by agat?

I assume since I am extracting cds, it should be divisible by 3 which is not for the extracted sequence. Why?

best and thanks Alex

Juke34 commented 1 year ago

The thrown EXCEPTION is due the header of a fasta sequence that sounds wrong. Could you look at >_3: grep ">_3" file.fasta

For the warning I need your example to test it myself... at least the fasta file

alexvasilikop commented 1 year ago

Hello,

This command gives no output so no match in the file:

grep ">_3" file.fasta

I send you the fasta file and gff file.

thanks Alex

Juke34 commented 1 year ago

Each contig sequence is one line in your file, this is not recommended in general, and it does not work with Bioperl. please try to fold your sequence first, it might fix your problem:

fold input.fa > output.fa

The fold command should work in most linux shell

alexvasilikop commented 1 year ago

Thanks a lot. After converting the fasta to interleaved format (as suggested) the warning disappears.

best

BernatBurriel commented 1 year ago

Hello! I'm having the same problem as several people here. I'm trying to extract coding regions from a gff file and its associated fasta but I'm having this 1bp misplacement. I've tried erasing the index file of the reference genome but I keeps appearing the same mistake in several CDS

****Thats the warning WARNING: Problem ! The size of the extracted sequence 97 is different than the specified span: 96. That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed. As last possibility your gff contains location errors (Already encountered for a Maker annotation) Supplement information: seq_id=RL_2 ; seq_id_correct=RL_2 ; start=138302134 ; end=138302229 ; RL_2 sequence length: 209219848 )

**** This is the gff region RL_2 GeMoMa exon 138302134 138302229 . + . ID=exon-149158;Parent=anno2.Spha_rna-XM_048494373.1_R0;gene_id=g_4906;transcript_id=anno2.Spha_rna-XM_048494373.1_R0 RL_2 GeMoMa CDS 138302134 138302229 . + 0 ID=cds-312288;Parent=anno2.Spha_rna-XM_048494373.1_R0;gene_id=g_4906;transcript_id=anno2.Spha_rna-XM_048494373.1_R0