Closed soungalo closed 1 year ago
The sequence extracted is different than the length expected.
It can occur when you try to extract location outside the séquence boundaries or when we mess up the fasta index file. But I think it is not the case here.
Can you show me the whole CDS features related to the record mentioned by the warning?
2_RaGOO indexed_genome gene 17685116 17687318 . - . ID=AT2G48080.1.path1;Name=AT2G4
8080.1;Dir=sense
2_RaGOO indexed_genome mRNA 17685116 17687318 . - . ID=AT2G48080.1.mrna1;Name=AT2G4
8080.1;Parent=AT2G48080.1.path1;Dir=sensecoverage=100.0;identity=100.0;matches=1619;mismatches=0;indels=0;unknowns=0
2_RaGOO indexed_genome exon 17686736 17687318 100 - . ID=AT2G48080.1.mrna1.exon1;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 1 583 +
2_RaGOO indexed_genome exon 17686361 17686478 100 - . ID=AT2G48080.1.mrna1.exon2;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 584 701 +
2_RaGOO indexed_genome exon 17686172 17686273 100 - . ID=AT2G48080.1.mrna1.exon3;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 702 803 +
2_RaGOO indexed_genome exon 17685976 17686085 100 - . ID=AT2G48080.1.mrna1.exon4;Name
=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 804 913 +
2_RaGOO indexed_genome exon 17685737 17685891 100 - . ID=AT2G48080.1.mrna1.exon5;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 914 1068 +
2_RaGOO indexed_genome exon 17685116 17685666 100 - . ID=AT2G48080.1.mrna1.exon6;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 1069 1619 +
2_RaGOO indexed_genome CDS 17686736 17687218 100 - 0 ID=AT2G48080.1.mrna1.cds1;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 101 583 +
2_RaGOO indexed_genome CDS 17686361 17686478 100 - 0 ID=AT2G48080.1.mrna1.cds2;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 584 701 +
2_RaGOO indexed_genome CDS 17686172 17686273 100 - 1 ID=AT2G48080.1.mrna1.cds3;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 702 803 +
2_RaGOO indexed_genome CDS 17685976 17686085 100 - 1 ID=AT2G48080.1.mrna1.cds4;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 804 913 +
2_RaGOO indexed_genome CDS 17685737 17685891 100 - 0 ID=AT2G48080.1.mrna1.cds5;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 914 1068 +
2_RaGOO indexed_genome CDS 17685318 17685666 100 - 2 ID=AT2G48080.1.mrna1.cds6;Name=AT2G48080.1;Parent=AT2G48080.1.mrna1;Target=AT2G48080.1 1069 1417 +
I have tested this case and I don't get any WARNING. Are you sure this case is throwing a Warning?
If yes, test by removing the .index
file and re-running the command.
You didn't give any news for a while. I close the issue. Feel free to reopen it if you still need help.
Hello- I am having an issue with this warning also. I am trying to extracting some protein sequences from a maker annotation gff to train augustus and getting this error for many genes e.g.
WARNING: Problem ! The size of the extracted sequence 2329 is different than the specified span: 2328.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=scaffold21 ; seq_id_correct=scaffold21 ; start=15333817 ; end=15336144 ; scaffold21 sequence length: 17624548 )
It states that this is an known issue with maker- is there a known solution? Here is the region in my gff:
scaffold7 maker exon 5881887 5882034 . + . ID=maker-scaffold7-augustus-gene-59.41-mRNA-1:4;Parent=maker-scaffold7-augustus-gene-59.41-mRNA-1
scaffold7 maker CDS 5881887 5882034 . + 1 ID=maker-scaffold7-augustus-gene-59.41-mRNA-1:cds;Parent=maker-scaffold7-augustus-gene-59.41-mRNA-1
In your case the problem is only one bp, it could also be to a problem when converting/handling annotation in different format: e.g. BED vs GFF, the first is base-0 while the later is base-1 (see https://agat.readthedocs.io/en/latest/gff_to_bed.html).
The region you show is not the one it complains (scaffold7 vs scaffold21) Remove you index file and retry. Also think to update AGAT to a recent version.
Hello,
I am having exactly the same problem using version 1.0.0 (anaconda installation). I tried to extract the cds.
You see below that all the checks passed but some fixes are made:
Using standard /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/config.yaml file
We will extract the cds sequences.
Reading file Neoechinorhynchus_agilis.gff3
------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.0.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------
------ Start parsing ------
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
* locus_tag
* gene_id
=> merge_loci option deactivated
=> Machine information:
This script is being run by perl v5.32.1
Bioperl location being used: /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/
Operating system being used: linux
=> Accessing Ontology
No ontology accessible from the gff file header!
We use the SOFA ontology distributed with AGAT:
/home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
Read ontology /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
4 root terms, and 2596 total terms, and 1516 leaf terms
Filtering ontology:
We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 96133
=> Number of comment lines: 1
=> Fasta included: No
=> Number of features lines: 96132
=> Number of feature type (3rd column): 7
* Level1: 1 => gene
* level2: 2 => tRNA mRNA
* level3: 4 => CDS three_prime_UTR five_prime_UTR exon
* unknown: 0 =>
=> Version of the Bioperl GFF parser selected by AGAT: 3
------ End parsing (done in 16 second) ------
------ Start checks ------
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------
------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------
-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------
--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------
--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------
--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------
------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 2 seconds -------------------------------
------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------
----------------------------- Check9: check exons ------------------------------
No exons created
158 exons locations modified that were wrong
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 1 seconds -------------------------------
----------------------------- Check10: check utrs ------------------------------
9 UTRs created that were missing
152 UTRs locations modified that were wrong
No supernumerary UTRs removed
------------------------------ done in 1 seconds -------------------------------
------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 2 seconds -------------------------------
------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------
---------------------- Check13: remove identical isoforms ----------------------
None found
------------------------------ done in 0 seconds -------------------------------
------ End checks (done in 6 second) ------
However, afterwards it throws many warnings and exception:
WARNING: Problem ! The size of the extracted sequence 1018 is different than the specified span: 1020.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=scaffold_2 ; seq_id_correct=scaffold_2 ; start=16374858 ; end=16375877 ; scaffold_2 sequence length: 16376219 )
------------- EXCEPTION -------------
MSG: Failed validation of sequence '[unidentified sequence]'. Invalid characters were: >_3
STACK Bio::PrimarySeq::validate_seq /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:338
STACK Bio::PrimarySeq::_set_seq_by_ref /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:287
STACK Bio::PrimarySeq::seq /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:272
STACK Bio::PrimarySeq::new /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/PrimarySeq.pm:229
STACK Bio::Seq::new /home/lege/anaconda3/envs/agat/lib/perl5/site_perl/Bio/Seq.pm:499
STACK main::create_seqObj /home/lege/anaconda3/envs/agat/bin/agat_sp_extract_sequences.pl:711
STACK main::extract_sequences /home/lege/anaconda3/envs/agat/bin/agat_sp_extract_sequences.pl:576
STACK toplevel /home/lege/anaconda3/envs/agat/bin/agat_sp_extract_sequences.pl:216
-------------------------------------
This is the region from the gff:
scaffold_2 funannotate gene 16374654 16375877 . - . ID=FUN_008535;
scaffold_2 funannotate mRNA 16374654 16375877 . - . ID=FUN_008535-T1;Parent=FUN_008535;product=hypothetical protein;
scaffold_2 funannotate exon 16374837 16375877 . - . ID=FUN_008535-T1.exon1;Parent=FUN_008535-T1;
scaffold_2 funannotate exon 16374654 16374790 . - . ID=FUN_008535-T1.exon2;Parent=FUN_008535-T1;
scaffold_2 funannotate three_prime_UTR 16374837 16374857 . - . ID=FUN_008535-T1.utr3p1;Parent=FUN_008535-T1;
scaffold_2 funannotate three_prime_UTR 16374654 16374790 . - . ID=FUN_008535-T1.utr3p2;Parent=FUN_008535-T1;
scaffold_2 funannotate CDS 16374858 16375877 . - 0 ID=FUN_008535-T1.cds;Parent=FUN_008535-T1;
As you mentioned the difference is always 1bp or 2 bp which is strange because I did not make any file conversions (the gff file. I have difficulty understanding where the problem comes from for example:
WARNING: Problem ! The size of the extracted sequence 1018 is different than the specified span: 1020.
Why is there a difference between the extracted sequence and the specified span? Shouldn't the specified span in the gff used to extract the sequence of the same length by agat?
I assume since I am extracting cds, it should be divisible by 3 which is not for the extracted sequence. Why?
best and thanks Alex
The thrown EXCEPTION is due the header of a fasta sequence that sounds wrong.
Could you look at >_3
: grep ">_3" file.fasta
For the warning I need your example to test it myself... at least the fasta file
Hello,
This command gives no output so no match in the file:
grep ">_3" file.fasta
I send you the fasta file and gff file.
thanks Alex
Each contig sequence is one line in your file, this is not recommended in general, and it does not work with Bioperl. please try to fold your sequence first, it might fix your problem:
fold input.fa > output.fa
The fold command should work in most linux shell
Thanks a lot. After converting the fasta to interleaved format (as suggested) the warning disappears.
best
Hello! I'm having the same problem as several people here. I'm trying to extract coding regions from a gff file and its associated fasta but I'm having this 1bp misplacement. I've tried erasing the index file of the reference genome but I keeps appearing the same mistake in several CDS
****Thats the warning WARNING: Problem ! The size of the extracted sequence 97 is different than the specified span: 96. That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed. As last possibility your gff contains location errors (Already encountered for a Maker annotation) Supplement information: seq_id=RL_2 ; seq_id_correct=RL_2 ; start=138302134 ; end=138302229 ; RL_2 sequence length: 209219848 )
**** This is the gff region RL_2 GeMoMa exon 138302134 138302229 . + . ID=exon-149158;Parent=anno2.Spha_rna-XM_048494373.1_R0;gene_id=g_4906;transcript_id=anno2.Spha_rna-XM_048494373.1_R0 RL_2 GeMoMa CDS 138302134 138302229 . + 0 ID=cds-312288;Parent=anno2.Spha_rna-XM_048494373.1_R0;gene_id=g_4906;transcript_id=anno2.Spha_rna-XM_048494373.1_R0
Hello, I am running the command:
agat_sp_extract_sequences.pl -g genome.gff3 -f genome.fasta -p -o proteins_AGAT.fasta
and keep getting warnings like this one:Can you please explain what this means and why I'm getting it? I'm pretty sure there is no problem with the fasta-gff correspondence or the indexing. Should I be worried or just ignore? Thanks!