NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0
465 stars 56 forks source link

agat_sp_merge_annotations.pl fails to merge? #325

Closed dariober closed 1 year ago

dariober commented 1 year ago

Hi- thanks for the great set of tools! This is with agat 1.0.0 from bioconda. I cannot understand why the features in these two gff files are not merged by agat_sp_merge_annotations.pl:

cat b.gff3 
chr1    AUGUSTUS    gene    1000424 1039237 .   +   .   ID=A;
chr1    AUGUSTUS    mRNA    1000424 1039237 .   +   .   ID=A.t1;Parent=A;

cat g.gff3 
chr1    AUGUSTUS    gene    1000424 1039237 .   +   .   ID=B;
chr1    AUGUSTUS    mRNA    1000424 1039237 .   +   .   ID=B.t1;Parent=B;

I would expect the merged output to contain one gene and one mRNA only. Instead, the two genes remain distinct:

agat_sp_merge_annotations.pl -f b.gff3 -f g.gff3 -o m.gff3

cat m.gff3 
##gff-version 3
chr1    AUGUSTUS    gene    1000424 1039237 .   +   .   ID=A
chr1    AUGUSTUS    mRNA    1000424 1039237 .   +   .   ID=A.t1;Parent=A
chr1    AUGUSTUS    gene    1000424 1039237 .   +   .   ID=B
chr1    AUGUSTUS    mRNA    1000424 1039237 .   +   .   ID=B.t1;Parent=B

Output from agat_sp_merge_annotations below. Am I missing something?

Using standard /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/config.yaml file
********************************************************************************
*                              - Start parsing -                               *
********************************************************************************
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
    * locus_tag
    * gene_id
=> merge_loci option deactivated
=> Machine information:
    This script is being run by perl v5.32.1
    Bioperl location being used: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/Bio/
    Operating system being used: linux 
=> Accessing Ontology
    No ontology accessible from the gff file header!
    We use the SOFA ontology distributed with AGAT:
        /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
    Read ontology /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
        4 root terms, and 2596 total terms, and 1516 leaf terms
    Filtering ontology:
        We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 2
=> Number of comment lines: 0
=> Fasta included: No
=> Number of features lines: 2
=> Number of feature type (3rd column): 2
    * Level1: 1 => gene
    * level2: 1 => mRNA
    * level3: 0 => 
    * unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 3
********************************************************************************
*                               - End parsing -                                *
*                              done in 2 seconds                               *
********************************************************************************

********************************************************************************
*                               - Start checks -                               *
********************************************************************************
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
No exons locations modified
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check10: check utrs ------------------------------
No UTRs created
No UTRs locations modified
No supernumerary UTRs removed
------------------------------ done in 0 seconds -------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
None found
------------------------------ done in 0 seconds -------------------------------
********************************************************************************
*                                - End checks -                                *
*                              done in 0 seconds                               *
********************************************************************************

=> OmniscientI total time: 2 seconds
b.gff3 GFF3 file parsed
There is 1 mrna
There is 1 gene
********************************************************************************
*                              - Start parsing -                               *
********************************************************************************
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
    * locus_tag
    * gene_id
=> merge_loci option deactivated
=> Machine information:
    This script is being run by perl v5.32.1
    Bioperl location being used: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/Bio/
    Operating system being used: linux 
=> Accessing Ontology
    No ontology accessible from the gff file header!
    We use the SOFA ontology distributed with AGAT:
        /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
    Read ontology /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
        4 root terms, and 2596 total terms, and 1516 leaf terms
    Filtering ontology:
        We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 2
=> Number of comment lines: 0
=> Fasta included: No
=> Number of features lines: 2
=> Number of feature type (3rd column): 2
    * Level1: 1 => gene
    * level2: 1 => mRNA
    * level3: 0 => 
    * unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 3
********************************************************************************
*                               - End parsing -                                *
*                              done in 2 seconds                               *
********************************************************************************

********************************************************************************
*                               - Start checks -                               *
********************************************************************************
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
No exons locations modified
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check10: check utrs ------------------------------
No UTRs created
No UTRs locations modified
No supernumerary UTRs removed
------------------------------ done in 0 seconds -------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
None found
------------------------------ done in 0 seconds -------------------------------
********************************************************************************
*                                - End checks -                                *
*                              done in 0 seconds                               *
********************************************************************************

=> OmniscientI total time: 2 seconds
g.gff3 GFF3 file parsed
There is 1 gene
There is 1 mrna

Total raw data of files together:
There is 2 mrna
There is 2 gene

Now merging overlaping loci, and removing identical isoforms
Use of uninitialized value $nb_cases in concatenation (.) or string at /export/projects/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/bin/agat_sp_merge_annotations.pl line 77.
 overlapping cases found. For each case 2 loci have been merged within a single locus

final result:
There is 2 mrna
There is 2 gene
Formating output to GFF3
Juke34 commented 1 year ago

Thank you for reporting it Right AGAT is supposed to removes identical isoforms. You can modify this behaviour in the config.yaml (agat config --expose). But there is an obvious problem here. I will have a look!

Juke34 commented 1 year ago

Right it is because the overlap is checked at CDS level, if no CDS we use the exons. Here there is none. I could add if there is no level3 feature (CDS, exon, etc) test equality at level2 (ncRNA, mRNA, etc), then level1 (gene, match, etc).