NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_sp_fix_overlaping_genes.pl says no overlapping genes found #440

Closed Rohit-Satyam closed 3 months ago

Rohit-Satyam commented 3 months ago

Describe the bug

I was wondering whether agat_sp_fix_overlaping_genes.pl performs a similar function to collapse_annotation.py. If so, I am puzzled by the result I get when I run it on a GENCODE file, where the tool says: "No gene overlaping with different name has been found !".

(agat) subudhak@KW61216$ agat_sp_fix_overlaping_genes.pl -f gffs/gencode.v45.primary_assembly.annotation.gtf -o temp_agat.gtf 
Using standard /home/subudhak/miniconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/agat_config.yaml file
File temp_agat.gtf already exist.
(agat) subudhak@KW61216:~/Downloads/check_gtf_collapse$ agat_sp_fix_overlaping_genes.pl -f gffs/gencode.v45.primary_assembly.annotation.gtf -o temp_agat_gencode.gtf 
Using standard /home/subudhak/miniconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/agat_config.yaml file
Parse file gffs/gencode.v45.primary_assembly.annotation.gtf

 ------------------------------------------------------------------------------
|   Another GFF Analysis Toolkit (AGAT) - Version: v1.3.0                      |
|   https://github.com/NBISweden/AGAT                                          |
|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
 ------------------------------------------------------------------------------

                          ------ Start parsing ------                           
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /home/subudhak/miniconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
    * locus_tag
    * gene_id
=> merge_loci option deactivated
=> Machine information:
    This script is being run by perl v5.32.1
    Bioperl location being used: /home/subudhak/miniconda3/envs/agat/lib/perl5/site_perl/Bio/
    Operating system being used: linux 
=> Accessing Ontology
    No ontology accessible from the gff file header!
    We use the SOFA ontology distributed with AGAT:
        /home/subudhak/miniconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
    Read ontology /home/subudhak/miniconda3/envs/agat/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
        4 root terms, and 2596 total terms, and 1516 leaf terms
    Filtering ontology:
        We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 3428060
=> Number of comment lines: 5
=> Fasta included: No
=> Number of features lines: 3428055
=> Number of feature type (3rd column): 8
    * Level1: 1 => gene
    * level2: 1 => transcript
    * level3: 6 => exon start_codon Selenocysteine UTR stop_codon CDS
    * unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 2

                 ------ End parsing (done in 798 second) ------                 

                           ------ Start checks ------                           
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
None found
------------------------------ done in 4 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 2 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 58 seconds ------------------------------

------------------------------ Check8: check cds -------------------------------
90413 CDS extended to include the stop_codon
986 CDS created to include the stop_codon that was on next exon
------------------------------ done in 5 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
No exons locations modified
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 33 seconds ------------------------------

----------------------------- Check10: check utrs ------------------------------
No UTRs created
88869 UTRs locations modified that were wrong
1321 UTRs removed that were supernumerary
------------------------------ done in 22 seconds ------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 39 seconds ------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 2 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
Lets remove isoform ENST00000673044.1
Lets remove isoform ENST00000692155.1
Lets remove isoform ENST00000649161.1
Lets remove isoform ENST00000704744.1
Lets remove isoform ENST00000616975.5
Lets remove isoform ENST00000692795.1
Lets remove isoform ENST00000646699.1
Lets remove isoform ENST00000689570.1
Lets remove isoform ENST00000679908.1
Lets remove isoform ENST00000672699.1
Lets remove isoform ENST00000681786.1
Lets remove isoform ENST00000645462.1
Lets remove isoform ENST00000695372.1
Lets remove isoform ENST00000650987.1
Lets remove isoform ENST00000639086.1
Lets remove isoform ENST00000678026.1
Lets remove isoform ENST00000611770.5
Lets remove isoform ENST00000683298.1
Lets remove isoform ENST00000691462.1
Lets remove isoform ENST00000651842.1
Lets remove isoform ENST00000678269.1
Lets remove isoform ENST00000513750.6
Lets remove isoform ENST00000675870.1
Lets remove isoform ENST00000479889.2
Lets remove isoform ENST00000682694.1
Lets remove isoform ENST00000678578.1
Lets remove isoform ENST00000650107.1
Lets remove isoform ENST00000644515.1
Lets remove isoform ENST00000703568.1
Lets remove isoform ENST00000643692.1
Lets remove isoform ENST00000675428.1
Lets remove isoform ENST00000693090.1
Lets remove isoform ENST00000688656.1
Lets remove isoform ENST00000681489.1
Lets remove isoform ENST00000703564.1
Lets remove isoform ENST00000692380.1
Lets remove isoform ENST00000679526.1
Lets remove isoform ENST00000684208.1
Lets remove isoform ENST00000673975.1
Lets remove isoform ENST00000705253.1
Lets remove isoform ENST00000692585.1
Lets remove isoform ENST00000552002.7
Lets remove isoform ENST00000706714.1
Lets remove isoform ENST00000677604.1
Lets remove isoform ENST00000682551.1
Lets remove isoform ENST00000703226.1
Lets remove isoform ENST00000673324.1
Lets remove isoform ENST00000705985.1
Lets remove isoform ENST00000672287.2
Lets remove isoform ENST00000432560.6
50 identical isoforms removed
------------------------------ done in 55 seconds ------------------------------
                  ------ End checks (done in 220 second) ------                 

gffs/gencode.v45.primary_assembly.annotation.gtf file parsed
No gene overlaping with different name has been found !
Formating output to GFF3
END


Besides, is there a way to make AGAT parse this file faster (just asking)?

Juke34 commented 3 months ago

Reading the code of collapse_annotation.py, it does not appear to perform the same task at all. collapse_annotation.py collapses the isoforms of a gene into a single "fake isoform", which may be useful for mapping but not as an annotation for, e.g., translation purposes.
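To make the "fake isoform" idea concrete, here is a minimal sketch of collapsing isoforms by taking the union of their exons. It is only an illustration of the concept, not the code of collapse_annotation.py, which may apply further rules. The function name collapse_isoforms and the coordinates are made up for the example.

```python
def collapse_isoforms(isoforms):
    """Merge the exons of all isoforms into one non-overlapping exon list.

    isoforms: list of isoforms, each a list of (start, end) exon tuples
    in 1-based inclusive coordinates, as in GTF/GFF.
    Hypothetical helper for illustration only.
    """
    # Pool the exons of every isoform and sort them by start position.
    exons = sorted(exon for isoform in isoforms for exon in isoform)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # This exon overlaps the previous block: extend that block.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: start a new block.
            merged.append((start, end))
    return merged


# Two made-up isoforms of the same gene.
isoform_a = [(100, 200), (300, 400)]
isoform_b = [(150, 250), (300, 350), (500, 600)]

collapsed = collapse_isoforms([isoform_a, isoform_b])
print(collapsed)
```

The result is a single exon structure covering every base that is exonic in at least one isoform, which is why it can serve for read counting but does not describe a real transcript.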

agat_sp_fix_overlaping_genes.pl, by contrast, only checks for cases where a locus has several genes annotated when it should be a single gene (because the CDS overlap and the genes are on the same strand). In that case a single gene is kept and all isoforms are attached to it. This can occur when you copy-paste two annotation files into a single file instead of using a proper merge/complement script.
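The criterion described above (same sequence, same strand, overlapping CDS) can be sketched as follows. This is not AGAT's implementation, just a small illustration of the rule. The Gene tuple, the function names and the example coordinates are all hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal gene record: cds is a list of (start, end) tuples,
# 1-based inclusive, pooled over all isoforms of the gene.
Gene = namedtuple("Gene", ["gene_id", "seqid", "strand", "cds"])


def cds_overlap(a, b):
    """True if any CDS interval of gene a overlaps any CDS interval of gene b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a.cds for s2, e2 in b.cds)


def group_overlapping_genes(genes):
    """Group genes on the same sequence and strand whose CDS overlap."""
    groups = []
    for gene in genes:
        hits, rest = [], []
        for group in groups:
            same_locus = any(
                m.seqid == gene.seqid
                and m.strand == gene.strand
                and cds_overlap(m, gene)
                for m in group
            )
            (hits if same_locus else rest).append(group)
        # Fuse every group the new gene touches, then add the gene itself.
        merged = [m for group in hits for m in group] + [gene]
        groups = rest + [merged]
    return groups


# Made-up example: geneA and geneB share CDS on the + strand of chr1.
genes = [
    Gene("geneA", "chr1", "+", [(100, 200), (300, 400)]),
    Gene("geneB", "chr1", "+", [(350, 450)]),
    Gene("geneC", "chr1", "-", [(120, 180)]),  # overlaps geneA, opposite strand
    Gene("geneD", "chr2", "+", [(100, 200)]),  # different sequence
]

ids = [[g.gene_id for g in group] for group in group_overlapping_genes(genes)]
print(ids)
```

Only groups with more than one member would be candidates for merging into a single gene. In a well-curated annotation such as GENCODE it is plausible that no such group exists, which would be consistent with the message reported in the log.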

Juke34 commented 3 months ago

There's no way of making it faster at the moment. It would have to be reimplemented to parallelize it, but I doubt anyone will ever spend time to do that.