NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0
432 stars 52 forks source link

Keep records based on canonical transcript ID #400

Closed jahemker closed 4 weeks ago

jahemker commented 8 months ago

Hello,

I have a D.melanogaster gff3 from Ensembl and I've extracted the transcript ids of the canonical transcripts for each protein coding gene. I would like to filter (keep) all entries in the gff3 that pertain to these canonical transcripts.

Based on my understanding it seems that agat_sp_filter_feature_from_keep_list.pl would work for this, however I have not been able to successfully filter the gff3. The output is always 0 records kept. I assume I am misunderstanding how this command works. Does AGAT have this functionality? Thank you!

gff3 from here: https://ftp.ensembl.org/pub/release-110/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.110.chr.gff3.gz

head of protein_coding_canonical_transcripts.txt

FBtr0006151
FBtr0070002
FBtr0070003
FBtr0070006
FBtr0070007
FBtr0070008
FBtr0070026
FBtr0070029
FBtr0070032
FBtr0070035

Output from report.txt

usage: /home/.../miniconda3/envs/genometools/bin/agat_sp_filter_feature_from_keep_list.pl --gff ../Drosophila_melanogaster.BDGP6.46.110.chr.gff3 --keep_list ../protein_coding_canonical_transcripts.txt --output Drosophila_melanogaster.BDGP6.46.110.protein_coding_canonical.gff3
We will keep the records that have all features sharing the value of the ID attribute with the keep list.
The keep list contains 13981 uniq IDs
0 records kept!

stdout when running the command

Using standard /home/.../miniconda3/envs/genometools/lib/perl5/site_perl/auto/share/dist/AGAT/agat_config.yaml file
10/10/2023 at 16h43m09s
usage: /home/.../miniconda3/envs/genometools/bin/agat_sp_filter_feature_from_keep_list.pl --gff ../Drosophila_melanogaster.BDGP6.46.110.chr.gff3 --keep_list ../protein_coding_canonical_transcripts.txt --output Drosophila_melanogaster.BDGP6.46.110.protein_coding_canonical.gff3
We will keep the records that have all features sharing the value of the ID attribute with the keep list.
The keep list contains 13981 uniq IDs

 ------------------------------------------------------------------------------
|   Another GFF Analysis Toolkit (AGAT) - Version: v1.2.0                      |
|   https://github.com/NBISweden/AGAT                                          |
|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
 ------------------------------------------------------------------------------

                          ------ Start parsing ------                           
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /home/.../miniconda3/envs/genometools/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
    * locus_tag
    * gene_id
=> merge_loci option deactivated
=> Machine information:
    This script is being run by perl v5.32.1
    Bioperl location being used: /home/.../miniconda3/envs/genometools/lib/perl5/site_perl/Bio/
    Operating system being used: linux 
=> Accessing Ontology
    No ontology accessible from the gff file header!
    We use the SOFA ontology distributed with AGAT:
        /home/.../miniconda3/envs/genometools/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
    Read ontology /home/.../miniconda3/envs/genometools/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
        4 root terms, and 2596 total terms, and 1516 leaf terms
    Filtering ontology:
        We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 530401
=> Number of comment lines: 24249
=> Fasta included: No
=> Number of features lines: 506152
=> Number of feature type (3rd column): 19
    * Level1: 6 => ncRNA_gene transposable_element_gene gene transposable_element pseudogene region
    * level2: 9 => snoRNA snRNA rRNA pseudogenic_transcript pre_miRNA mRNA ncRNA tRNA miRNA
    * level3: 4 => three_prime_UTR CDS exon five_prime_UTR
    * unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 3
                  ------ End parsing (done in 58 second) ------                 

                           ------ Start checks ------                           
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
None found
------------------------------ done in 2 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
5891 cases fixed where L3 features have parent feature(s) missing
------------------------------ done in 1 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
5891 cases removed where L1 features do not have children (while they are suposed to have children).
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 10 seconds ------------------------------

------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
No exons locations modified
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 6 seconds -------------------------------

----------------------------- Check10: check utrs ------------------------------
No UTRs created
No UTRs locations modified
No supernumerary UTRs removed
------------------------------ done in 4 seconds -------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 7 seconds -------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
Lets remove isoform transcript:FBtr0307760
Lets remove isoform transcript:FBtr0084084
Lets remove isoform transcript:FBtr0084082
Lets remove isoform transcript:FBtr0481732
Lets remove isoform transcript:FBtr0479842
5 identical isoforms removed
------------------------------ done in 10 seconds ------------------------------
                  ------ End checks (done in 40 second) ------                  

Parsing Finished
Formating output to GFF3
0 records kept!
Juke34 commented 4 weeks ago

Sorry your message stayed under my radar.

Your usage is good it is just the ID defined that are wrong. Indeed if you look in your file you will see e.g. for FBtr0070008 that the ID is defined as transcript:FBtr0070008:

###
X   FlyBase gene    20170222    20171526    .   +   .   ID=gene:FBgn0031094;Name=CG9578;biotype=protein_coding;gene_id=FBgn0031094;logic_name=flybase
X   FlyBase mRNA    20170222    20171526    .   +   .   ID=transcript:FBtr0070008;Parent=gene:FBgn0031094;Name=CG9578-RA;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=FBtr0070008
X   FlyBase five_prime_UTR  20170222    20170348    .   +   .   Parent=transcript:FBtr0070008
X   FlyBase exon    20170222    20170363    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=-1;exon_id=FBtr0070008-E1;rank=1
X   FlyBase CDS 20170349    20170363    .   +   0   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase exon    20170424    20170758    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E2;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=FBtr0070008-E2;rank=2
X   FlyBase CDS 20170424    20170758    .   +   0   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase exon    20170846    20171065    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E3;constitutive=1;ensembl_end_phase=0;ensembl_phase=2;exon_id=FBtr0070008-E3;rank=3
X   FlyBase CDS 20170846    20171065    .   +   1   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase CDS 20171130    20171378    .   +   0   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase exon    20171130    20171526    .   +   .   sfdParent=transcript:FBtr0070008;Name=FBtr0070008-E4;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;exon_id=FBtr0070008-E4;rank=4
X   FlyBase three_prime_UTR 20171379    20171526    .df +   .   Parent=transcript:FBtr0070008
###

So you should replaceFBtr0070008 by transcript:FBtr0070008 in your protein_coding_canonical_transcripts.txt file.