Keep records based on canonical transcript ID

jahemker commented 8 months ago


I have a D. melanogaster gff3 from Ensembl, and I've extracted the transcript IDs of the canonical transcripts for each protein-coding gene. I would like to filter (keep) all entries in the gff3 that pertain to these canonical transcripts.

Based on my understanding, agat_sp_filter_feature_from_keep_list.pl should work for this, but I have not been able to filter the gff3 successfully: the output is always 0 records kept. I assume I am misunderstanding how this command works. Does AGAT have this functionality? Thank you!

gff3 from here: https://ftp.ensembl.org/pub/release-110/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.110.chr.gff3.gz

head of protein_coding_canonical_transcripts.txt


Output from report.txt

usage: /home/.../miniconda3/envs/genometools/bin/agat_sp_filter_feature_from_keep_list.pl --gff ../Drosophila_melanogaster.BDGP6.46.110.chr.gff3 --keep_list ../protein_coding_canonical_transcripts.txt --output Drosophila_melanogaster.BDGP6.46.110.protein_coding_canonical.gff3
We will keep the records that have all features sharing the value of the ID attribute with the keep list.
The keep list contains 13981 uniq IDs
0 records kept!

stdout when running the command

Using standard /home/.../miniconda3/envs/genometools/lib/perl5/site_perl/auto/share/dist/AGAT/agat_config.yaml file
10/10/2023 at 16h43m09s
usage: /home/.../miniconda3/envs/genometools/bin/agat_sp_filter_feature_from_keep_list.pl --gff ../Drosophila_melanogaster.BDGP6.46.110.chr.gff3 --keep_list ../protein_coding_canonical_transcripts.txt --output Drosophila_melanogaster.BDGP6.46.110.protein_coding_canonical.gff3
We will keep the records that have all features sharing the value of the ID attribute with the keep list.
The keep list contains 13981 uniq IDs

0 records kept!

Juke34 commented 4 weeks ago

Sorry, your message slipped under my radar.

Your usage is correct; it is just the IDs in your keep list that are wrong. If you look in your file you will see, e.g. for FBtr0070008, that the ID is defined as transcript:FBtr0070008:

X   FlyBase gene    20170222    20171526    .   +   .   ID=gene:FBgn0031094;Name=CG9578;biotype=protein_coding;gene_id=FBgn0031094;logic_name=flybase
X   FlyBase mRNA    20170222    20171526    .   +   .   ID=transcript:FBtr0070008;Parent=gene:FBgn0031094;Name=CG9578-RA;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=FBtr0070008
X   FlyBase five_prime_UTR  20170222    20170348    .   +   .   Parent=transcript:FBtr0070008
X   FlyBase exon    20170222    20170363    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=-1;exon_id=FBtr0070008-E1;rank=1
X   FlyBase CDS 20170349    20170363    .   +   0   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase exon    20170424    20170758    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E2;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=FBtr0070008-E2;rank=2
X   FlyBase CDS 20170424    20170758    .   +   0   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase exon    20170846    20171065    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E3;constitutive=1;ensembl_end_phase=0;ensembl_phase=2;exon_id=FBtr0070008-E3;rank=3
X   FlyBase CDS 20170846    20171065    .   +   1   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase CDS 20171130    20171378    .   +   0   ID=CDS:FBpp0070007;Parent=transcript:FBtr0070008;protein_id=FBpp0070007
X   FlyBase exon    20171130    20171526    .   +   .   Parent=transcript:FBtr0070008;Name=FBtr0070008-E4;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;exon_id=FBtr0070008-E4;rank=4
X   FlyBase three_prime_UTR 20171379    20171526    .   +   .   Parent=transcript:FBtr0070008

So you should replace FBtr0070008 with transcript:FBtr0070008 (and likewise for every other ID) in your protein_coding_canonical_transcripts.txt file.
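If it helps, here is a minimal Python sketch of that fix. It is not part of AGAT; the function names (extract_id, add_prefix, rewrite_keep_list) are hypothetical, and it assumes the keep list holds one bare transcript ID per line and that the gff3 is tab-separated with nine columns. extract_id lets you check what the ID attribute of a line really looks like, and rewrite_keep_list writes a new keep list with the transcript: prefix added.

```python
def extract_id(gff3_line):
    """Return the value of the ID attribute of one GFF3 line, or None.

    Assumes the standard tab-separated, nine-column GFF3 layout.
    """
    columns = gff3_line.rstrip("\n").split("\t")
    if gff3_line.startswith("#") or len(columns) != 9:
        return None
    for attribute in columns[8].split(";"):
        key, _, value = attribute.partition("=")
        if key == "ID":
            return value
    return None


def add_prefix(ids, prefix="transcript:"):
    """Prefix each non-empty ID unless it already carries the prefix."""
    fixed = []
    for raw in ids:
        identifier = raw.strip()
        if not identifier:
            continue  # skip blank lines
        if not identifier.startswith(prefix):
            identifier = prefix + identifier
        fixed.append(identifier)
    return fixed


def rewrite_keep_list(src_path, dst_path, prefix="transcript:"):
    """Read a keep list (one ID per line) and write the prefixed version.

    Returns the number of IDs written.
    """
    with open(src_path) as src:
        fixed = add_prefix(src, prefix)
    with open(dst_path, "w") as dst:
        dst.write("\n".join(fixed) + "\n")
    return len(fixed)
```

For example, rewrite_keep_list("protein_coding_canonical_transcripts.txt", "keep_list_prefixed.txt") would produce a file (the output name is only an example) that can then be passed to --keep_list.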