NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Can AGAT `agat_convert_sp_gff2gtf.pl` add transcript_type tag when converting GFF3 to GTF #398

Closed Rohit-Satyam closed 5 months ago

Rohit-Satyam commented 10 months ago

Hi

I have been using AGAT for two years now and it has worked wonders. I recently discovered a new RNASeq QC tool that requires a feature-collapsed GTF.

NOTE: This tool requires that the provided GTF be collapsed in such a way that there are no overlapping transcripts on the same strand and that each gene have a single transcript whose id matches the parent gene id. This is not a transcript-quantification method. Readcounts and coverage are made towards exons and genes only if all aligned segments of a read fully align to exons of a gene, but keep in mind that coverage may be counted towards multiple transcripts (and its exons) if these criteria are met. Beyond this, no attempt will be made to disambiguate which transcript a read belongs to. You can collapse an existing GTF using the GTEx collapse annotation script.

This GTEx collapse annotation script, however, requires the transcript_type and gene_type attributes to be present in the GTF. For GENCODE files this is not a problem, but for non-model organisms (e.g. Plasmodium falciparum) that only have a GFF3 file, which was converted to GTF using AGAT, these tags are missing. Instead, my AGAT-generated file has the following tags:

 gene_ebi_biotype "protein_coding"; original_biotype "mrna"
Pf3D7_09_v3 VEuPathDB   transcript  29797   31157   .   -   .   gene_id "PF3D7_0900200"; transcript_id "PF3D7_0900200.1"; ID "PF3D7_0900200.1"; Parent "PF3D7_0900200"; description "rifin"; gene_ebi_biotype "protein_coding"; original_biotype "mrna";
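
To check which of the expected tags a record is missing, column 9 can be parsed directly. This is only a minimal Python sketch, not part of AGAT: the helper names `parse_gtf_attributes` and `missing_tags` are made up for illustration, and it assumes tab-separated columns and the usual `key "value";` attribute layout (unquoted values are ignored).

```python
import re

# Matches one GTF attribute of the form: key "value";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')

# Tags the GTEx collapse script expects, as described in this issue.
REQUIRED_TAGS = ("gene_type", "transcript_type")


def parse_gtf_attributes(line):
    """Return the column-9 attributes of one GTF line as a dict.

    Hypothetical helper: if a key is repeated, the last value wins.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    return dict(ATTR_RE.findall(fields[8]))


def missing_tags(line, required=REQUIRED_TAGS):
    """List the required attribute keys that are absent from this line."""
    attrs = parse_gtf_attributes(line)
    return [tag for tag in required if tag not in attrs]
```

Run on the transcript line above, `missing_tags` reports both `gene_type` and `transcript_type` as absent, while `parse_gtf_attributes` shows the biotype stored under `gene_ebi_biotype`.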

Can you develop a utility in AGAT that can produce such feature-collapsed GTF files for use with rnaseqc, or maybe add the missing tags to the GTF if a GTF file is given?

Juke34 commented 9 months ago

transcript_type and gene_type are attributes defined by GENCODE. If you know on what basis those types can be deduced, then you should be able to add the specific attributes to the chosen features using e.g. agat_sp_manage_attributes.pl or agat_sq_add_attributes_from_tsv.pl.

Rohit-Satyam commented 8 months ago

Hi @Juke34. Thanks for the response. For the time being I got by with the two sed commands at the end of this comment, after receiving the following email from VEuPathDB:

Dear Rohit

Sorry for the delay.  I will pass along your request for GTF file format in our download. Thank you for the suggestion.  I asked our colleague at EBI about the tags and here is his comment.  

The biotypes usually use the same biotype name for a gene and transcript (e.g. “protein_coding”), while a more specific biotype (used in column 3) would have different names (e.g. “protein_coding_gene” and “mRNA”). That is why the transcript type is named “ebi_gene_type” and not “ebi_transcript_type”.
If you are ok with using the EBI generic terms, then yes you can use the conversion suggested. But if you want specific (SO) biotypes names, I think you should use the biotype from column 3 instead.

I hope this information is helpful.  

All the best,
Susanne

Susanne Warrenfeltz, PhD

Scientific Outreach Specialist

VEuPathDB, University of Georgia
sed -i 's/gene_ebi_biotype/transcript_type/g'  PlasmoDB-64_Pfalciparum3D7.gtf
sed -i 's/ebi_biotype/gene_type/g'  PlasmoDB-64_Pfalciparum3D7.gtf
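
For anyone who would rather not edit the file in place, the same renaming can be sketched in Python. This is an illustrative sketch under a few assumptions, not an AGAT feature: the names `RENAMES`, `rename_attribute_keys` and `rename_file` are hypothetical, the input is assumed to be tab-separated GTF, and unlike the sed commands it only rewrites whole attribute keys in column 9, so values or comments that happen to contain `ebi_biotype` are left alone.

```python
import re

# Old attribute key -> new key, mirroring the two sed commands above.
RENAMES = {
    "gene_ebi_biotype": "transcript_type",
    "ebi_biotype": "gene_type",
}

# Match a key only when it starts an attribute (preceded by start of the
# column, whitespace, or ';') and is followed by a quoted value.
KEY_RE = re.compile(
    r'(?<![^\s;])(' + "|".join(map(re.escape, RENAMES)) + r')(?=\s+")'
)


def rename_attribute_keys(line):
    """Rename the biotype keys in column 9 of one GTF line."""
    if line.startswith("#"):
        return line  # leave header/comment lines untouched
    fields = line.split("\t")
    if len(fields) < 9:
        return line  # not a feature line
    fields[8] = KEY_RE.sub(lambda m: RENAMES[m.group(1)], fields[8])
    return "\t".join(fields)


def rename_file(src_path, dst_path):
    """Write a renamed copy of src_path to dst_path, line by line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(rename_attribute_keys(line))
```

Calling `rename_file("PlasmoDB-64_Pfalciparum3D7.gtf", "renamed.gtf")` (the output name is just an example) leaves the original file unchanged, which makes it easy to diff the result before passing it to the collapse script.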