DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

extracting splice_sites and exons from gff3 files #294

Open eduardopdev opened 3 years ago

eduardopdev commented 3 years ago

Hello, I am an undergrad student working with RNA-seq data. I use hisat2 in my pipeline, but for one of the species I am working with I only have annotation data in GFF3 format, and the scripts for splice-site and exon extraction don't work with GFF3 files.

What would you recommend for extracting this information from a GFF3 file? Do you have any scripts that can help me with this problem?

parkchanhee commented 3 years ago

There are several ways to convert a GFF3 file to GTF. gffread is a tool that can handle GFF/GTF files. Please check the following link for more information: https://github.com/gpertea/gffread

eduardopdev commented 3 years ago

Thanks for the reply @parkchanhee :). I already tried gffread and agat, but the output files come out chopped. In your experience, does the output from gffread always contain the same amount of data as the input? What I mean is: for any tuple (Chr, start, end, strand) in the input file, is that tuple always present in the output file?
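One way to check this directly is to collect the exon tuples from both files and diff them. The following is a minimal sketch, not part of hisat2 or gffread; the function names `exon_tuples` and `missing_exons` are made up for illustration, and it assumes both files are tab-separated with nine columns and label exon rows as `exon` in column 3.

```python
def exon_tuples(lines):
    """Collect (chrom, start, end, strand) for every exon row.

    Works on any iterable of lines from a GFF3 or GTF file, since both
    formats share the first eight columns.
    """
    tuples = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are compared here
        tuples.add((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return tuples


def missing_exons(gff3_path, gtf_path):
    """Return exon tuples present in the GFF3 file but absent from the GTF file."""
    with open(gff3_path) as gff3, open(gtf_path) as gtf:
        return exon_tuples(gff3) - exon_tuples(gtf)
```

An empty set from `missing_exons("input.gff3", "converted.gtf")` (placeholder file names) would mean every exon tuple survived the conversion, even if the two files differ in size because other feature types or attributes were dropped.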

parkchanhee commented 3 years ago

@eduardopdev By default, gffread generates a simplified output file that keeps only basic attributes. You can use options to control this. (I'm not sure whether other programs behave the same way.) Some features or attributes may not be converted to the other format, so the output file may differ in size after conversion. hisat2_extract_exons.py and hisat2_extract_splice_sites.py only process exon features that have a 'transcript_id' or a 'gene_id' attribute. I think the tuple will be preserved as long as the feature is not filtered out.
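To illustrate the filtering idea on a GFF3 file directly, here is a rough sketch that groups exon features by transcript and derives the junctions between consecutive exons. It is not the code of the hisat2 scripts. It assumes the GFF3 file links exons to transcripts through the `Parent` attribute (the usual GFF3 counterpart of `transcript_id`), and the names `parse_attributes` and `junctions_from_gff3` are hypothetical. Coordinates are reported exactly as they appear in the file (1-based), so any coordinate convention HISAT2 expects for its splice-site input would still need to be checked.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def junctions_from_gff3(lines):
    """Return sorted (chrom, exon_end, next_exon_start, strand) tuples."""
    exons_by_transcript = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are used
        parents = parse_attributes(fields[8]).get("Parent")
        if not parents:
            continue  # exon with no transcript link is filtered out
        # GFF3 allows several comma-separated parents for one feature.
        for parent in parents.split(","):
            key = (fields[0], fields[6], parent)
            exons_by_transcript[key].append((int(fields[3]), int(fields[4])))

    junctions = set()
    for (chrom, strand, _), exons in exons_by_transcript.items():
        exons.sort()
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
            if next_start > prev_end:  # ignore overlapping exons
                junctions.add((chrom, prev_end, next_start, strand))
    return sorted(junctions)
```

Exons without a `Parent` attribute are skipped here, which mirrors the point above: a tuple only disappears when its feature is filtered out, not because its coordinates are changed.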