Regarding CNN_H99 GTF file

varamhcu commented 2 years ago

Hello,

First of all, thank you very much for such wonderful code. It is very helpful for us.

We are working on Cryptococus neoformans (H99 strain) model species. Only the GFF file is provided. So, we converted it to GTF file using a custom python code. Here I am pasting a few header lines of the converted GTF file. Please check it and let us know if any changes are needed.

Head of the GTF file: CP003824.1 gmapidx gene 409610 410982 . + . gene_id "CNAG_01397"; transcript_id "CNAG_01397"; CP003824.1 gmapidx transcript 409610 410982 . + . gene_id "CNAG_01397"; transcript_id "CNAG_01397"; CP003824.1 gmapidx exon 409610 410982 . + . gene_id "CNAG_01397"; transcript_id "CNAG_01397"; CP003824.1 gmapidx gene 409610 410982 . + . gene_id "CNAG_01397-t26_1"; transcript_id "CNAG_01397-t26_1"; CP003824.1 gmapidx transcript 409610 410982 . + . gene_id "CNAG_01397-t26_1"; transcript_id "CNAG_01397-t26_1"; CP003824.1 gmapidx exon 409610 410982 . + . gene_id "CNAG_01397-t26_1"; transcript_id "CNAG_01397-t26_1"; CP003824.1 gmapidx gene 409610 410051 . + . gene_id "exon_CNAG_01397-t26_1-E1"; transcript_id "exon_CNAG_01397-t26_1-E1"; CP003824.1 gmapidx transcript 409610 410051 . + . gene_id "exon_CNAG_01397-t26_1-E1"; transcript_id "exon_CNAG_01397-t26_1-E1"; CP003824.1 gmapidx exon 409610 410051 . + . gene_id "exon_CNAG_01397-t26_1-E1"; transcript_id "exon_CNAG_01397-t26_1-E1"; CP003824.1 gmapidx gene 410113 410259 . + . gene_id "exon_CNAG_01397-t26_1-E2"; transcript_id "exon_CNAG_01397-t26_1-E2"; CP003824.1 gmapidx transcript 410113 410259 . + . gene_id "exon_CNAG_01397-t26_1-E2"; transcript_id "exon_CNAG_01397-t26_1-E2"; CP003824.1 gmapidx exon 410113 410259 . + . gene_id "exon_CNAG_01397-t26_1-E2"; transcript_id "exon_CNAG_01397-t26_1-E2";

Thank you

YumaIshigami commented 2 years ago

Your GFF to GTF conversion script seems to be inserting the "gene" and "transcript" features before every single "exon" feature, which makes the gtfbamcount.py recognize them as single exon genes without introns. In this gtfbamcount.py script, the GTF file needs to be sorted so that the "exon" features are in order and not interrupted by "transcript" features.

Please convert the gff file to gtf format by the gffread tool (which can be installed from conda), as like: gffread FungiDB-53_CneoformansH99.gff -T -o FungiDB-53_CneoformansH99.gtf I modified the script to accept GTF files with its minus strand exons sorted in ascending order. You should be able to obtain the count file by: python gtfbamcount.py sample1.bam FungiDB-53_CneoformansH99.gtf sample1.count.txt -m

I tested with public C. neoformans RNA-seq data and obtained the count file and applied it to irsseq.py. Please note that this script currently independently counts introns shared by two or more isoforms, so there will be overlaps that possess the same coordinate and junction counts that you might want to remove afterward.

varamhcu commented 2 years ago

The gtf file after processing through gffread worked.

Since it worked without any error, I am closing this issue.

Thank you very much for the replies.

YumaIshigami / irscalc

Regarding CNN_H99 GTF file #2