Closed by poorvam 2 years ago
The behavior for the GTF processor is to use gene_id from the attributes field as the gene identifier (it's extracted with the regular expression gene_id \"(.*?)\";). If that's giving the transcript ID, I'd actually consider that GTF to be malformed. GTF's already a pretty tough format because of the... less than ideal format spec. There won't be a lot of flexibility to work around malformed files as a result.
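For anyone who wants to check what that regular expression pulls out of their own file, here is a minimal sketch. It is not the package's actual code: the function name extract_gene_id is hypothetical, and the example record is the GENCODE transcript line quoted later in this thread with its attribute list abbreviated.

```python
import re

# Same pattern as quoted in the reply; the non-greedy group stops at the
# first closing quote after gene_id.
GENE_ID_PATTERN = re.compile(r'gene_id "(.*?)";')


def extract_gene_id(gtf_line):
    """Return the gene_id of one GTF record, or None if the tag is missing.

    Hypothetical helper for illustration only.
    """
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        # Not a complete 9-column GTF record
        return None
    match = GENE_ID_PATTERN.search(fields[8])
    return match.group(1) if match else None


# Transcript record from the GENCODE file (attributes abbreviated)
example_record = "\t".join([
    "chr1", "HAVANA", "transcript", "3073253", "3074322", ".", "+", ".",
    'gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; '
    'gene_name "4933401J01Rik";',
])

print(extract_gene_id(example_record))
```

On a well-formed record this should print the gene ID rather than the transcript ID, because the pattern only matches the gene_id tag.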
This is a sample of the loaded mm10 GTF file (as an example). The gene_name field is what will be used when writing out the final matrices.
       gene_name      start      end        seqname  strand  TSS
22534  Wbp5           136245080  136247139  chrX     +       136245080
17565  Rbm34          126947173  126971079  chr8     -       126971079
9347   Golga7b        42247578   42270348   chr19    +       42247578
1167   4930570G19Rik  156546753  156561746  chr3     -       156561746
9416   Gpr113         30193431   30205722   chr5     -       30205722
14701  Olfr1312       112042077  112043030  chr2     -       112043030
7556   Folh1          86718976   86775864   chr7     -       86775864
12790  Mir421         77427214   142539883  chr15    +       77427214
10664  Iqch           63421620   63602448   chr9     -       63602448
16687  Ppil1          29250835   29263971   chr17    -       29263971
Thank you for your reply. But the GTF file I downloaded from the GENCODE website (https://www.gencodegenes.org/mouse/release_M10.html) has not just gene but also transcript information (Select - Comprehensive gene annotation - Primary regions). Where did you download your GTF file from?
I'm not sure what the issue is - I usually use the NCBI annotations, but looking at the GENCODE file:
chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1 HAVANA exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
ENSMUSG00000102693.1 is a systematic identifier for gene 4933401J01Rik and appears to be correctly entered into the gene_id tag as it should be. The transcript ID is ENSMUST00000193812.1.
It may be worth building a translation table to go from the Ensembl systematic ID to another gene name (e.g. NCBI or common name), although it's usually best to do that translation after you've done all the bioinformatics, in my experience (e.g. just translate figure labels as needed).
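One way such a translation table could be built is sketched below, assuming the GENCODE layout shown above, where each "gene" record carries both a gene_id and a gene_name tag. The function names build_id_to_name and relabel are hypothetical and are not part of inferelator_prior.

```python
import re

GENE_ID = re.compile(r'gene_id "(.*?)";')
GENE_NAME = re.compile(r'gene_name "(.*?)";')


def build_id_to_name(gtf_lines):
    """Map gene_id to gene_name using only the 'gene' records of a GTF.

    Hypothetical helper; assumes tab-separated, 9-column records.
    """
    table = {}
    for line in gtf_lines:
        if line.startswith("#"):
            # Skip header / comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            table[gene_id.group(1)] = gene_name.group(1)
    return table


def relabel(ids, table):
    """Translate labels, leaving any ID without an entry unchanged."""
    return [table.get(i, i) for i in ids]


# The "gene" record from the GENCODE excerpt above
gene_record = "\t".join([
    "chr1", "HAVANA", "gene", "3073253", "3074322", ".", "+", ".",
    'gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; '
    'gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";',
])

id_to_name = build_id_to_name([gene_record])
print(relabel(["ENSMUSG00000102693.1"], id_to_name))
```

Applying relabel only to figure or matrix labels at the very end keeps the systematic IDs intact through the analysis, which matches the suggestion above. With a real file, the lines would come from iterating over the opened GTF.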
Hi,
inferelator_prior.network_from_motifs uses the transcript ID from the GTF file when writing the gene x TF matrix. Is there an option to use the gene name from the GTF file instead?
Thank you, Poorva