flatironinstitute / inferelator-prior

Gene regulatory network inference using DNA-binding motifs and chromatin accessibility data.
MIT License

inferelator_prior.network_from_motifs use gene name #11

Closed poorvam closed 2 years ago

poorvam commented 2 years ago

Hi,

inferelator_prior.network_from_motifs uses the transcript ID from the GTF file when writing the gene x TF matrix. Is there an option to use the gene name from the GTF file instead?

Thank you, Poorva

asistradition commented 2 years ago

The behavior for the GTF processor is to use gene_id from the attributes field as the gene identifier (it's extracted with a regular expression gene_id \"(.*?)\";). If that's giving the transcript ID, I'd actually consider that GTF to be malformed. GTF's already a pretty tough format because of the... less than ideal format spec. There won't be a lot of flexibility to work around malformed files as a result.
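To make the extraction step concrete, here is a minimal sketch that applies the regular expression quoted above to a single attributes string. It is not the package's own code: the helper name `extract_gene_id` is hypothetical, and the example string is an abbreviated GENCODE-style attributes field.

```python
import re

# Pattern quoted in the comment above. The non-greedy group stops at the
# first closing quote, so only the gene_id value is captured.
GENE_ID_PATTERN = re.compile(r'gene_id "(.*?)";')


def extract_gene_id(attributes):
    """Return the gene_id value from a GTF attributes string, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = GENE_ID_PATTERN.search(attributes)
    return match.group(1) if match else None


# Abbreviated attributes field from a transcript row (GENCODE style)
example = (
    'gene_id "ENSMUSG00000102693.1"; '
    'transcript_id "ENSMUST00000193812.1"; '
    'gene_name "4933401J01Rik";'
)

print(extract_gene_id(example))  # ENSMUSG00000102693.1
```

Because the pattern is anchored on the literal `gene_id` key, a `transcript_id` in the same attributes field is never picked up.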

Here is a sample of the loaded mm10 GTF file, as an example. The gene_name column, which holds the identifier extracted from gene_id, is what will be used when writing out the final matrices.

           gene_name      start        end seqname strand        TSS
22534           Wbp5  136245080  136247139    chrX      +  136245080
17565          Rbm34  126947173  126971079    chr8      -  126971079
9347         Golga7b   42247578   42270348   chr19      +   42247578
1167   4930570G19Rik  156546753  156561746    chr3      -  156561746
9416          Gpr113   30193431   30205722    chr5      -   30205722
14701       Olfr1312  112042077  112043030    chr2      -  112043030
7556           Folh1   86718976   86775864    chr7      -   86775864
12790         Mir421   77427214  142539883   chr15      +   77427214
10664           Iqch   63421620   63602448    chr9      -   63602448
16687          Ppil1   29250835   29263971   chr17      -   29263971

poorvam commented 2 years ago

Thank you for your reply. However, the GTF file I downloaded from the GENCODE website (https://www.gencodegenes.org/mouse/release_M10.html) contains transcript information as well as gene information (select Comprehensive gene annotation, Primary regions). Where did you download your GTF file from?

asistradition commented 2 years ago

I'm not sure what the issue is - I usually use the NCBI annotations, but looking at the GENCODE file:

chr1    HAVANA  gene    3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1    HAVANA  transcript  3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1    HAVANA  exon    3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

ENSMUSG00000102693.1 is a systematic identifier for gene 4933401J01Rik and appears to be correctly entered into the gene_id tag as it should be. The transcript ID is ENSMUST00000193812.1.

It may be worth building a translation table to go from the Ensembl systematic ID to another gene name (e.g. NCBI or common name), although in my experience it's usually best to do that translation after you've done all the bioinformatics (e.g. just translate figure labels as needed).
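One way to build such a translation table is sketched below, using only the standard library. The function name `build_translation_table` and the label list are hypothetical, and the sketch assumes a tab-delimited GTF whose `gene` rows carry both `gene_id` and `gene_name` attributes, as in the GENCODE lines above.

```python
import re

GENE_ID = re.compile(r'gene_id "(.*?)";')
GENE_NAME = re.compile(r'gene_name "(.*?)";')


def build_translation_table(gtf_lines):
    """Map gene_id -> gene_name using only the 'gene' feature rows.

    Hypothetical helper; assumes tab-delimited GTF lines with 9 fields.
    """
    table = {}
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            table[gene_id.group(1)] = gene_name.group(1)
    return table


# Two abbreviated rows modelled on the GENCODE lines above
gtf_lines = [
    'chr1\tHAVANA\tgene\t3073253\t3074322\t.\t+\t.\t'
    'gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2;',
    'chr1\tHAVANA\ttranscript\t3073253\t3074322\t.\t+\t.\t'
    'gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; '
    'gene_name "4933401J01Rik";',
]

table = build_translation_table(gtf_lines)

# Translate labels after the analysis is done; unmapped IDs are left unchanged.
labels = ["ENSMUSG00000102693.1", "unmapped_id"]
translated = [table.get(label, label) for label in labels]
print(translated)  # ['4933401J01Rik', 'unmapped_id']
```

Falling back to the original ID for unmapped labels keeps the translation lossless, which matters if several systematic IDs share one common name or have none at all.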