broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data
MIT License

How to output gene_ids in DGE matrix instead of gene symbols/names #53

Closed JulianSpagnuolo closed 5 years ago

JulianSpagnuolo commented 5 years ago

I would prefer that the DGE matrix output by DigitalExpression in the last step used Ensembl gene/transcript IDs or Entrez IDs (depending on where I sourced my GTF from) instead of gene symbols. It isn't clear from the documentation whether this is something I can change.

jamesnemesh commented 5 years ago

This isn’t an option. To do this, we’d need to change the tagging step to use the alternative IDs instead.

-Jim

grst commented 5 years ago

Hi @jamesnemesh, I'd also be interested in this.

As a workaround, could we just hack the GTF file to contain ENSG in the gene_name field, or would that lead to wrong results?

I.e. change

gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";

to

gene_id "ENSG00000223972"; gene_version "5"; gene_name "ENSG00000223972";

Best, Gregor

CC @Hoohm
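A minimal sketch of the GTF edit proposed above, for anyone who wants to try it. It copies each record's gene_id value into its gene_name attribute. The function names and file paths are hypothetical, and the thread does not confirm that the Drop-seq tools give correct results on a GTF modified this way, so treat the output as experimental.

```python
import re

# Attribute patterns for the 9th GTF column. \b keeps us from matching
# attributes such as ref_gene_id or ref_gene_name.
GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'\bgene_name "[^"]*"')


def swap_gene_name_for_id(line: str) -> str:
    """Return the GTF line with gene_name set to the gene_id value.

    Comment lines, malformed lines, and records without a gene_id are
    returned unchanged. Records lacking gene_name get one appended.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line
    attrs = fields[8]
    match = GENE_ID_RE.search(attrs)
    if match is None:
        return line
    replacement = f'gene_name "{match.group(1)}"'
    if GENE_NAME_RE.search(attrs):
        # A function replacement avoids any backslash interpretation.
        attrs = GENE_NAME_RE.sub(lambda _: replacement, attrs, count=1)
    else:
        attrs = attrs.rstrip()
        if not attrs.endswith(";"):
            attrs += ";"
        attrs = f"{attrs} {replacement};"
    fields[8] = attrs
    return "\t".join(fields) + newline


def rewrite_gtf(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path with gene_name replaced by gene_id."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(swap_gene_name_for_id(line))
```

Calling `rewrite_gtf("annotation.gtf", "annotation.gene_id_as_name.gtf")` (both paths are placeholders) leaves the original file untouched, so the two annotations can be compared before committing to either.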

JulianSpagnuolo commented 5 years ago

Hi all,

After some consideration and playing around with the GTF files (I’m using the latest Ensembl release for the human genome), I noticed that some gene_names have multiple gene_ids. Some of these are deprecated gene_ids; others are duplicates for other reasons (?). But since there are only a few (at least in my case), they are easy enough to go through and manually annotate with the gene_id. So for the moment, I think the standard solution offered by James & co. is the safest.
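For anyone who wants to find those cases without scanning the file by eye, here is a rough sketch that lists every gene_name associated with more than one gene_id. The function name is hypothetical, and it assumes a tab-separated GTF with gene_id and gene_name in the attribute column.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'\bgene_name "([^"]+)"')


def ambiguous_gene_names(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each gene_name seen with more than one gene_id to those IDs."""
    ids_by_name: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        gene_id = GENE_ID_RE.search(fields[8])
        gene_name = GENE_NAME_RE.search(fields[8])
        if gene_id and gene_name:
            ids_by_name[gene_name.group(1)].add(gene_id.group(1))
    # Keep only the names that are not one-to-one with a gene_id.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

Passing an open file handle works, since the function only iterates over lines. The resulting dictionary is the list of symbols that would need manual annotation.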

This is only an issue because it is easier/safer to retrieve functional annotations from the ID than from a symbol, which is often a synonym or another non-stable identifier. I suffered this nightmare with the GeCKO v2 CRISPR gRNA library, which is full of deprecated gene symbols.

James - thanks for the toolset!

Kind regards, Julian

jamesnemesh commented 5 years ago

This is an annoying problem. I think the intention of the more biologist-focused people was the hope that gene symbols would be more comparable across minor changes to GTF versions than Ensembl IDs, which were my first choice for DGE output. It’s pretty annoying that a gene symbol shared across chromosomes (and there are a handful of those) gets removed, so there are tradeoffs.
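To see which symbols are affected in a given annotation, a small sketch like the following lists gene_names that appear on more than one sequence (column 1 of the GTF). The function name is hypothetical, and this only flags candidates; it does not reproduce whatever filtering the Drop-seq tools actually apply.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

GENE_NAME_RE = re.compile(r'\bgene_name "([^"]+)"')


def names_on_multiple_contigs(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each gene_name found on more than one contig to those contigs."""
    contigs_by_name: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        gene_name = GENE_NAME_RE.search(fields[8])
        if gene_name:
            contigs_by_name[gene_name.group(1)].add(fields[0])
    return {n: c for n, c in contigs_by_name.items() if len(c) > 1}
```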

I’ll keep this in mind if I have time to go back to the tagger to allow use of another field during tagging so a user gets gene symbol by default, but can leverage another identifier.

-Jim
