Closed JulianSpagnuolo closed 5 years ago
This isn’t an option. To do this, we’d need to change the tagging step to use the alternative IDs instead.
-Jim
On Nov 2, 2018, at 12:21 PM, Julian Spagnuolo notifications@github.com wrote:
I would prefer that the DGE matrix output by DigitalExpression in the last step returned a matrix of counts that used the Ensembl gene/transcript IDs or Entrez IDs (depending on where I sourced my GTF from). It isn't clear from the documentation whether this is something I can change.
Hi @jamesnemesh, I'd also be interested in this.
As a workaround, could we just hack the GTF file to contain ENSG in the gene_name field, or would that lead to wrong results?
I.e. change
gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";
to
gene_id "ENSG00000223972"; gene_version "5"; gene_name "ENSG00000223972";
Best, Gregor
CC @Hoohm
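The GTF edit proposed above can be sketched in a few lines of Python. This is only an illustration of the text substitution: the function name `gene_name_to_gene_id` is hypothetical, and the sketch assumes Ensembl-style attributes of the form `key "value";`. The thread does not confirm whether the Drop-seq tools give correct results on a GTF modified this way.

```python
import re

# Assumes attributes are written as: key "value";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "[^"]*"')


def gene_name_to_gene_id(line: str) -> str:
    """Return a GTF line with its gene_name value overwritten by its gene_id.

    Comment lines, lines without a gene_id, and lines without a gene_name
    attribute are returned unchanged.
    """
    if line.startswith("#"):
        return line
    match = GENE_ID_RE.search(line)
    if match is None:
        return line
    gene_id = match.group(1)
    # A function replacement avoids any backslash interpretation in the ID.
    return GENE_NAME_RE.sub(lambda _: f'gene_name "{gene_id}"', line)


if __name__ == "__main__":
    attrs = 'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";'
    print(gene_name_to_gene_id(attrs))
```

Applied line by line to a GTF file, this turns the first attribute string quoted above into the second. It does not address gene symbols or IDs that are ambiguous in the annotation.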
Hi all,
After some consideration and playing around with the GTF files (I’m using the latest Ensembl release for the human genome), I noticed that some gene_names map to multiple gene_ids. Some of these are deprecated gene_ids; others are duplicates for other reasons (?). Since there are only a few of them (at least in my case), they are easy enough to go through and annotate manually with the gene_id. So for the moment, I think the standard solution offered by James & co. is the safest.
This is only an issue because it is easier and safer to retrieve functional annotations from the ID than from a symbol, which is often a synonym or another non-stable identifier. I suffered this nightmare with the GeCKO v2 CRISPR gRNA library, which is full of deprecated gene symbols.
James - thanks for the toolset!
Kind regards, Julian
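For anyone who wants to repeat the check described above, here is a minimal Python sketch that lists the gene_names associated with more than one gene_id. The function name `names_with_multiple_ids` is hypothetical, and the sketch assumes Ensembl-style attributes of the form `key "value";`. It only reports the ambiguous symbols and does not decide which gene_id is the right one.

```python
import re
from collections import defaultdict

# Assumes attributes are written as: key "value";
ATTR_RE = re.compile(r'\b(gene_id|gene_name) "([^"]+)"')


def names_with_multiple_ids(gtf_lines):
    """Map each gene_name seen with more than one gene_id to its sorted IDs."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        attrs = dict(ATTR_RE.findall(line))
        if "gene_id" in attrs and "gene_name" in attrs:
            ids_by_name[attrs["gene_name"]].add(attrs["gene_id"])
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_gtf.py annotation.gtf  (file name is a placeholder)
    with open(sys.argv[1]) as handle:
        for name, ids in sorted(names_with_multiple_ids(handle).items()):
            print(name, ",".join(ids), sep="\t")
```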
This is an annoying problem. I think the more biologist-focused people hoped that gene symbols would be more comparable across minor changes to GTF versions than Ensembl IDs, which were my first choice for DGE output. It’s pretty annoying when a gene symbol that is the same on every chromosome (and there are a handful of those) gets removed, so there are tradeoffs.
I’ll keep this in mind if I have time to go back to the tagger and allow another field to be used during tagging, so that a user gets the gene symbol by default but can leverage another identifier.
-Jim