genome-nexus / genome-nexus-importer

Import data into MongoDB for use by https://github.com/genome-nexus/genome-nexus/
MIT License

Hotspots data not grch38 compatible #58

Closed pieterlukasse closed 2 years ago

pieterlukasse commented 2 years ago

Hotspots data is not ported to grch38 yet.

E.g. these two files still contain exactly the same transcript ids for both grch37 and grch38:

We need to update the grch38 version to contain the updated transcript ids.

E.g. for BRAF this would probably be ENST00000646891 instead of ENST00000288602. See also: https://ensembl.org/homo_sapiens/Transcript/Summary?t=ENST00000288602
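A quick way to confirm that two hotspots files carry the same transcript ids is to compare the sets of ids they contain. The sketch below is only an illustration: the column names `hugo_symbol` and `transcript_id`, the tab-separated layout, and the inline sample rows are assumptions, not the actual layout of the hotspots files.

```python
import csv
import io


def transcript_ids(handle, column="transcript_id"):
    """Collect the distinct transcript ids from a tab-separated file.

    The column name is an assumption; adjust it to the real header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


# Inline stand-ins for the grch37 and grch38 hotspots files (illustrative rows only).
grch37 = io.StringIO("hugo_symbol\ttranscript_id\nBRAF\tENST00000288602\n")
grch38 = io.StringIO("hugo_symbol\ttranscript_id\nBRAF\tENST00000288602\n")

ids37 = transcript_ids(grch37)
ids38 = transcript_ids(grch38)

# If the grch38 file had been ported, some ids would be expected to differ.
identical = ids37 == ids38
print("identical transcript id sets:", identical)
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles.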

pieterlukasse commented 2 years ago

@inodb, @zhx828, here are my latest findings:

Is this a good summary? If yes, where can we get this mapping file?

inodb commented 2 years ago

Thanks so much @pieterlukasse ! That looks correct to me

This should have the grch38 mappings you are looking for:

https://github.com/genome-nexus/genome-nexus-importer/blob/master/data/common_input/isoform_overrides_at_mskcc_grch38.txt

zhx828 commented 2 years ago

@pieterlukasse I only have the mapping for OncoKB ~700 genes. But I remember I went through the hotspot list using the grch38 override list and only MYD88 is affected. So using the file Ino pointed above should be ok.
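The check described above (walking the hotspot list against the grch38 override list to see which genes change) could be sketched roughly as follows. The function name, the dictionaries, and the placeholder ids are all hypothetical; this is not the script that was actually used.

```python
def affected_genes(hotspot_transcripts, overrides):
    """Return genes whose hotspot transcript differs from the override transcript.

    Genes that are absent from the override list are treated as unaffected.
    """
    return sorted(
        gene
        for gene, transcript in hotspot_transcripts.items()
        if gene in overrides and overrides[gene] != transcript
    )


# Placeholder gene names and ids, not real Ensembl accessions.
hotspots = {"GENE_A": "ENST_A1", "GENE_B": "ENST_B1", "GENE_C": "ENST_C1"}
overrides = {"GENE_A": "ENST_A1", "GENE_B": "ENST_B2"}

changed = affected_genes(hotspots, overrides)
print(changed)
```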

pieterlukasse commented 2 years ago

@inodb I noticed this gene is duplicated in the mapping you shared: CDKN2A

ENST00000304494 CDKN2A  NM_000077.4
ENST00000579755 CDKN2A  NM_058195.3

The first transcript is the canonical in grch38, and is also returned when using mskcc as the isoformOverrideSource here: https://grch38.genomenexus.org/swagger-ui.html#!/ensembl45controller/fetchCanonicalEnsemblTranscriptByHugoSymbolGET, so I'm guessing we can just remove the second one, correct?
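Duplicates like this one can be found mechanically. The sketch below assumes each data line is whitespace-separated with the transcript id first and the gene symbol second, as in the two rows quoted above; the helper name and the third sample row are made up for illustration.

```python
from collections import defaultdict


def duplicated_genes(lines):
    """Map each gene that appears on more than one line to its transcript ids."""
    by_gene = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, comment lines, and lines without a gene column.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        transcript, gene = fields[0], fields[1]
        by_gene[gene].append(transcript)
    return {gene: ids for gene, ids in by_gene.items() if len(ids) > 1}


sample = [
    "ENST00000304494 CDKN2A  NM_000077.4",
    "ENST00000579755 CDKN2A  NM_058195.3",
    "ENST_PLACEHOLDER GENE_X NM_PLACEHOLDER",  # made-up row for contrast
]

dups = duplicated_genes(sample)
print(dups)
```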

pieterlukasse commented 2 years ago

@inodb , @leexgh looking at the mapping file https://github.com/genome-nexus/genome-nexus-importer/blob/master/data/grch38_ensembl95/export/ensembl_biomart_canonical_transcripts_per_hgnc.txt, I noticed that some genes do not have a value in ensembl_canonical_gene, but do have ensembl transcript ids set for mskcc_canonical_transcript and uniprot_canonical_transcript. Can you please help me understand these cases?

[image: screenshot of rows in the mapping file with an empty ensembl_canonical_gene value]
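Rows of this kind can also be listed programmatically. The following is a minimal sketch that assumes the export is tab-separated with a header row; the column names `ensembl_canonical_gene`, `mskcc_canonical_transcript`, and `uniprot_canonical_transcript` come from the description above, while `hgnc_symbol` and the sample rows are assumptions.

```python
import csv
import io

TRANSCRIPT_COLUMNS = ("mskcc_canonical_transcript", "uniprot_canonical_transcript")


def rows_missing_gene(handle):
    """Return rows with an empty gene id but at least one transcript id set."""
    reader = csv.DictReader(handle, delimiter="\t")
    flagged = []
    for row in reader:
        gene = (row.get("ensembl_canonical_gene") or "").strip()
        has_transcript = any(
            (row.get(column) or "").strip() for column in TRANSCRIPT_COLUMNS
        )
        if not gene and has_transcript:
            flagged.append(row)
    return flagged


# Made-up rows; "hgnc_symbol" is an assumed name for the gene symbol column.
sample = io.StringIO(
    "hgnc_symbol\tensembl_canonical_gene\t"
    "mskcc_canonical_transcript\tuniprot_canonical_transcript\n"
    "GENE_OK\tENSG_X\tENST_X\tENST_X\n"
    "GENE_BAD\t\tENST_Y\tENST_Y\n"
    "GENE_EMPTY\t\t\t\n"
)

flagged = rows_missing_gene(sample)
flagged_symbols = [row["hgnc_symbol"] for row in flagged]
print(flagged_symbols)
```

Rows where every column is empty are deliberately not flagged, since the question here is only about transcripts that lack a gene id.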

inodb commented 2 years ago

@pieterlukasse:

> I noticed this gene is duplicated in the mapping you shared: CDKN2A
>
> ENST00000304494 CDKN2A  NM_000077.4
> ENST00000579755 CDKN2A  NM_058195.3
>
> The first transcript is the canonical in grch38, and is also returned when using mskcc as the isoformOverrideSource here: https://grch38.genomenexus.org/swagger-ui.html#!/ensembl45controller/fetchCanonicalEnsemblTranscriptByHugoSymbolGET, so I'm guessing we can just remove the second one, correct?

Yeah this is sort of a bug. The original vcf2maf that Genome Nexus is based on has a way to assign two canonical transcripts (in case one of them is not found for that particular variant). In our case for supporting cBioPortal this is not great (in the sense that protein changes would differ when querying for CDKN2A), so we only choose one. So yes let's stick with just ENST00000304494

> looking at the mapping file https://github.com/genome-nexus/genome-nexus-importer/blob/master/data/grch38_ensembl95/export/ensembl_biomart_canonical_transcripts_per_hgnc.txt, I noticed that some genes do not have a value in ensembl_canonical_gene, but do have ensembl transcript ids set for mskcc_canonical_transcript and uniprot_canonical_transcript. Can you please help me understand these cases?

Hmm this sounds like a bug that we should solve. Not sure how you can have an ensembl transcript without a gene? One of the examples you shared does seem to have an associated ENSG gene id: https://grch37.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000221972;r=3:133646992-133648656;t=ENST00000408895

pieterlukasse commented 2 years ago

@inodb thanks for following up on these questions. Regarding CDKN2A: I noticed that ENST00000304494 was already the transcript id for CDKN2A in the grch37 hotspots file and this stayed the same in the new grch38 file. So fortunately the hotspots results of this PR are not affected.

Regarding the missing ENSG values: I've downloaded the previous version (https://github.com/genome-nexus/genome-nexus-importer/blob/c7217e17f88991e0cdb2d27be91fe54c76e327a8/data/grch38_ensembl95/export/ensembl_biomart_canonical_transcripts_per_hgnc.txt) and it has the same/similar issues:

[image: screenshot of the same kind of rows in the previous version of the file]

It must be a bug in this code: https://github.com/genome-nexus/genome-nexus-importer/blob/master/scripts/make_one_canonical_transcript_per_gene.py

Interesting fact: the combination of an empty ensembl_canonical_gene and a non-empty transcript id only occurs for the columns uniprot_canonical_transcript and mskcc_canonical_transcript...

IMO we need a new ticket to fix both these issues.

pieterlukasse commented 2 years ago

New ticket: https://github.com/genome-nexus/genome-nexus-importer/issues/61

inodb commented 2 years ago

@pieterlukasse thanks for filing!

pieterlukasse commented 2 years ago

shall we close this one?