Normalize Reference Information for Genes

inodb commented 5 years ago

Currently we have a gene to transcript mapping in cBioPortal and there is one in Genome Nexus (https://github.com/genome-nexus/genome-nexus-importer/blob/master/data/grch37_ensembl92/export/ensembl_biomart_canonical_transcripts_per_hgnc.txt). We should try to have a single one.

[ ] Check if the Genome Nexus mappings work for cBioPortal
[ ] Check if the Genome Nexus mappings work for OncoKB
[ ] Regenerate the cBioPortal seed database
[ ] Normalize OncoKB genes

Further discussion:

When importing MAF into cBioPortal we use the Entrez ID. Is it possible that Entrez IDs are sometimes imported without an associated Hugo symbol? That would make the data impossible to query.
cBioPortal doesn't really have a gene to Ensembl transcript mapping, but rather gene to Entrez ID, where the Entrez ID represents a particular transcript (?)
datahub now validates the mutation file against a JSON dump from cBioPortal. That dump differs from live cBioPortal.
Should we use Hugo symbol in cBioPortal over Entrez ID?

Check:

[ ] Is it correct that most of our data has an Entrez ID? And how is it computed if it doesn't exist?
[ ] How many MAF files don't have genomic location information?

CC @sheridancbio @n1zea144 @zhx828 @jjgao

jjgao commented 5 years ago

Thanks, @inodb.

Just copied all the columns below and let's discuss which ones to keep or add.

hgnc_symbol	KMT2B	KMT2D
ensembl_canonical_gene	ENSG00000272333	ENSG00000167548
ensembl_canonical_transcript	ENST00000222270	ENST00000301067
genome_nexus_canonical_transcript	ENST00000222270	ENST00000301067
uniprot_canonical_transcript	ENST00000420124	ENST00000301067
mskcc_canonical_transcript	ENST00000222270	ENST00000301067
hgnc_id	HGNC:15840	HGNC:7133
approved_name	lysine methyltransferase 2B	lysine methyltransferase 2D
locus_group	protein-coding gene	protein-coding gene
locus_type	gene with protein product	gene with protein product
status	Approved	Approved
chromosome	19q13.12	12q13.12
location_sortable	19q13.12	12q13.12
synonyms	KIAA0304, MLL2, TRX2, HRX2, WBP7, MLL1B, MLL4, CXXC10	ALR, MLL4, CAGL114
alias_name	myeloid/lymphoid or mixed-lineage leukemia (trithorax homolog, Drosophila) 4, Histone-lysine N-methyltransferase 2B	histone-lysine N-methyltransferase 2D
previous_symbols		TNRC21, MLL2
prev_name	lysine (K)-specific methyltransferase 2B	trinucleotide repeat containing 21, myeloid/lymphoid or mixed-lineage leukemia 2, lysine (K)-specific methyltransferase 2D
gene_family	PHD finger proteins, Zinc fingers CXXC-type, Lysine methyltransferases, SET domain containing	PHD finger proteins, Lysine methyltransferases, Trinucleotide repeat containing, SET domain containing
gene_family_id	88, 136, 487, 1399	88, 487, 775, 1399
date_approved_reserved	5/9/13	10/14/98
date_symbol_changed		5/9/13
date_name_changed	2/12/16	2/12/16
date_modified	3/6/18	3/6/18
entrez_gene_id	9757	8085
vega_id	OTTHUMG00000048119	OTTHUMG00000166524
ucsc_id
accession_numbers	AJ007041	AF010403
refseq_ids	NM_014727	NM_003482
ccds_id	CCDS46055	CCDS44873
uniprot_id	Q9UMN6	O14686
pubmed_id	10409430, 10637508	9247308
mgd_id	MGI:109565	MGI:2682319
rgd_id	RGD:7678027	RGD:2324324
lsdb
cosmic		KMT2D
omim_id	606834	602113
mirbase
homeodb
snornabase
bioparadigms_slc
orphanet		239011
pseudogene.org
horde_id
merops
imgt
iuphar	objectId:2689	objectId:2691
kznf_gene_catalog
mamit-trnadb
cd
lncrnadb
enzyme_id
intermediate_filament_db
rna_central_ids
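
One use for a table like this is flagging genes whose canonical transcript differs between sources (for KMT2B above, the UniProt transcript differs from the other three). A minimal sketch, assuming the table is stored as tab-separated text with one attribute per row and one gene per column; the function names are illustrative and not part of any cBioPortal or Genome Nexus tool.

```python
import csv
import io

# Excerpt of the table above: one attribute per row, one gene per column.
TRANSPOSED_TSV = (
    "hgnc_symbol\tKMT2B\tKMT2D\n"
    "ensembl_canonical_transcript\tENST00000222270\tENST00000301067\n"
    "genome_nexus_canonical_transcript\tENST00000222270\tENST00000301067\n"
    "uniprot_canonical_transcript\tENST00000420124\tENST00000301067\n"
    "mskcc_canonical_transcript\tENST00000222270\tENST00000301067\n"
)


def load_transposed(text):
    """Return {gene_symbol: {attribute: value}} from an attribute-per-row table."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    symbols = rows[0][1:]
    genes = {symbol: {} for symbol in symbols}
    for row in rows[1:]:
        if not row:
            continue
        attribute, values = row[0], row[1:]
        # Attributes with no values may have fewer columns; pad them.
        values += [""] * (len(symbols) - len(values))
        for symbol, value in zip(symbols, values):
            genes[symbol][attribute] = value
    return genes


def transcript_disagreements(genes):
    """Return {symbol: {source_column: transcript}} where the sources differ."""
    result = {}
    for symbol, attrs in genes.items():
        transcripts = {
            name: value
            for name, value in attrs.items()
            if name.endswith("_canonical_transcript") and value
        }
        if len(set(transcripts.values())) > 1:
            result[symbol] = transcripts
    return result


genes = load_transposed(TRANSPOSED_TSV)
disagreements = transcript_disagreements(genes)
print(sorted(disagreements))  # ['KMT2B']
```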

jjgao commented 5 years ago

I looked at our cBioPortal database, and I think the data above covers everything we need. (We don't have length, but I think we can now remove the LENGTH column in the GENE table; it was previously used in the Mutated Genes tab.)

@n1zea144 could someone in your group also take a look, e.g. check that all genes in the current portal database are covered?

@zhx828 could you check if all genes in oncokb are covered?

zhx828 commented 5 years ago

@inodb @jjgao I think that, due to recent gene updates in the portal, some genes in OncoKB no longer match GN and the portal. I will need to update the genes in the next release. https://docs.google.com/spreadsheets/d/1mqmH1ccKWli7te7L8v0lIQh6uWbnSf2gu07l76RL-gI/edit?usp=sharing

n1zea144 commented 5 years ago

I will ask one of the curators (probably @rmadupuri) to take a look at gene coverage.

zhx828 commented 5 years ago

@inodb These two genes in GN use different UniProt isoforms compared to the vcf2maf UniProt file:

gene	isoform
HIST1H2BO	ENST00000616182
ARID3B	ENST00000622429

jjgao commented 5 years ago

@inodb @zhx828 @n1zea144 @rmadupuri please prioritize this one. This will fix a couple of existing issues, e.g. #5910 and cBioPortal/datahub#540 and will give us a clean start to re-import all studies.

rmadupuri commented 5 years ago

Hi @jjgao, the following genes in the database did not have matches in GN. I divided them into 4 sheets. https://docs.google.com/spreadsheets/d/1JCx12E86TGbMydRuzwFSUVgJ2HwlGAvKFViDfgaEK2U/edit?usp=sharing

76 Protein coding genes (Sheet 1)
1820 genes of other types (ncRNA, pseudo, rRNA, scRNA, tRNA, biological-region) (Sheet 2)
24,925 genes whose name starts with LOC.. (Sheet 3) (not sure if needed)
373 miRNA genes (Sheet 4)

The above miRNAs might be covered in GN, but it's not easy to compare since the portal has negative Entrez IDs and GN has positive IDs.
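
A coverage check of this kind could be sketched roughly as follows. This assumes, as a workaround for the negative-ID problem, that genes with a non-positive portal ID are matched by symbol instead of by Entrez ID; whether miRNA symbols actually agree between the portal and GN is untested here. The KMT2B and KMT2D IDs come from the table earlier in this thread; every other record is a made-up placeholder.

```python
from collections import Counter, namedtuple

Gene = namedtuple("Gene", ["entrez_id", "symbol", "gene_type"])

# Toy records. Only the KMT2B/KMT2D IDs are real (taken from this thread);
# the *-EXAMPLE rows and their IDs are hypothetical placeholders.
portal_genes = [
    Gene(9757, "KMT2B", "protein-coding"),
    Gene(8085, "KMT2D", "protein-coding"),
    Gene(-1, "MIR-EXAMPLE", "miRNA"),
    Gene(999999901, "LOC-EXAMPLE", "ncRNA"),
]
gn_genes = [
    Gene(9757, "KMT2B", "protein-coding"),
    Gene(8085, "KMT2D", "protein-coding"),
    Gene(999999902, "MIR-EXAMPLE", "miRNA"),
]


def unmatched(portal, reference):
    """Return portal genes that have no match in the reference list.

    Positive Entrez IDs are matched by ID. Non-positive IDs are treated as
    portal-internal and matched by upper-cased symbol instead (an assumption).
    """
    ref_ids = {g.entrez_id for g in reference}
    ref_symbols = {g.symbol.upper() for g in reference}
    missing = []
    for gene in portal:
        if gene.entrez_id > 0:
            found = gene.entrez_id in ref_ids
        else:
            found = gene.symbol.upper() in ref_symbols
        if not found:
            missing.append(gene)
    return missing


missing = unmatched(portal_genes, gn_genes)
by_type = Counter(g.gene_type for g in missing)
print([g.symbol for g in missing])  # ['LOC-EXAMPLE']
print(dict(by_type))                # {'ncRNA': 1}
```

Grouping the unmatched genes by type mirrors the four-sheet split above and makes it easier to decide which categories matter.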

jjgao commented 5 years ago

I looked at a few protein-coding genes:

GAGE3 was withdrawn: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:4100
FAM240C is a placeholder symbol (not sure what that means): https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:54200
ZNF765-ZNF761 does not seem to have a hugo symbol: https://www.ncbi.nlm.nih.gov/gene/110116772
LITAFD seems regular: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:53927

We should do a more systematic analysis. Before doing that, I am wondering if you could add a couple more columns to the spreadsheet so that we know how much data there is for each gene in the public portal? @rmadupuri

number of mutations of each gene in mutation_event table
number of rows of each gene in genetic_alteration table
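
The two counts could be gathered with a query along these lines. This is only a sketch run against an in-memory SQLite stand-in: the table names come from this comment, but the column names, the `unmatched_gene` helper table, and all rows are assumptions, and the real portal schema may key these tables differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in schema; column names are assumptions, not the real DDL.
conn.executescript(
    """
    CREATE TABLE unmatched_gene (entrez_gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE mutation_event (mutation_event_id INTEGER PRIMARY KEY,
                                 entrez_gene_id INTEGER);
    CREATE TABLE genetic_alteration (genetic_profile_id INTEGER,
                                     entrez_gene_id INTEGER);
    """
)

# Placeholder data: GENE_A has 2 mutations and 3 alteration rows,
# GENE_B has 1 mutation and no alteration rows.
conn.executemany("INSERT INTO unmatched_gene VALUES (?, ?)",
                 [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany("INSERT INTO mutation_event VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2)])
conn.executemany("INSERT INTO genetic_alteration VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 1)])

# Correlated subqueries keep the two counts independent; joining both tables
# at once would multiply the rows and inflate the counts.
QUERY = """
    SELECT g.symbol,
           (SELECT COUNT(*) FROM mutation_event m
             WHERE m.entrez_gene_id = g.entrez_gene_id) AS n_mutations,
           (SELECT COUNT(*) FROM genetic_alteration a
             WHERE a.entrez_gene_id = g.entrez_gene_id) AS n_alteration_rows
      FROM unmatched_gene g
     ORDER BY g.symbol
"""
counts = conn.execute(QUERY).fetchall()
print(counts)  # [('GENE_A', 2, 3), ('GENE_B', 1, 0)]
```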

jjgao commented 5 years ago

Once we switch, it is also an opportunity to switch to previous symbols instead of synonyms and hopefully remove a lot of ambiguity, e.g. MLL2. We may be able to remove this file too: https://github.com/cBioPortal/cbioportal/blob/master/core/src/main/resources/gene_symbol_disambiguation.txt.
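
To illustrate the MLL2 case with the KMT2B and KMT2D values from the table earlier in this thread: MLL2 is listed as a synonym of KMT2B but as a previous symbol of KMT2D, so a lookup that prefers previous symbols resolves it to a single gene. A minimal sketch; the `resolve` helper and its precedence order (current symbol, then previous symbol, then synonym) are assumptions, not existing cBioPortal behavior.

```python
from collections import defaultdict

# Values copied from the KMT2B / KMT2D columns of the table in this thread.
GENES = {
    "KMT2B": {
        "synonyms": ["KIAA0304", "MLL2", "TRX2", "HRX2", "WBP7",
                     "MLL1B", "MLL4", "CXXC10"],
        "previous_symbols": [],
    },
    "KMT2D": {
        "synonyms": ["ALR", "MLL4", "CAGL114"],
        "previous_symbols": ["TNRC21", "MLL2"],
    },
}


def build_index(genes, field):
    """Map each upper-cased alias in `field` to the set of current symbols."""
    index = defaultdict(set)
    for symbol, record in genes.items():
        for alias in record[field]:
            index[alias.upper()].add(symbol)
    return index


by_synonym = build_index(GENES, "synonyms")
by_previous = build_index(GENES, "previous_symbols")


def resolve(alias):
    """Resolve an alias: current symbol first, then previous symbol, then synonym."""
    key = alias.upper()
    if key in GENES:
        return {key}
    return by_previous.get(key) or by_synonym.get(key, set())


print(resolve("mll2"))          # {'KMT2D'}
print(sorted(resolve("MLL4")))  # ['KMT2B', 'KMT2D']
```

MLL4 is a synonym of both genes and a previous symbol of neither, so it stays ambiguous under this scheme; previous symbols reduce the ambiguity but do not eliminate it.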

n1zea144 commented 5 years ago

Hi @jjgao - @inodb and I met yesterday to discuss this effort and we have some thoughts about this that I will outline in a google doc / rfc. I'll link back once I have a draft.

n1zea144 commented 5 years ago

@jjgao Per our discussion, we will revisit the utility of Entrez or Ensembl IDs in a later effort. With that, we can build out the roadmap of work to be done on this issue. cc: @inodb

Here is the link to the RFC: Gene Data RFC

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

inodb commented 4 years ago

still an issue

jjgao commented 4 years ago

@inodb what's the status on this? Should we turn it into an epic?

ritikakundra commented 4 years ago

@inodb @jjgao This is now part of our scrum planning (and has been for 2 weeks). We have been working on finding the best source for our gene table and on determining what kind of data loss each option would lead to.

jjgao commented 3 years ago

@inodb @yichaoS can this be closed?

jjgao commented 3 years ago

Closing this. Please create new issues if needed.

cBioPortal / cbioportal

Normalize Reference Information for Genes #6189