cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Normalize Reference Information for Genes #6189

Closed inodb closed 3 years ago

inodb commented 5 years ago

Currently there is one gene-to-transcript mapping in cBioPortal and another in Genome Nexus (https://github.com/genome-nexus/genome-nexus-importer/blob/master/data/grch37_ensembl92/export/ensembl_biomart_canonical_transcripts_per_hgnc.txt). We should try to consolidate them into a single one.
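A minimal sketch of how the two mappings could be compared before consolidating them, assuming each one can be exported as tab-separated text with one gene symbol column and one transcript column. The column names, the helper names `load_mapping` and `diff_mappings`, and the inline sample rows are assumptions for illustration only; they are not the contents or layout of either real file.

```python
import csv
import io


def load_mapping(tsv_text, gene_col, transcript_col):
    """Return {gene symbol: transcript ID} parsed from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[gene_col]: row[transcript_col] for row in reader}


def diff_mappings(a, b):
    """Return (genes only in a, genes only in b, genes whose transcripts differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = {
        gene: (a[gene], b[gene])
        for gene in sorted(set(a) & set(b))
        if a[gene] != b[gene]
    }
    return only_a, only_b, mismatched


# Hypothetical sample rows, for illustration only.
portal_tsv = (
    "hgnc_symbol\tensembl_canonical_transcript\n"
    "KMT2B\tENST00000420124\n"
    "KMT2D\tENST00000301067\n"
)
genome_nexus_tsv = (
    "hgnc_symbol\tensembl_canonical_transcript\n"
    "KMT2B\tENST00000222270\n"
    "KMT2D\tENST00000301067\n"
    "ARID3B\tENST00000622429\n"
)

portal = load_mapping(portal_tsv, "hgnc_symbol", "ensembl_canonical_transcript")
genome_nexus = load_mapping(genome_nexus_tsv, "hgnc_symbol", "ensembl_canonical_transcript")
only_portal, only_gn, mismatched = diff_mappings(portal, genome_nexus)
```

Genes that land in `mismatched` are the ones that would need a decision on which source wins before a single mapping can replace both.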

Further discussion:

Check:

CC @sheridancbio @n1zea144 @zhx828 @jjgao

jjgao commented 5 years ago

Thanks, @inodb.

I just copied all the columns below; let's discuss which ones to keep or add.

| hgnc_symbol | KMT2B | KMT2D |
| --- | --- | --- |
| ensembl_canonical_gene | ENSG00000272333 | ENSG00000167548 |
| ensembl_canonical_transcript | ENST00000222270 | ENST00000301067 |
| genome_nexus_canonical_transcript | ENST00000222270 | ENST00000301067 |
| uniprot_canonical_transcript | ENST00000420124 | ENST00000301067 |
| mskcc_canonical_transcript | ENST00000222270 | ENST00000301067 |
| hgnc_id | HGNC:15840 | HGNC:7133 |
| approved_name | lysine methyltransferase 2B | lysine methyltransferase 2D |
| locus_group | protein-coding gene | protein-coding gene |
| locus_type | gene with protein product | gene with protein product |
| status | Approved | Approved |
| chromosome | 19q13.12 | 12q13.12 |
| location_sortable | 19q13.12 | 12q13.12 |
| synonyms | KIAA0304, MLL2, TRX2, HRX2, WBP7, MLL1B, MLL4, CXXC10 | ALR, MLL4, CAGL114 |
| alias_name | myeloid/lymphoid or mixed-lineage leukemia (trithorax homolog, Drosophila) 4, Histone-lysine N-methyltransferase 2B | histone-lysine N-methyltransferase 2D |
| previous_symbols | | TNRC21, MLL2 |
| prev_name | lysine (K)-specific methyltransferase 2B | trinucleotide repeat containing 21, myeloid/lymphoid or mixed-lineage leukemia 2, lysine (K)-specific methyltransferase 2D |
| gene_family | PHD finger proteins, Zinc fingers CXXC-type, Lysine methyltransferases, SET domain containing | PHD finger proteins, Lysine methyltransferases, Trinucleotide repeat containing, SET domain containing |
| gene_family_id | 88, 136, 487, 1399 | 88, 487, 775, 1399 |
| date_approved_reserved | 5/9/13 | 10/14/98 |
| date_symbol_changed | | 5/9/13 |
| date_name_changed | 2/12/16 | 2/12/16 |
| date_modified | 3/6/18 | 3/6/18 |
| entrez_gene_id | 9757 | 8085 |
| vega_id | OTTHUMG00000048119 | OTTHUMG00000166524 |
| ucsc_id | | |
| accession_numbers | AJ007041 | AF010403 |
| refseq_ids | NM_014727 | NM_003482 |
| ccds_id | CCDS46055 | CCDS44873 |
| uniprot_id | Q9UMN6 | O14686 |
| pubmed_id | 10409430, 10637508 | 9247308 |
| mgd_id | MGI:109565 | MGI:2682319 |
| rgd_id | RGD:7678027 | RGD:2324324 |
| lsdb | | |
| cosmic | | KMT2D |
| omim_id | 606834 | 602113 |
| mirbase | | |
| homeodb | | |
| snornabase | | |
| bioparadigms_slc | | |
| orphanet | | 239011 |
| pseudogene.org | | |
| horde_id | | |
| merops | | |
| imgt | | |
| iuphar | objectId:2689 | objectId:2691 |
| kznf_gene_catalog | | |
| mamit-trnadb | | |
| cd | | |
| lncrnadb | | |
| enzyme_id | | |
| intermediate_filament_db | | |
| rna_central_ids | | |

jjgao commented 5 years ago

I looked at our cBioPortal database, and I think the data above covers everything we need. (We don't have length, but I think we can remove the LENGTH column in the GENE table now; it was previously used in the Mutated Genes tab.)

@n1zea144 could someone in your group also take a look, e.g. check that all genes in the current portal database are covered?

@zhx828 could you check if all genes in oncokb are covered?
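A coverage check of this kind could be sketched as a set difference on Entrez gene IDs. This is only an illustration: the function name `uncovered_genes`, the input shapes, and the third sample gene are assumptions, not the actual portal schema or data.

```python
def uncovered_genes(portal_genes, reference_entrez_ids):
    """Return (entrez_id, symbol) pairs from the portal that the reference lacks.

    portal_genes: dict mapping Entrez gene ID -> gene symbol.
    reference_entrez_ids: iterable of Entrez gene IDs in the candidate reference.
    """
    reference = set(reference_entrez_ids)
    return sorted(
        (entrez_id, symbol)
        for entrez_id, symbol in portal_genes.items()
        if entrez_id not in reference
    )


# The first two entries come from the columns above; the third is a made-up
# placeholder standing in for a gene missing from the reference.
portal_genes = {9757: "KMT2B", 8085: "KMT2D", 999999999: "HYPOTHETICAL_GENE"}
reference_ids = [9757, 8085]

missing = uncovered_genes(portal_genes, reference_ids)
```

The same function could be pointed at an OncoKB gene list instead of the portal gene table, assuming both are keyed by Entrez gene ID.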

zhx828 commented 5 years ago

@inodb @jjgao I think that, due to recent gene updates in the portal, some genes in OncoKB no longer match GN and the portal. I will need to update the genes in the next release. https://docs.google.com/spreadsheets/d/1mqmH1ccKWli7te7L8v0lIQh6uWbnSf2gu07l76RL-gI/edit?usp=sharing

n1zea144 commented 5 years ago

I will ask one of the curators (probably @rmadupuri) to take a look at gene coverage.

zhx828 commented 5 years ago

@inodb These two genes in GN use different UniProt isoforms compared with the vcf2maf UniProt file:

| gene | isoform |
| --- | --- |
| HIST1H2BO | ENST00000616182 |
| ARID3B | ENST00000622429 |

jjgao commented 5 years ago

@inodb @zhx828 @n1zea144 @rmadupuri please prioritize this one. This will fix a couple of existing issues, e.g. #5910 and cBioPortal/datahub#540 and will give us a clean start to re-import all studies.

rmadupuri commented 5 years ago

Hi @jjgao, the following genes in the database did not have matches in GN. I divided them into four sheets. https://docs.google.com/spreadsheets/d/1JCx12E86TGbMydRuzwFSUVgJ2HwlGAvKFViDfgaEK2U/edit?usp=sharing

  1. 76 Protein coding genes (Sheet 1)
  2. 1820 genes of other types (ncRNA, pseudo, rRNA, scRNA, tRNA, biological-region) (Sheet 2)
  3. 24,925 genes whose name starts with LOC.. (Sheet 3) (not sure if needed)
  4. 373 miRNA genes (Sheet 4)

The miRNAs above might already be covered in GN, but it's not easy to compare because the portal uses negative Entrez IDs for them and GN uses positive IDs.
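One way around the ID mismatch would be to compare on gene symbol rather than Entrez ID. The sketch below assumes that symbols are comparable across the two sources after trimming and upper-casing, which would need to be verified; the function name, symbols, and IDs are all hypothetical.

```python
def match_by_symbol(portal_genes, reference_genes):
    """Match genes across two sources by normalized symbol.

    Both arguments are iterables of (entrez_id, symbol) pairs.
    Returns ({portal entrez_id: reference entrez_id}, [unmatched portal pairs]).
    """
    ref_by_symbol = {
        symbol.strip().upper(): entrez_id for entrez_id, symbol in reference_genes
    }
    matched, unmatched = {}, []
    for entrez_id, symbol in portal_genes:
        key = symbol.strip().upper()
        if key in ref_by_symbol:
            matched[entrez_id] = ref_by_symbol[key]
        else:
            unmatched.append((entrez_id, symbol))
    return matched, unmatched


# Hypothetical data: negative IDs on the portal side, positive in the reference.
portal_mirnas = [(-1, "mir_a"), (-2, "MIR_B")]
reference_mirnas = [(1001, "MIR_A")]

matched, unmatched = match_by_symbol(portal_mirnas, reference_mirnas)
```

Anything left in `unmatched` would still need manual review, since a symbol difference does not prove the gene is absent from the reference.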

jjgao commented 5 years ago

I looked at a few protein-coding genes:

We should do a more systematic analysis. Before doing that, I am wondering if you can help add a couple more columns to the spreadsheet so that we know how much data there is for each gene in the public portal? @rmadupuri

jjgao commented 5 years ago

related issue https://github.com/cBioPortal/cbioportal/issues/6432

jjgao commented 5 years ago

Once we switch, it is also an opportunity to switch to previous symbols instead of synonyms and hopefully remove much of the ambiguity, e.g. MLL2. We may be able to remove this file too: https://github.com/cBioPortal/cbioportal/blob/master/core/src/main/resources/gene_symbol_disambiguation.txt.
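The MLL2 case can be illustrated with the synonyms and previous symbols listed earlier in this thread for KMT2B and KMT2D. This is a minimal sketch, not the portal's actual alias resolution; the `GENES` structure and both function names are assumptions.

```python
from collections import defaultdict

# Values copied from the HGNC columns quoted earlier in this thread.
GENES = {
    "KMT2B": {
        "synonyms": ["KIAA0304", "MLL2", "TRX2", "HRX2", "WBP7", "MLL1B", "MLL4", "CXXC10"],
        "previous_symbols": [],
    },
    "KMT2D": {
        "synonyms": ["ALR", "MLL4", "CAGL114"],
        "previous_symbols": ["TNRC21", "MLL2"],
    },
}


def build_alias_index(genes, fields):
    """Map each upper-cased alias to the set of current symbols that claim it."""
    index = defaultdict(set)
    for symbol, record in genes.items():
        for field in fields:
            for alias in record[field]:
                index[alias.upper()].add(symbol)
    return index


def resolve(alias, index):
    """Return the single current symbol for alias, or None if unknown or ambiguous."""
    hits = index.get(alias.upper(), set())
    return next(iter(hits)) if len(hits) == 1 else None


with_synonyms = build_alias_index(GENES, ["synonyms", "previous_symbols"])
previous_only = build_alias_index(GENES, ["previous_symbols"])

print(resolve("mll2", with_synonyms))  # None: claimed by both KMT2B and KMT2D
print(resolve("mll2", previous_only))  # KMT2D
```

The trade-off is that aliases recorded only as synonyms, such as MLL4 here, no longer resolve at all when the index is built from previous symbols alone.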

n1zea144 commented 5 years ago

Hi @jjgao - @inodb and I met yesterday to discuss this effort and we have some thoughts about this that I will outline in a google doc / rfc. I'll link back once I have a draft.

n1zea144 commented 5 years ago

@jjgao Per our discussion, we will revisit the utility of Entrez or Ensembl IDs in a later effort. With that, we can build out the roadmap of work to be done on this issue. cc: @inodb

Here is the link to the RFC: Gene Data RFC

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

inodb commented 4 years ago

still an issue

jjgao commented 4 years ago

@inodb what's the status on this? Should we turn it into an epic?

ritikakundra commented 4 years ago

@inodb @jjgao This is now part of our scrum planning (it has been for 2 weeks). We have been working on finding the best source for our gene table and on determining what kind of data loss each option would lead to.

jjgao commented 3 years ago

@inodb @yichaoS can this be closed?

jjgao commented 3 years ago

Closing this. Please create new issues if needed.