cBioPortal / datahub

A centralized location for storing curated data from cBioPortal

importGenePanel.pl fails to import gene_panel due to missing gene #1441

Closed mkolb22 closed 2 years ago

mkolb22 commented 3 years ago

The Esophageal Squamous Cell Carcinoma (UCLA, Nat Genet 2014) gene panel fails to import, causing the study import to fail:

study tar file: escc_ucla_2014.tar.gz
gene_panel: data_gene_panel_ucla_1202.txt
db seed: seed-cbioportal_hg19_v2.7.3.sql.gz (seed version)
db schema: 2.12.8

Warnings / Errors:

  1. Could not find gene in the database: KIAA1804; 1x
  2. Gene panel UCLA_1202 cannot be imported because one or more genes in the panel are not found in the database, or are duplicated.; 1x

Done. Total time: 2356 ms

I have encountered this issue with 36 out of the 306 studies that I have been bulk importing into our local Kubernetes cBioPortal deployment.

I checked the cbioportal datahub git repository and there is a new seed db: seed-cbioportal_hg19_v2.12.8.sql.gz. Is it possible to upgrade an existing deployment with the new seed db without wiping out the old data?

jjgao commented 3 years ago

@kronik22 thanks for contacting us. The genes in the gene panel files need to be recognized by the database. Please fix the unrecognized genes before importing.

I think you may have to reimport the studies after upgrading to a new seed database. @yichaoS do you have any comments?

mkolb22 commented 3 years ago

@jjgao What is the process to fix the unrecognized genes? A large group of the gene panel imports fail because genes cannot be found in the database, e.g. when importing gene panel data_gene_panel_agilent.txt:

Warnings / Errors:

  1. Could not find gene in the database: LOC100287632; 1x
  2. Could not find gene in the database: LOC220594; 1x
  3. Could not find gene in the database: LOC283688; 1x
  4. Could not find gene in the database: LOC283914; 1x
  5. Could not find gene in the database: LOC285419; 1x
  6. Could not find gene in the database: LOC388152; 1x
  7. Could not find gene in the database: LOC401134; 1x
  8. Gene panel Agilent cannot be imported because one or more genes in the panel are not found in the database, or are duplicated.; 1x

Done.
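When many imports fail this way, it can help to collect every unrecognized symbol from the import logs before fixing anything. The following is a minimal sketch that assumes the log lines look exactly like the "Could not find gene in the database" lines quoted above; the function name `missing_genes` and the sample text are illustrative, not part of cBioPortal.

```python
import re
from collections import Counter

# Matches reporter lines such as
# "  1. Could not find gene in the database: LOC220594; 1x"
MISSING_GENE = re.compile(r"Could not find gene in the database:\s*([^;\s]+)")


def missing_genes(log_text: str) -> Counter:
    """Count how often each unrecognized gene symbol appears in a log."""
    return Counter(MISSING_GENE.findall(log_text))


# Sample text copied from the warnings quoted in this thread.
sample_log = """\
  1. Could not find gene in the database: LOC100287632; 1x
  2. Could not find gene in the database: LOC220594; 1x
  3. Gene panel Agilent cannot be imported because one or more genes in the panel are not found in the database, or are duplicated.; 1x
"""

for symbol, count in sorted(missing_genes(sample_log).items()):
    print(symbol, count)
```

Running this over the concatenated logs of all failing studies would give one list of symbols to look up, rather than one list per study.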

I ran the following process to deploy the db in my Kubernetes environment:

  1. Ran the cgds.sql script.
  2. Used a modified version of the migration.sql script with the 2.7.3 and earlier schema changes removed, so the database updates were from 2.7.4 to 2.12.8.
  3. Ran seed-cbioportal_hg19_v2.7.3.sql to populate the db with the 2.7.3 seed data.

This allowed the v3.6.7 container image to connect to the MySQL db, run its bootstrap process, and bring the application online.

My current implementation has the same geneset version, msigdb_6.1, as the seed-cbioportal_hg19_v2.12.8.sql file. I also searched the 2.7.3 and 2.12.8 seed SQL files for the genes listed in the Agilent gene panel import output (see above), and they were not found in either. I am beginning to suspect that some of these studies depend on additional gene sets that also need to be imported before the gene panels will import. Alternatively, my db deployment process is missing something critical, or the order in which I ran the steps to build the db caused an issue.

Any advice that you can provide would be helpful, since my deployment is a one-off. I converted the 4-container cbioportal-docker-compose deployment into a Kubernetes one, using Kubernetes service discovery resources for all the containers/pods and mapping the connection strings to the Kubernetes FQDNs of the service names. It worked out really well: the deployment is incredibly stable and has enterprise resilience compared to Docker Compose.

The more I dig into the product, the more amazed I am at what the application can do!!

jjgao commented 3 years ago

@kronik22 you would need to find their HUGO gene symbols and replace them.

I am curious what data you are working on. Is it mutation data? How big (number of genes) is the gene panel?

mkolb22 commented 3 years ago

@jjgao I am just using the existing gene panels and studies from the cBioPortal datahub repository. It looks like the latest seedDB, 2.12.8, doesn't have some of the latest gene entries from 2020 and 2021; I reviewed the import logs and checked the missing genes against the NCBI Gene website, https://www.ncbi.nlm.nih.gov/gene/. Has anyone automated the input of genes into the cBioPortal database so that the study import process pulls in missing gene data programmatically? I have noticed from reviewing the study import logs that many import records are dropped because a gene is not found in the gene db tables. Are there gene sets outside of the seedDB which have these additional gene references?

jjgao commented 3 years ago

@yichaoS could you help look into this? Thanks.

yichaoS commented 3 years ago

@jjgao @kronik22

The migration script only migrates/updates the schema, not the content of the seedDB. The seedDB SQL file also needs to be imported into the database after running the migration script. The latest seedDB is also versioned as 2.12.8 and can be downloaded here: https://github.com/cBioPortal/datahub/blob/master/seedDB/seed-cbioportal_hg19_v2.12.8.sql.gz.

This new seed is based on and includes the complete HGNC Feb 2021 download, plus a small number of additions from our older gene tables.

For panel UCLA_1202: gene KIAA1804 is in our latest public database (and seedDB) as an alias symbol, associated with Entrez ID 84451, so the panel should import successfully. Could you query your local database instance to check whether this gene symbol exists there in the same way?
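One way to run that check is sketched below. It assumes the table and column names `gene` (`ENTREZ_GENE_ID`, `HUGO_GENE_SYMBOL`) and `gene_alias` (`ENTREZ_GENE_ID`, `GENE_ALIAS`); verify them against your own schema before relying on it. To keep the example self-contained it runs against an in-memory SQLite stand-in with a placeholder primary symbol. Against the real MySQL database the same two SELECT statements apply, but most MySQL drivers use `%s` instead of `?` as the parameter marker.

```python
import sqlite3


def find_gene(conn, symbol):
    """Look a symbol up as a primary symbol first, then as an alias.

    Returns a list of (entrez_gene_id, match_type) tuples.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT ENTREZ_GENE_ID, 'symbol' FROM gene WHERE HUGO_GENE_SYMBOL = ? "
        "UNION ALL "
        "SELECT ENTREZ_GENE_ID, 'alias' FROM gene_alias WHERE GENE_ALIAS = ?",
        (symbol, symbol),
    )
    return cur.fetchall()


# Toy stand-in for the local database (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (ENTREZ_GENE_ID INTEGER PRIMARY KEY, HUGO_GENE_SYMBOL TEXT);
CREATE TABLE gene_alias (ENTREZ_GENE_ID INTEGER, GENE_ALIAS TEXT);
-- EXAMPLE_SYMBOL is a placeholder, not the real primary symbol.
INSERT INTO gene VALUES (84451, 'EXAMPLE_SYMBOL');
INSERT INTO gene_alias VALUES (84451, 'KIAA1804');
""")

print(find_gene(conn, "KIAA1804"))  # [(84451, 'alias')]
```

An empty result for KIAA1804 on the local instance would suggest that the alias data there differs from the 2.12.8 seed.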

For panel agilent: We've fixed all the unrecognized genes in all panels, and the fixed panels are on datahub (https://github.com/cBioPortal/datahub/tree/master/reference_data/gene_panels). The LOC genes reported as missing above have actually been removed from the panel itself (https://media.githubusercontent.com/media/cBioPortal/datahub/master/reference_data/gene_panels/data_gene_panel_agilent.txt). Could you confirm that you are using the latest panel files from datahub?

sheridancbio commented 3 years ago

The content of the public database for cbioportal.org will not affect the attempt to import a gene panel into a local database. @yichaoS has said that all of the current gene panels are compatible with the latest seed database available from datahub (which appears to be seed-cbioportal_hg19_v2.12.8.sql.gz). I suggest that the cleanest solution to the user's issue is to rebuild the database starting from seed-cbioportal_hg19_v2.12.8.sql.gz. This will require re-importing all previously imported studies, because the rest of the database will be lost.

The second option would be to find the versions of the gene panels (from GitHub history) that match the seed database used when building the database, and import those gene panel files. Afterwards, an attempt could be made to import any missing gene panels from the current versions. Gene panels that fail to import would need to be adjusted manually by substituting, in the gene panel file, appropriate gene identifiers (which could be numeric entrez_gene_ids) for the gene symbols that could not be found. That might be a general fix for a failure to import a gene symbol in a gene panel: manually edit the file to put the correct entrez_gene_id (an integer, perhaps located by searching a public gene database such as NCBI Entrez Gene) in place of each symbol that was not found in the gene table or alias table.
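The manual edit described here could be scripted roughly as follows. This is a sketch under assumptions: it treats the panel file as plain text with a `gene_list:` line of tab-separated genes, and the helper name `substitute_genes`, the panel id, and the neighbouring genes in the sample are made up for illustration. Only the KIAA1804 to 84451 mapping comes from this thread. Whether a single panel may mix symbols and Entrez IDs is worth checking against the gene panel file format documentation first.

```python
def substitute_genes(panel_text, replacements):
    """Replace unrecognized genes on the gene_list line of a panel file.

    `replacements` maps an unrecognized symbol to the identifier that
    should take its place (a current symbol or an Entrez gene ID).
    """
    out = []
    for line in panel_text.splitlines():
        if line.startswith("gene_list:"):
            genes = line[len("gene_list:"):].strip().split("\t")
            genes = [str(replacements.get(g, g)) for g in genes]
            # Drop exact duplicate tokens while keeping the original order,
            # since the importer rejects panels with duplicated genes.
            genes = list(dict.fromkeys(genes))
            line = "gene_list: " + "\t".join(genes)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical two-line panel; real panel files contain more metadata.
panel = "stable_id: EXAMPLE_PANEL\ngene_list: GENE_A\tKIAA1804\tGENE_B\n"
print(substitute_genes(panel, {"KIAA1804": 84451}))
```

Note that the duplicate check only catches identical tokens; it cannot tell that a symbol and an Entrez ID refer to the same gene.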

sheridancbio commented 3 years ago

Also, I believe that there will be difficulty if you try to import the seed database mysql file into an existing database with loaded studies. Altering the gene table contents (in particular, dropping genes) may fail or leave the system in an inconsistent state because there are foreign key constraints linking the gene table column entrez_gene_id to fields in many other tables. Removing linked records from the gene table will either propagate to these other datatypes and drop associated records from them, or (if the checking of foreign key constraints is suppressed) the database may end up with events for genes which have no corresponding entry in the gene table. So I think it is not safe to import a seed database mysql file into an existing database.
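The two failure modes can be reproduced on a toy schema. The sketch below uses an in-memory SQLite database as a stand-in for MySQL, with two deliberately simplified tables; the table layout is an assumption for illustration, not the real cBioPortal schema. Turning off `PRAGMA foreign_keys` plays the role that suppressing foreign key checks (for example via MySQL's `FOREIGN_KEY_CHECKS` variable) would play on the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE gene (ENTREZ_GENE_ID INTEGER PRIMARY KEY);
CREATE TABLE mutation_event (
    ID INTEGER PRIMARY KEY,
    ENTREZ_GENE_ID INTEGER NOT NULL REFERENCES gene (ENTREZ_GENE_ID)
);
INSERT INTO gene VALUES (84451);
INSERT INTO mutation_event VALUES (1, 84451);
""")


def try_delete(conn, entrez_id):
    """Attempt to remove a gene row and report what happened."""
    try:
        conn.execute("DELETE FROM gene WHERE ENTREZ_GENE_ID = ?", (entrez_id,))
        return "deleted"
    except sqlite3.IntegrityError:
        return "blocked by foreign key"


# Case 1: constraints enforced, so the delete is rejected.
enforced = try_delete(conn, 84451)
conn.rollback()  # end the implicit transaction before changing the pragma

# Case 2: constraint checking suppressed, so the delete goes through
# and leaves an event that points at a gene that no longer exists.
conn.execute("PRAGMA foreign_keys = OFF")
unchecked = try_delete(conn, 84451)
orphans = conn.execute(
    "SELECT COUNT(*) FROM mutation_event m "
    "LEFT JOIN gene g ON g.ENTREZ_GENE_ID = m.ENTREZ_GENE_ID "
    "WHERE g.ENTREZ_GENE_ID IS NULL"
).fetchone()[0]

print(enforced, "|", unchecked, "|", orphans, "orphaned event(s)")
```

The same LEFT JOIN pattern could be adapted to look for orphaned rows in a real database if a seed file has already been loaded over existing studies.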

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.