Multiple Reference Genome Migration - defaults and warnings

sheridancbio commented 5 years ago

The 3.1.0 release includes multiple reference genome support. Part of this support is a database migration that assigns a reference genome to existing studies. Each study is assigned a reference genome id based on the NCBI_BUILD present in the mutation event records associated with that study.

The migration assumes that only one NCBI_BUILD value is returned per study; however, this is not always the case (potentially due to incorrect data in the database).

Requested improvements are:

Make the default reference genome id during migration hg19 (to match the default import behavior when the reference genome build is not specified). The current default is hg38.
Modify the import_db.py script so that a check is performed before the migration step that introduces multi-reference genome support. The check will examine all mutation_event records (on a study-by-study basis) and determine whether the NCBI_BUILD values are consistent. If they are not, a warning with details is displayed to the user and the program exits (unless a "--force" command line argument was provided when the script was run). The warning should indicate which genome build would have been set if forced (the genome build setting may come from the portal.properties file).
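
To make the request concrete, here is a minimal sketch of the check in Python. It is only an illustration under stated assumptions, not the actual migration code: the function names, the `BUILD_TO_GENOME` mapping, and the `(study_id, ncbi_build)` row shape are all hypothetical, and the rows are passed in directly rather than queried from the database.

```python
import argparse
from collections import defaultdict

# Requested default, matching import behavior when no build is specified.
DEFAULT_GENOME = "hg19"

# Hypothetical mapping from NCBI_BUILD strings to reference genome names.
BUILD_TO_GENOME = {"37": "hg19", "GRCh37": "hg19", "38": "hg38", "GRCh38": "hg38"}


def builds_by_study(rows):
    """Group distinct NCBI_BUILD values by study.

    rows is an iterable of (study_id, ncbi_build) pairs (an assumed shape).
    """
    builds = defaultdict(set)
    for study_id, ncbi_build in rows:
        if ncbi_build:  # ignore NULL or empty values
            builds[study_id].add(ncbi_build)
    return builds


def find_inconsistent_studies(rows):
    """Return {study_id: sorted builds} for studies with more than one build."""
    return {
        study: sorted(builds)
        for study, builds in builds_by_study(rows).items()
        if len(builds) > 1
    }


def genome_for_study(builds, default=DEFAULT_GENOME):
    """Pick the genome when all builds map to one known genome, else the default."""
    genomes = {BUILD_TO_GENOME.get(build) for build in builds}
    if len(genomes) == 1:
        (genome,) = genomes
        if genome is not None:
            return genome
    return default


def check_before_migration(rows, force=False, default=DEFAULT_GENOME):
    """Warn about inconsistent studies; return True if the migration may proceed."""
    inconsistent = find_inconsistent_studies(rows)
    for study, builds in sorted(inconsistent.items()):
        print(
            "WARNING: study %s has inconsistent NCBI_BUILD values %s; "
            "reference genome would be set to %s if forced" % (study, builds, default)
        )
    return force or not inconsistent


def parse_args(argv=None):
    """Parse the proposed --force flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--force", action="store_true")
    return parser.parse_args(argv)
```

In the real script the rows would presumably come from a query over the mutation event records, and the `default` argument could be filled from portal.properties. The caller would exit when `check_before_migration` returns False.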

cc: @cBioPortal/importer-pipelines

khzhu commented 5 years ago

Thanks, @sheridancbio! I am working on it and will let you know once I am done or if I run into any issues.

khzhu commented 5 years ago

@sheridancbio, I have updated the migration script and would like to know which branch the PR should be based on (release 3.1.0?). Thanks!

khzhu commented 5 years ago

@sheridancbio @n1zea144, the PR is ready for your review. Please let me know if I have missed anything or if anything needs to be changed. Thank you!

sheridancbio commented 5 years ago

Issues have been addressed

cBioPortal / cbioportal

Multiple Reference Genome Migration - defaults and warnings #6566