cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Additional review needed for the identification of WGS/WES samples in clickhouse table development #10872

Open sheridancbio opened 3 months ago

sheridancbio commented 3 months ago

This relates to cases where a study contains a sample that appears to be part of a genetic profile, but either the sample is not present in data_gene_matrix.txt, or the sample is present in data_gene_matrix.txt and its gene panel id value is 'NA' or missing.
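To make the two cases concrete, here is a minimal Python sketch of how such samples could be picked out of a gene matrix file. It assumes a tab-delimited file with a `SAMPLE_ID` column and one column per profile (here called `mutations`); the function name `find_unresolved_samples`, the sample ids, and the panel name `PANEL_A` are hypothetical and used only for illustration. It is not the importer's actual logic.

```python
import csv
import io


def find_unresolved_samples(matrix_text, profile_column, profiled_samples):
    """Split profiled samples into the two problem cases.

    Returns (absent, no_panel):
      absent   - samples in the profile but not listed in the matrix at all
      no_panel - samples listed in the matrix whose panel id is 'NA' or empty
    """
    reader = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    # Map each sample id to its (possibly empty) panel id for this profile.
    panel_by_sample = {
        row["SAMPLE_ID"]: (row.get(profile_column) or "").strip()
        for row in reader
    }
    absent = sorted(s for s in profiled_samples if s not in panel_by_sample)
    no_panel = sorted(
        s for s in profiled_samples
        if s in panel_by_sample and panel_by_sample[s] in ("", "NA")
    )
    return absent, no_panel


# Hypothetical matrix content: S1 has a panel, S2 is 'NA', S3 is blank.
demo_matrix = "SAMPLE_ID\tmutations\nS1\tPANEL_A\nS2\tNA\nS3\t\n"
# S4 has data in the profile but no row in the matrix.
absent, no_panel = find_unresolved_samples(
    demo_matrix, "mutations", {"S1", "S2", "S3", "S4"}
)
print(absent)    # ['S4']
print(no_panel)  # ['S2', 'S3']
```

Keeping the two lists separate mirrors the distinction drawn above: a sample missing from the file entirely versus a sample that is listed but has no usable panel id.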

Translation of raw cBioPortal database tables into derived ClickHouse tables (e.g. sample_to_gene_panel_derived)

Scripts have been developed to produce flattened tables and views for clickhouse development efforts underway. See: https://github.com/cBioPortal/cbioportal/blob/79d36e73f1aeff6d0ab4697e77aa210752772ad6/src/main/resources/db-scripts/clickhouse/clickhouse.sql#L17

These scripts attempt to connect the PANEL_ID field from the sample_profile table to the panels present in the gene_panel table. If there is no connecting gene panel, the value 'WES' is used in place of the (missing) gene panel stable id. This logic should be considered in combination with the discussion around #10871, where 'NA' values in data_gene_matrix.txt might or might not be present, and where the resulting imported data might or might not introduce a record into sample_profile, depending on whether detected non-silent mutations were imported into the mutations table for the sample.
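The fallback described above can be illustrated with a small, self-contained Python sketch using an in-memory SQLite database. The schema here is a heavily simplified assumption (only `sample_profile.panel_id`, `gene_panel.internal_id`, and `gene_panel.stable_id` are modeled), the rows are made-up demo data, and SQLite's `COALESCE` stands in for whatever null-handling function the ClickHouse script uses. See the linked clickhouse.sql for the real statement.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene_panel (internal_id INTEGER PRIMARY KEY, stable_id TEXT);
        CREATE TABLE sample_profile (
            sample_id INTEGER, genetic_profile_id INTEGER, panel_id INTEGER
        );
        INSERT INTO gene_panel VALUES (1, 'PANEL_A');
        -- Sample 101 is linked to a gene panel; sample 102 has no panel.
        INSERT INTO sample_profile VALUES (101, 7, 1);
        INSERT INTO sample_profile VALUES (102, 7, NULL);
        """
    )
    return conn


def derive_gene_panel_ids(conn):
    """Return (sample_id, gene_panel_id), substituting 'WES' when no panel joins."""
    query = """
        SELECT sp.sample_id,
               COALESCE(gp.stable_id, 'WES') AS gene_panel_id
        FROM sample_profile AS sp
        LEFT JOIN gene_panel AS gp ON sp.panel_id = gp.internal_id
        ORDER BY sp.sample_id
    """
    return conn.execute(query).fetchall()


rows = derive_gene_panel_ids(build_demo_db())
print(rows)  # [(101, 'PANEL_A'), (102, 'WES')]
```

In this sketch, any sample_profile row without a matching gene panel is labeled 'WES', whatever the reason for the missing link, so a genuinely WGS/WES sample and a sample whose panel id was 'NA' or missing cannot be told apart in the derived output.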

Once the expected data representation in sample_profile is determined and specified for WGS/WES and for non-profiled samples, the logic in these scripts should be examined and updated if necessary.

sheridancbio commented 3 months ago

This issue was created after review of #10867 (@haynescd)