cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Clarification needed on how to handle missing gene panel identifiers in data files #10871

Open sheridancbio opened 5 days ago

sheridancbio commented 5 days ago

This relates to cases where a study contains a sample which appears to be part of a genetic profile, but the sample is not present in data_gene_matrix.txt, or the gene panel id value is 'NA' or missing for a sample which is present in data_gene_matrix.txt.
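To make the cases concrete, here is a minimal sketch of how they could be detected for one profile. It assumes data_gene_matrix.txt is tab-delimited with a SAMPLE_ID column and one column per profile (here `mutations`). The function name `find_unresolved_panels`, the sample ids, and the panel id `IMPACT468` are illustrative only and are not taken from the importer or validator code.

```python
import csv
import io


def find_unresolved_panels(matrix_text, profile_samples, column="mutations"):
    """Group a profile's samples by why no gene panel can be resolved.

    Hypothetical helper: not part of the cBioPortal importer or validator.
    """
    reader = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    # Map each sample id to its panel value; a missing cell becomes "".
    panels = {row["SAMPLE_ID"]: (row.get(column) or "").strip() for row in reader}

    # Samples in the profile that have no row at all in the matrix file.
    absent = sorted(s for s in profile_samples if s not in panels)
    # Samples that have a row, but whose panel value is 'NA' or empty.
    na_or_empty = sorted(
        s for s in profile_samples if s in panels and panels[s] in ("", "NA")
    )
    return {"absent": absent, "na_or_empty": na_or_empty}


# Illustrative input: S1 has a panel, S2 is 'NA', S3 is empty, S4 has no row.
example = "SAMPLE_ID\tmutations\nS1\tIMPACT468\nS2\tNA\nS3\t\n"
print(find_unresolved_panels(example, ["S1", "S2", "S3", "S4"]))
# -> {'absent': ['S4'], 'na_or_empty': ['S2', 'S3']}
```

Whether each group should be an error, a warning, or accepted is exactly what this issue asks to have specified, so the sketch only classifies samples and does not decide.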

Import into the raw cBioPortal data tables (i.e. sample_profile)

During import into the cBioPortal database, the values from data_gene_matrix.txt are loaded into the table sample_profile. According to the file format documentation here: https://docs.cbioportal.org/file-formats/#gene-panel-matrix-file we have this direction: "When the sample is not profiled on a gene panel, or if the sample is not profiled at all, use NA as value. If the sample is profiled for mutations, make sure it is also in the _sequenced case list." I think this specification should be clarified. My reading of this is that:

I would expect these conditions to be flagged as errors during validation:

If my understanding is correct, I think the documentation should be made more specific so that it asserts these rules clearly. Additionally, the importer codebase should be tested. Currently it appears that samples may be unmentioned in (absent from) data_gene_matrix.txt and import can still succeed. The result for a sample which is not mentioned in data_gene_matrix.txt seems to depend on whether detected mutation events are present in data_mutations_extended.txt: samples which are listed in case_lists/cases_sequenced.txt, but which have no detected mutation events and are not listed in data_gene_matrix.txt, are imported into the database (sample_profile) without a recorded gene panel and appear to be unsequenced in certain contexts.

Importer unit tests should be written for all combinations of these conditions: presence/absence in data_gene_matrix.txt; mutations column value (NA / valid_panel / invalid_panel_id); presence/absence in case_lists/cases_sequenced.txt; and samples with/without detected importable (non-silent) mutations in data_mutations_extended.txt. The business logic should be adjusted to properly handle each test case. The validator should also be updated to properly validate the requirements.
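As a rough sketch of what "all condition combinations" means, the four conditions can be enumerated as a table of test cases. The names below (`enumerate_cases` and the dictionary keys) are hypothetical, and the expected outcome of each case is deliberately left out because it is what this issue asks to have decided. The sketch assumes that the mutations column value only applies when the sample has a row in data_gene_matrix.txt.

```python
from itertools import product

# Condition values taken from the list in this issue.
PANEL_VALUES = ("NA", "valid_panel", "invalid_panel_id")


def enumerate_cases():
    """Build one dict per combination of the four conditions."""
    cases = []
    for in_matrix, in_sequenced_list, has_mutations in product(
        (True, False), repeat=3
    ):
        # A sample absent from the matrix has no mutations column value.
        panel_values = PANEL_VALUES if in_matrix else (None,)
        for panel_value in panel_values:
            cases.append(
                {
                    "in_gene_matrix": in_matrix,
                    "mutations_column": panel_value,
                    "in_cases_sequenced": in_sequenced_list,
                    "has_importable_mutations": has_mutations,
                }
            )
    return cases


cases = enumerate_cases()
print(len(cases))  # 12 cases with a matrix row + 4 without = 16
```

A table like this could drive parameterized tests for either the importer or the validator, once the expected result for each row has been agreed on.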

Another thing to be specified is what representation should be present (if any) in the database table sample_profile for a sample which was:

The PANEL_ID field is an integer. If a sample was not sequenced, should it be present in or absent from sample_profile, and if present, should the PANEL_ID value be null? If a sample was sequenced with WGS/WES, should it be present, and if so, what value should PANEL_ID hold?

sheridancbio commented 5 days ago

This issue was created after review of #10867 (@haynescd)