cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0
637 stars 492 forks source link

Violin plot discrepancy #11112

Open alisman opened 3 days ago

alisman commented 3 days ago

@onursumer i'm seeing a discrepancy in the sample ids in the response. Looks like counts are matching.

Clickhouse: image

Legacy image

onursumer commented 3 days ago

We get the sampleId for Clinical Data from clinical_data_derived,

https://github.com/cBioPortal/cbioportal/blob/7cd50ec7eeba803d673dfea012b953bddb09424f/src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewMapper.xml#L92-L99

and we get the sampleId for clinical_data_derived from sample_derived.

https://github.com/cBioPortal/cbioportal/blob/93dca943980cbaacadd39ac42ad1671ab33a2303/src/main/resources/db-scripts/clickhouse/clickhouse.sql#L226-L233

In the sample_derived we construct the sample unique id by adding the study id in front of the sample stable id

https://github.com/cBioPortal/cbioportal/blob/93dca943980cbaacadd39ac42ad1671ab33a2303/src/main/resources/db-scripts/clickhouse/clickhouse.sql#L66-L67

Legacy implementation, on the other hand, just uses stable ids when generating the clinical data.

https://github.com/cBioPortal/cbioportal/blob/f7fdc1a4e5247d5b07066c1f4d17b5e9a0089f0b/src/main/resources/org/cbioportal/persistence/mybatis/ClinicalDataMapper.xml#L8-L9

This is why we end up with sample ids like skcm_mskcc_2014_CR04885 in the clickhouse implementation instead of CR04885.

onursumer commented 3 days ago

Modifying the getSampleClinicalDataFromStudyViewFilter mapper to use the stable id instead of the unique id should fix this discrepancy, but we need to make sure that it doesn't break something else.

alisman commented 2 days ago

@onursumer just adding what i think you already know that the problem is in convertPatientClinicalDataToSampleClinicalData where we lose data presumably because of overlapping ids or something.