cBioPortal / rfc80-team

repository to hold issues for the rfc80 development / deployment team

Number of Profiled Cases count is off from legacy mutated-genes endpoint #20

Open haynescd opened 1 month ago

haynescd commented 1 month ago

Currently, the new (ClickHouse) endpoint for fetching AlterationsCountByGenes for mutations (/api/mutated-genes/fetch) returns a totalProfiledCases count that is 4 below the legacy count.

Difference found at TFRC.numberOfProfiledCases: (Legacy) 13638 != (New) 13634
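A comparison like the one that surfaced this difference can be sketched as follows. This is only an illustration, not the actual test harness: the function name `diff_profiled_counts` is hypothetical, and keying rows by `hugoGeneSymbol` is an assumption about the response shape. Only `numberOfProfiledCases` and the two TFRC values come from this issue.

```python
def diff_profiled_counts(legacy, new, key="hugoGeneSymbol",
                         field="numberOfProfiledCases"):
    """Return {gene: (legacy_value, new_value)} for genes whose values differ.

    `legacy` and `new` are lists of dicts, one per gene. The key name
    `hugoGeneSymbol` is an assumption about the response shape.
    """
    new_by_gene = {row[key]: row.get(field) for row in new}
    diffs = {}
    for row in legacy:
        gene = row[key]
        other = new_by_gene.get(gene)  # None if the gene is absent from `new`
        if row.get(field) != other:
            diffs[gene] = (row.get(field), other)
    return diffs


# Values taken from the difference reported above.
legacy = [{"hugoGeneSymbol": "TFRC", "numberOfProfiledCases": 13638}]
new = [{"hugoGeneSymbol": "TFRC", "numberOfProfiledCases": 13634}]
print(diff_profiled_counts(legacy, new))  # {'TFRC': (13638, 13634)}
```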

After some initial research, I found 4 samples that are not profiled at all. (I do not know whether it makes sense for a study to contain samples that are not profiled at all.)

SELECT count(DISTINCT sample_id)
FROM sample_profile
  INNER JOIN sample ON sample_profile.sample_id = sample.internal_id
  INNER JOIN patient AS p ON sample.patient_id = p.internal_id
  INNER JOIN cancer_study AS cs ON p.cancer_study_id = cs.cancer_study_id
WHERE cancer_study_identifier = 'genie_public';

Returns 197976

SELECT count(DISTINCT sample_unique_id)
FROM sample_view
WHERE cancer_study_identifier = 'genie_public';

Returns 197976

The query I used to determine which samples were not profiled:

SELECT DISTINCT sample_stable_id
FROM sample_view
WHERE cancer_study_identifier = 'genie_public'
  AND sample_stable_id NOT IN (
    SELECT DISTINCT s.stable_id
    FROM sample_profile sp
      INNER JOIN sample s ON sp.sample_id = s.internal_id
      INNER JOIN patient p ON s.patient_id = p.internal_id
      INNER JOIN cancer_study cs ON p.cancer_study_id = cs.cancer_study_id
    WHERE cs.cancer_study_identifier = 'genie_public'
  );
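The anti-join idea behind the query above can be tried on a toy dataset. The sketch below uses Python's built-in `sqlite3` with a deliberately simplified, hypothetical schema: only `sample` and `sample_profile`, with made-up rows and no patient or study joins. It is not the real cBioPortal or ClickHouse schema. It uses `NOT EXISTS` rather than `NOT IN`, since `NOT IN` returns no rows if the subquery ever yields a NULL. Whether NULLs can occur in the real tables is not known here.

```python
import sqlite3

# Toy, in-memory stand-in for the real tables (hypothetical rows and columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (internal_id INTEGER PRIMARY KEY, stable_id TEXT);
    CREATE TABLE sample_profile (sample_id INTEGER, genetic_profile_id INTEGER);
    INSERT INTO sample VALUES (1, 'S-1'), (2, 'S-2'), (3, 'S-3');
    -- Sample 2 has no row in sample_profile, i.e. it is not profiled at all.
    INSERT INTO sample_profile VALUES (1, 10), (1, 11), (3, 10);
""")

# Anti-join: keep samples for which no sample_profile row exists.
unprofiled = [row[0] for row in conn.execute("""
    SELECT s.stable_id
    FROM sample s
    WHERE NOT EXISTS (
        SELECT 1 FROM sample_profile sp WHERE sp.sample_id = s.internal_id
    )
    ORDER BY s.stable_id
""")]
print(unprofiled)  # ['S-2']
```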

List of samples missing.