Sage-Bionetworks / GENIE-Sponsored-Projects

This repository contains processing code for GENIE sponsored projects used for creating cBioPortal files
MIT License

Fixes for latest PANC files #48

Closed hhunterzinck closed 2 years ago

hhunterzinck commented 2 years ago

Issues with the PANC cBioPortal files, reported by cBioPortal after the latest derived variable code update and mapping update:

  1. data_clinical_patient.txt
     a. ca_resect_status variable missing -> this was an issue with the mapping file, which I can upload. Please confirm that this is not also a scope of release issue.
     b. n_rt_pt variable missing -> confirmed this is in the mapping file, so perhaps a scope of release issue.
  2. data_clinical_supp_survival.txt
     a. We only include survival from the first index cancer diagnosis, but all survival data is being pulled. I think Tom’s code used to filter on tt_first_index_ca, but as that variable has been renamed, I am guessing that is where the problem is.
  3. data_clinical_sample.txt
     a. Sample types (primary vs. metastasis) do not match between the derived variable files and the portal files, mainly for the DFCI patients.
  4. data_timeline_treatment.txt
     a. Event_Type should be Treatment for both ca_drugs and radiation (not Treatment and Radiation Therapy).
     b. In the TREATMENT_TYPE column, Medical Type needs to be converted to Systemic Therapy, and Radiation Therapy needs to be differentiated in this column instead of in the EVENT_TYPE column.
     c. I think there were missing start dates that were removed… should we confirm with the other teams that this data is a noted exception?
hhunterzinck commented 2 years ago
  1. data_clinical_patient.txt
     a. Updated data elements catalog (syn21431364) to include ca_resect_status.
     b. Updated data elements catalog (syn21431364) to include n_rt_pt. It is marked as N in the mapping file (syn25585554) but yes in the SOR (syn22294851) for release in PANC, so the mapping file needs to be corrected.
  2. data_clinical_supp_survival.txt
     a. @thomasyu888 tt_first_index_ca was updated to the new name dob_first_index_ca in the code, but I can't find any explicit filtering step in the code based on this variable. Can you point me to the lines where this filtering is supposed to occur?
  3. data_clinical_sample.txt
     a. Should discuss with QA managers.
  4. data_timeline_treatment.txt
     a. Event_Type is now Treatment for both ca_drugs and radiation (not Treatment and Radiation Therapy).
     b. The TREATMENT_TYPE column is now Systemic Therapy for drugs and Radiation Therapy for radiation.
     c. My understanding was that this was a complex query covered by the stats team.
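To make the 4a/4b change concrete, here is a minimal Python sketch of the remapping, not the code in this repository. It assumes each timeline row is a dict, that radiation rows previously arrived with EVENT_TYPE set to "Radiation Therapy", and that drug rows carry the raw TREATMENT_TYPE value "Medical Type"; the function name and the sample rows are hypothetical.

```python
# Hypothetical raw -> portal labels for TREATMENT_TYPE (assumed values).
TREATMENT_TYPE_MAP = {
    "Medical Type": "Systemic Therapy",
    "Radiation Therapy": "Radiation Therapy",
}


def normalize_treatment_row(row):
    """Return a copy of a timeline row with EVENT_TYPE and TREATMENT_TYPE normalized."""
    out = dict(row)
    if row.get("EVENT_TYPE") == "Radiation Therapy":
        # Radiation used to be distinguished in EVENT_TYPE; move that to TREATMENT_TYPE.
        out["TREATMENT_TYPE"] = "Radiation Therapy"
    else:
        raw = row.get("TREATMENT_TYPE", "")
        # Unknown labels are passed through unchanged.
        out["TREATMENT_TYPE"] = TREATMENT_TYPE_MAP.get(raw, raw)
    # Every row, drug or radiation, gets the same EVENT_TYPE.
    out["EVENT_TYPE"] = "Treatment"
    return out


# Made-up example rows, one drug record and one radiation record.
rows = [
    {"PATIENT_ID": "P1", "EVENT_TYPE": "Treatment", "TREATMENT_TYPE": "Medical Type"},
    {"PATIENT_ID": "P1", "EVENT_TYPE": "Radiation Therapy", "TREATMENT_TYPE": ""},
]
normalized = [normalize_treatment_row(r) for r in rows]
```

After this pass, both rows share EVENT_TYPE "Treatment" and differ only in TREATMENT_TYPE.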
hhunterzinck commented 2 years ago
  2. data_clinical_supp_survival.txt
     a. Filtering occurs at https://github.com/Sage-Bionetworks/GENIE-Sponsored-Projects/blob/9638007730d9a1bf27f929f3a62a9d6ae5944df6/geniesp/bpc_redcap_export_mapping.py#L1060
hhunterzinck commented 2 years ago
  2. data_clinical_supp_survival.txt
     a. Upon further discussion with the cBioPortal team, the problem is not that both index and non-index cancer survival data are being pulled. Rather, they "only want to include the row that has both OS and PFS data" in 'ca_dx_derived.csv' (syn22296816).
hhunterzinck commented 2 years ago
  2. data_clinical_supp_survival.txt
     a. filter(!is.na(PFS_I_ADV_STATUS) & PFS_I_ADV_STATUS != "" | !duplicated(PATIENT_ID))
        Essentially, retain the row if PFS_I_ADV_STATUS contains data OR the row is the first occurrence of its patient_id. Tested on the current survival data file; it produces a unique patient list.
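For readers following along in Python, the same rule can be sketched without dplyr. This is an illustrative translation under stated assumptions, not the change that was merged: rows are dicts, a missing status is None or an empty string, and the function name and sample data are hypothetical.

```python
def filter_survival_rows(rows):
    """Keep a row if PFS_I_ADV_STATUS has data OR it is the first row seen for its patient."""
    seen = set()
    kept = []
    for row in rows:
        pid = row["PATIENT_ID"]
        status = row.get("PFS_I_ADV_STATUS")
        has_pfs = status is not None and status != ""
        # Mirrors R's !duplicated(): true only for the first occurrence of each ID.
        first_for_patient = pid not in seen
        seen.add(pid)
        if has_pfs or first_for_patient:
            kept.append(row)
    return kept


# Made-up rows covering three cases.
rows = [
    {"PATIENT_ID": "P1", "PFS_I_ADV_STATUS": "1"},   # kept: has PFS data
    {"PATIENT_ID": "P1", "PFS_I_ADV_STATUS": ""},    # dropped: no PFS data, not first
    {"PATIENT_ID": "P2", "PFS_I_ADV_STATUS": None},  # kept: first row for P2
    {"PATIENT_ID": "P3", "PFS_I_ADV_STATUS": ""},    # kept: first row for P3
    {"PATIENT_ID": "P3", "PFS_I_ADV_STATUS": "0"},   # kept: has PFS data
]
kept = filter_survival_rows(rows)
kept_ids = [r["PATIENT_ID"] for r in kept]
```

The P3 case shows a caveat: if a patient's first row lacks PFS data and a later row has it, both rows survive, so the rule alone does not guarantee one row per patient. The comment above reports that the current survival file does come out unique.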
hhunterzinck commented 2 years ago
  3. data_clinical_sample.txt
     a. Sage will change the values in the tables to be all text and add an upload QA check to flag any non-text values in the cpt_sample_type column.
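One way such an upload check could look is sketched below in Python. It is an assumption-laden illustration, not the QA code Sage uses: "non-text" is taken to mean a value that is a number or parses as one (for example a numeric code left in place of a label), missing values are left to other checks, and the function names and sample labels are hypothetical.

```python
def is_non_text(value):
    """Return True if the value is numeric or a string that parses as a number."""
    if value is None:
        return False  # assumption: missingness is handled by a separate check
    if isinstance(value, (int, float)):
        return True
    try:
        float(str(value).strip())
        return True
    except ValueError:
        return False


def flag_non_text_sample_types(rows, column="cpt_sample_type"):
    """Return (row_index, value) pairs for every non-text value in the column."""
    return [
        (i, row.get(column))
        for i, row in enumerate(rows)
        if is_non_text(row.get(column))
    ]


# Made-up upload rows; the text labels are placeholders.
rows = [
    {"cpt_sample_type": "Primary tumor"},
    {"cpt_sample_type": "3"},
    {"cpt_sample_type": 2},
    {"cpt_sample_type": "Metastasis site unspecified"},
]
flagged = flag_non_text_sample_types(rows)
```

Here the rows at index 1 and 2 are flagged, while the two text labels pass.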
thomasyu888 commented 2 years ago

@hhunterzinck please take a look at this commit: https://github.com/Sage-Bionetworks/GENIE-Sponsored-Projects/pull/50/files#diff-e9116b223fd4a63dc4a0c3c0cbfa2baccb892d5d0941ec0f5a4a2aa73b51041bR1068-R1075 for 2a.