cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

TCGA Disease Free Survival Issue #3101

Closed ecerami closed 6 years ago

ecerami commented 6 years ago

Reported by end user:

I’m trying to reproduce your disease free survival plots but am unclear how your pipelines select from multiple relapse events for the same patient. For example, patient TCGA-CU-A72E has two new tumor event times: 256 days and 364 days. cBioPortal selects 364 days. Isn’t 256 more correct if we are trying to measure disease free survival?

More info from user:

Here is a link to gdc data portal from which you can download clinical metadata for TCGA-CU-A72E: https://portal.gdc.cancer.gov/files/4f3ae24f-eecd-4ba3-a592-cb9af99ed6e2

Download the file, decompress it, open it, and go to the follow-ups section. Under follow-ups you’ll find two new tumor event entries (256 and 364).

The BLCA GDAC Firehose merged clinical files also show these two events.

Notes from Ethan: Reached out to Ben, who confirmed that this calculation is done within one of the MSK pipelines (that is not currently on github). Also confirmed with @schultzn that we should be using the first event, not the second.

n1zea144 commented 6 years ago

Our pipeline uses the biotab (tab-delimited) version of the clinical data, not the XML flavor that the user references. At first glance I thought there was a conflict between the clinical XML file in the GDC Data Portal that the user referenced and the data found in the legacy archive that we used when we last converted TCGA data. After an email exchange with someone at the GDC, it turns out I was looking in the wrong place.

The value of DFS_MONTHS that we computed (11.96) is coming from the new_tumor_event_dx_days_to (364) from this file:

legacy archive nationwidechildrens.org_clinical_follow_up_v4.0_nte_blca.txt
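
For reference, the 11.96 figure is consistent with dividing the day count by an average month length. The sketch below is only an illustration: the constant 365.25 / 12 is an assumption (the divisor actually used by the pipeline is not stated in this thread), and days_to_months is a hypothetical helper name.

```python
# Assumed average month length; the pipeline's real divisor is not given in this issue.
DAYS_PER_MONTH = 365.25 / 12


def days_to_months(days: int) -> float:
    """Convert a day count to months, rounded to two decimals."""
    return round(days / DAYS_PER_MONTH, 2)


print(days_to_months(364))  # 11.96, the DFS_MONTHS value currently shown
print(days_to_months(256))  # 8.41, what the earlier event would give
```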

After closer inspection of this file, it looks like there are multiple records for the same patient, and we are in fact missing an additional record which contains new_tumor_event_dx_days_to = 256 days. The TCGA converter will have to be updated to account for multiple recurrent events and choose the proper (most recent after treatment) one. As an extra level of complexity, there is a new tumor event type with values such as "Distant Metastasis", "Locoregional", and "New Primary Tumor", which perhaps should be considered when choosing the correct new_tumor_event_dx_days_to. If this is true, we should assess the values of new_tumor_event_type across all studies and determine a ranking.
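
A minimal sketch of handling multiple records per patient, following the note in the issue description that the first event should be used. This is not the MSK converter code, which is not public. The patient column name bcr_patient_barcode, the function name, the second patient ID, and the handling of non-numeric placeholders are all assumptions; only new_tumor_event_dx_days_to and the TCGA-CU-A72E values come from this thread. Event type is ignored here.

```python
PATIENT_COL = "bcr_patient_barcode"       # assumed column name
DAYS_COL = "new_tumor_event_dx_days_to"   # column named in this issue


def earliest_event_days(rows):
    """Return {patient_id: smallest days-to-new-tumor-event} over all records."""
    earliest = {}
    for row in rows:
        try:
            days = int(row.get(DAYS_COL, ""))
        except (TypeError, ValueError):
            # Skip records without a numeric value (assumed placeholder text).
            continue
        patient = row[PATIENT_COL]
        if patient not in earliest or days < earliest[patient]:
            earliest[patient] = days
    return earliest


rows = [
    {PATIENT_COL: "TCGA-CU-A72E", DAYS_COL: "364"},
    {PATIENT_COL: "TCGA-CU-A72E", DAYS_COL: "256"},
    # Hypothetical patient with no usable value.
    {PATIENT_COL: "TCGA-XX-0000", DAYS_COL: "[Not Available]"},
]
print(earliest_event_days(rows))  # {'TCGA-CU-A72E': 256}
```

If a ranking over new_tumor_event_type is adopted later, the comparison inside the loop is the place where it would apply.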

With that said, we are going to have to prioritize addressing this issue, which includes:

Below are some URLs for convenient/future reference.

TCGA BLCA files on GDC legacy archive.

Clinical XML file on Legacy Archive

Clinical XML file on GDC Data Portal

These XML files are identical.

n1zea144 commented 6 years ago

Moving this issue to datahub for triage.