OmnesRes / onco_lnc

MIT License

Patient Vital Status #1

Closed OmnesRes closed 8 years ago

OmnesRes commented 8 years ago

It was brought to my attention today that some patients in OncoLnc have their vital status listed incorrectly.

I have confirmed the issue. For most cancers and patients, when a patient is dead their 'last_contact_days_to' value is listed as '[Not Applicable]'. However, for some cancers and a fraction of patients, a number is listed there as well as in the 'death_days_to' column. The current code therefore incorrectly treats these patients as alive, and their survival time may also be incorrect.

This possibly affects cancers CESC, COAD, GBM, HNSC, KIRC, LGG, LUAD, LUSC, OV, READ, SKCM, and STAD.

For most of these cancers the problem only affects a very small number of patients, or possibly only a single patient, and it is unlikely the values for the Cox regressions will see much change. However, the issue is widespread for GBM and OV and these cancers will see large changes in their values once this is fixed.
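To gauge how many patients in a clinical file are affected, one could count the rows where both columns hold a number. This is only a minimal sketch: the function name `find_conflicting_rows` is hypothetical, the `bcr_patient_barcode` column name and the single header row are assumptions about the file layout, and the sample rows are made up rather than taken from TCGA.

```python
import csv
import io
import re

NUMERIC = re.compile(r'^[0-9]+$')

def find_conflicting_rows(handle):
    """Return patient IDs whose row has a number in both day columns."""
    # Assumption: tab-delimited text with exactly one header row.
    reader = csv.DictReader(handle, delimiter='\t')
    conflicts = []
    for row in reader:
        alive = row.get('last_contact_days_to') or ''
        dead = row.get('death_days_to') or ''
        if NUMERIC.match(alive) and NUMERIC.match(dead):
            # Assumption: the patient ID column is 'bcr_patient_barcode'.
            conflicts.append(row['bcr_patient_barcode'])
    return conflicts

# Made-up rows for illustration only; not real TCGA data.
sample = (
    "bcr_patient_barcode\tlast_contact_days_to\tdeath_days_to\n"
    "PATIENT-A\t[Not Applicable]\t350\n"
    "PATIENT-B\t120\t400\n"
    "PATIENT-C\t900\t[Not Applicable]\n"
)
print(find_conflicting_rows(io.StringIO(sample)))  # ['PATIENT-B']
```

Only PATIENT-B is reported, because it is the only row with a number in both columns.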

This seems to be a simple fix: 'death_days_to' needs to be given precedence over 'last_contact_days_to'.

So for every 'cox_regression.py', when code like this is present:

if re.search('^[0-9]+$',i[alive_column]):
    clinical1[-1]=[i[patient_column],int(i[alive_column]),'Alive']
elif re.search('^[0-9]+$',i[death_column]):
    clinical1[-1]=[i[patient_column],int(i[death_column]),'Dead']
else:
    pass

the death_column condition should be checked first.
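As a rough sketch of the reordered check, the same two tests can be swapped so that a numeric death value wins. The wrapper function `vital_status` and the sample row are hypothetical and only serve to make the snippet runnable on its own; the actual scripts assign into `clinical1` inline.

```python
import re

def vital_status(i, patient_column, alive_column, death_column):
    """Return [patient, days, status], checking the death column first."""
    if re.search('^[0-9]+$', i[death_column]):
        return [i[patient_column], int(i[death_column]), 'Dead']
    elif re.search('^[0-9]+$', i[alive_column]):
        return [i[patient_column], int(i[alive_column]), 'Alive']
    else:
        # Neither column holds a number; the row carries no usable time.
        return None

# Made-up row with a number in both columns (patient, last contact, death).
row = ['PATIENT-B', '120', '400']
print(vital_status(row, 0, 1, 2))  # ['PATIENT-B', 400, 'Dead']
```

With the original ordering, the same row would have been recorded as 'Alive' with a time of 120.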

I am implementing this change and rerunning the Cox regressions. Once I am confident the values have been fixed, I will update every affected table in the repository and rebuild the SQLite database of OncoLnc.

Note: this issue is limited to OncoLnc and does not affect the publication "A pan-cancer analysis of prognostic genes" or the repository https://github.com/OmnesRes/pan_cancer.

OmnesRes commented 8 years ago

I have pushed a commit that fixes all affected cancers. I still need to update the information in the web application.

OmnesRes commented 8 years ago

I have updated all affected Excel files available at http://www.oncolnc.org/download/ and rebuilt the SQL database for the web app.

aditiq commented 8 years ago

Hi,

I think there are still issues with the survival status. I checked the STAD database, and for many of the samples the overall survival status reads "Dead" even though they are clearly stated as living in the clinical sheet from TCGA, e.g. TCGA-BR-8367. Can you please explain this? Thanks!

OmnesRes commented 8 years ago

Patients can be listed multiple times in the clinical files. This patient is listed in row 126 and row 127. In row 126 they are listed as "Alive", but in row 127 they are listed as "Dead". I use the clinical information listed in the lower row since it is more up to date.

aditiq commented 8 years ago

Hi - thanks for replying. Which file are you looking at exactly? I can't trace this back.

OmnesRes commented 8 years ago

https://github.com/OmnesRes/onco_lnc/blob/master/tcga_data/STAD/clinical/nationwidechildrens.org_clinical_follow_up_v1.0_stad.txt

OmnesRes commented 8 years ago

I think you were originally looking at this file: https://github.com/OmnesRes/onco_lnc/blob/master/tcga_data/STAD/clinical/nationwidechildrens.org_clinical_patient_stad.txt

I pull clinical data from multiple files and keep the most recent data.
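The idea of letting lower rows and more recent files win can be sketched roughly as follows. This is not the code from the repository: the function names are hypothetical, the records are made up, and treating a longer follow-up time as "more recent" when two files disagree is an assumption made for illustration.

```python
def latest_per_patient(rows):
    """Keep one record per patient; a lower row replaces an earlier one."""
    latest = {}
    for patient, days, status in rows:
        latest[patient] = (days, status)
    return latest

def merge_sources(patient_records, follow_up_records):
    """Combine two per-patient dicts, preferring the longer follow-up."""
    merged = dict(patient_records)
    for patient, record in follow_up_records.items():
        # Assumption: a larger day count reflects more recent information.
        if patient not in merged or record[0] >= merged[patient][0]:
            merged[patient] = record
    return merged

# Made-up records (patient, days, status); not real TCGA data.
patient_rows = [('PATIENT-X', 400, 'Alive'), ('PATIENT-Y', 250, 'Dead')]
follow_up_rows = [('PATIENT-X', 600, 'Alive'), ('PATIENT-X', 800, 'Dead')]

merged = merge_sources(latest_per_patient(patient_rows),
                       latest_per_patient(follow_up_rows))
print(merged)
```

Here PATIENT-X ends up as dead at 800 days, because the lower follow-up row replaces both the earlier follow-up row and the entry from the patient file, while PATIENT-Y keeps the only record available.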

aditiq commented 8 years ago

I am still not clear about this. Can you please clarify where you downloaded the file "nationwidechildrens.org_clinical_follow_up_v1.0_stad.txt" from? Or did you generate it yourself? I am confused because if I download the data from cBioPortal and/or query this ID in the GDC data portal (https://gdc-portal.nci.nih.gov/cases/b12d9857-9ae1-445b-a963-b630b27b254e), both of them show the status as ALIVE. Is there some other place that you are getting the most recent data from? Thanks!

OmnesRes commented 8 years ago

I downloaded it directly from the TCGA (https://tcga-data.nci.nih.gov/docs/publications/tcga/?) in Jan. 2016. The data appears to have been moved. I'm not sure where to get the file now.

aditiq commented 8 years ago

Ok. I see what might be happening here. In the clinical sheet from cBioPortal, as well as in the clinical sheet I download from GDC, the status is ALIVE and days to last follow-up is 418, but there seems to have been an update in the clinical information in which the patient was recorded as dead after 801 days. Quite weird that cBioPortal is not updating the clinical information. Also, I think you should trace back the files. All of them have been moved to https://gdc.nci.nih.gov/.
Thanks for your answers.

OmnesRes commented 8 years ago

I was unaware that the TCGA moved the files. I'll look into it.

aditiq commented 8 years ago

Hi -- Sorry to start this again, but even now there are some samples that don't match up. E.g., TCGA-F1-6177 is reported as "ALIVE" in OncoLnc; however, the patient is listed as dead in both the new portal and the clinical sheet from the STAD publication portal (https://tcga-data.nci.nih.gov/docs/publications/stad_2014/20140110_STAD_Clinical_Data_Blacklisted_Cases_Removed.xlsx).

OmnesRes commented 8 years ago

The data OncoLnc uses was downloaded on Jan. 5 and 6, 2016. For this patient, the only available clinical data was in this file: https://github.com/OmnesRes/onco_lnc/blob/master/tcga_data/STAD/clinical/nationwidechildrens.org_clinical_patient_stad.txt

I currently don't have a plan for when new clinical data will be downloaded and incorporated into OncoLnc. OncoLnc is meant as an exploration tool. If you have a cancer of interest and have downloaded the most recent clinical data you should use that data. The expression data in OncoLnc should still be valid and up to date.