cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Retain Metastatic Tumors #46

Closed gwaybio closed 6 years ago

gwaybio commented 6 years ago

Currently (in Cell 11 of 2.TCGA-process.ipynb), we retain only Primary Solid Tumor and Primary Blood Derived Cancer - Peripheral Blood. In #44 it was determined that 389 samples (with mutation and gene expression data) were missing clinical annotations. It is likely that many of these samples were removed from the clinical matrix by Cell 11.

We should consider adding Metastatic to Cell 11.
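A minimal sketch of what that change could look like, assuming the clinical matrix is a pandas DataFrame with a `sample_type` column. The names `clinmat_df`, `sample_id`, `sample_type`, and `retained_types` are illustrative assumptions, not necessarily what Cell 11 uses:

```python
import pandas as pd

# Toy clinical matrix; the column names here are assumptions for illustration.
clinmat_df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "sample_type": [
        "Primary Solid Tumor",
        "Solid Tissue Normal",
        "Metastatic",
        "Primary Blood Derived Cancer - Peripheral Blood",
    ],
})

# The two sample types currently retained, plus Metastatic.
retained_types = {
    "Primary Solid Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
    "Metastatic",
}

# Keep only rows whose sample type is in the retained set.
filtered_df = clinmat_df[clinmat_df["sample_type"].isin(retained_types)]
```

With the toy data, the Solid Tissue Normal row is dropped and the other three rows are kept.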

dhimmel commented 6 years ago

Here's the code in question:

https://github.com/cognoma/cancer-data/blob/93e4c53dc3d58df4cf52d1a40179d62ccbc0b985/scripts/2.TCGA-process.py#L165-L171

If some patients have multiple samples, do we want to include all of them? Or should we only include Metastatic if that's a patient's sole assayed tumor? Given that there are not a large number of metastatic cancers, I don't think we'll encounter many issues from breaking the independence-of-observations assumption of many classifiers.

dhimmel commented 6 years ago

Here are the total counts from the notebook output (cell 9):

Primary Solid Tumor                                10517
Solid Tissue Normal                                 1413
Metastatic                                           395
Primary Blood Derived Cancer - Peripheral Blood      200
Recurrent Solid Tumor                                 55
Additional - New Primary                              10
Additional Metastatic                                  1

Recurrent Solid Tumor may also be something worth including?

gwaybio commented 6 years ago

I looked into this issue in a bit more detail. It looks like there are 395 total Metastatic tumors in the dataset with the acronym distribution:

SKCM    368
THCA      8
BRCA      7
HNSC      2
PCPG      2
CESC      2
SARC      1
COAD      1
ESCA      1
PAAD      1
PRAD      1
BLCA      1

Of these 395 tumors, 33 (8%) also have primary tumor info. The acronym distribution for these duplicate samples is:

THCA    8
BRCA    7
SKCM    6
PCPG    2
CESC    2
HNSC    2
BLCA    1
SARC    1
PAAD    1
PRAD    1
ESCA    1
COAD    1

Therefore, after removing these duplicate Metastatic tumors (and retaining Metastatic lesions without primary), the acronym distribution is:

SKCM    362

So, I believe we are doing a disservice by removing all metastatic tumors, and this is likely to be a quick fix. We can remove the 33 duplicate samples if we are worried about non-independence, although keeping them probably wouldn't impact the classifiers much.
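For reference, a rough sketch of the duplicate-removal step described above, run on toy data. The column names (`patient_id`, `sample_type`, `acronym`), the sample IDs, and the helper `drop_duplicated_non_primary` are all hypothetical, chosen only to illustrate the logic:

```python
import pandas as pd

def drop_duplicated_non_primary(df, extra_type="Metastatic",
                                primary_type="Primary Solid Tumor"):
    """Drop `extra_type` samples whose patient also has a `primary_type` sample."""
    # Patients that have at least one primary tumor sample.
    patients_with_primary = set(
        df.loc[df["sample_type"] == primary_type, "patient_id"]
    )
    # An extra-type sample is a duplicate if its patient also has a primary.
    is_duplicate = (
        (df["sample_type"] == extra_type)
        & df["patient_id"].isin(patients_with_primary)
    )
    return df[~is_duplicate]

# Toy data: patient A has both a primary and a metastatic sample.
samples_df = pd.DataFrame({
    "sample_id": ["A-01", "A-06", "B-06", "C-01"],
    "patient_id": ["A", "A", "B", "C"],
    "sample_type": ["Primary Solid Tumor", "Metastatic",
                    "Metastatic", "Primary Solid Tumor"],
    "acronym": ["THCA", "THCA", "SKCM", "BRCA"],
})

deduplicated_df = drop_duplicated_non_primary(samples_df)

# Acronym distribution of the metastatic samples that remain.
metastatic_counts = deduplicated_df.loc[
    deduplicated_df["sample_type"] == "Metastatic", "acronym"
].value_counts()
```

The same helper could be pointed at another sample type by passing a different `extra_type`.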

gwaybio commented 6 years ago

@dhimmel - I can go ahead and add these quick lines if you approve

dhimmel commented 6 years ago

Okay. I'm not sure these metastatic SKCM tumors have mutation or expression data, but I agree that they should be included if they do.
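One way to check this might be to intersect the metastatic sample IDs with the indexes of the mutation and expression matrices. This is only a sketch on toy data; `metastatic_ids`, `mutation_df`, and `expression_df` are hypothetical names, and it assumes both matrices are indexed by sample ID:

```python
import pandas as pd

# Hypothetical IDs of metastatic samples from the clinical matrix.
metastatic_ids = pd.Index(["S1", "S2", "S3"])

# Toy sample-by-gene matrices, indexed by sample ID (an assumption).
mutation_df = pd.DataFrame({"TP53": [0, 1]}, index=["S1", "S3"])
expression_df = pd.DataFrame({"TP53": [5.1, 2.3]}, index=["S3", "S4"])

# Metastatic samples present in both the mutation and expression data.
with_both = (
    metastatic_ids
    .intersection(mutation_df.index)
    .intersection(expression_df.index)
)
```

In the toy data only `S3` appears in all three, so it is the only sample that would be kept.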

gwaybio commented 6 years ago

> Recurrent Solid Tumor may also be something worth including?

There are 55 Recurrent tumors and the acronym distribution is:

OV      18
LGG     14
GBM     13
SARC     3
LUAD     2
LIHC     2
READ     1
COAD     1
UCEC     1

After removing duplicate Recurrent tumors, the remaining samples are:

OV    2

So two Ovarian tumors are retained. Perhaps we should be consistent and also keep these two (at least if they also have gene expression and mutation data).