Closed gwaybio closed 6 years ago
Here's the code in question:
If some patients have multiple samples, do we want to include all of them? Or should we only include Metastatic if that's a patient's sole assayed tumor? Given that there is not a large number of metastatic cancers, I don't think we'll encounter many issues from breaking the independence-of-observations assumption of many classifiers.
Here are the total counts from the notebook output (cell 9):
Primary Solid Tumor 10517
Solid Tissue Normal 1413
Metastatic 395
Primary Blood Derived Cancer - Peripheral Blood 200
Recurrent Solid Tumor 55
Additional - New Primary 10
Additional Metastatic 1
Recurrent Solid Tumor may also be something worth including?
I looked into this issue in a bit more detail. It looks like there are 395 total Metastatic tumors in the dataset, with the following acronym distribution:
SKCM 368
THCA 8
BRCA 7
HNSC 2
PCPG 2
CESC 2
SARC 1
COAD 1
ESCA 1
PAAD 1
PRAD 1
BLCA 1
Of these 395 tumors, 33 (8%) also have primary tumor info. The acronym distribution for these duplicate samples is:
THCA 8
BRCA 7
SKCM 6
PCPG 2
CESC 2
HNSC 2
BLCA 1
SARC 1
PAAD 1
PRAD 1
ESCA 1
COAD 1
Therefore, after removing these duplicate Metastatic tumors (and retaining Metastatic lesions without primary), the acronym distribution is:
SKCM 362
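The duplicate-removal rule described here could be sketched roughly as follows. This is a minimal plain-Python illustration, not the notebook's actual code: the field names (`patient_id`, `sample_type`, `acronym`), the helper `drop_duplicate_metastatic`, and the toy records are all hypothetical, and "has primary tumor info" is assumed to mean the patient also has a Primary Solid Tumor sample.

```python
from collections import Counter

# Hypothetical toy records; field names and values are assumptions for illustration.
samples = [
    {"patient_id": "P1", "sample_type": "Primary Solid Tumor", "acronym": "SKCM"},
    {"patient_id": "P1", "sample_type": "Metastatic", "acronym": "SKCM"},
    {"patient_id": "P2", "sample_type": "Metastatic", "acronym": "SKCM"},
    {"patient_id": "P3", "sample_type": "Primary Solid Tumor", "acronym": "BRCA"},
    {"patient_id": "P3", "sample_type": "Metastatic", "acronym": "BRCA"},
]


def drop_duplicate_metastatic(records):
    """Keep a Metastatic sample only if its patient has no primary tumor sample."""
    patients_with_primary = {
        r["patient_id"]
        for r in records
        if r["sample_type"] == "Primary Solid Tumor"
    }
    return [
        r
        for r in records
        if not (
            r["sample_type"] == "Metastatic"
            and r["patient_id"] in patients_with_primary
        )
    ]


kept = drop_duplicate_metastatic(samples)

# Acronym distribution of the Metastatic samples that survive the filter.
metastatic_by_acronym = Counter(
    r["acronym"] for r in kept if r["sample_type"] == "Metastatic"
)
print(metastatic_by_acronym)
```

In the toy data only patient P2's metastatic sample survives, mirroring how only SKCM metastatic lesions without a matching primary remain in the real counts.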
So, I believe we are doing a disservice by removing all metastatic tumors, and this is likely to be a quick fix. We can remove the 33 duplicate samples if we are worried about non-independence, although it probably wouldn't impact the classifier much.
@dhimmel - I can go ahead and add these quick lines if you approve
Okay, I'm not sure whether these metastatic SKCM tumors have mutation or expression data, but I agree that they should be included if they do.
Recurrent Solid Tumor may also be something worth including?
There are 55 Recurrent tumors and the acronym distribution is:
OV 18
LGG 14
GBM 13
SARC 3
LUAD 2
LIHC 2
READ 1
COAD 1
UCEC 1
After removing duplicate Recurrent tumors, the remaining samples are:
OV 2
So two Ovarian tumors are retained - perhaps we should be consistent and also keep these two (at least if they also have gene expression + mutation data)
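One way to check the "also have gene expression + mutation data" condition is a set intersection over sample identifiers. The sketch below is only illustrative: the helper `samples_with_complete_data` and all sample IDs are hypothetical, and it assumes each dataset can be reduced to a collection of sample identifiers.

```python
def samples_with_complete_data(candidate_ids, expression_ids, mutation_ids):
    """Return the candidates present in both the expression and mutation data."""
    return sorted(set(candidate_ids) & set(expression_ids) & set(mutation_ids))


# Hypothetical sample identifiers, for illustration only.
recurrent_ids = ["S-OV-1", "S-OV-2"]
expression_ids = ["S-OV-1", "S-OV-2", "S-X-9"]
mutation_ids = ["S-OV-2", "S-X-9"]

usable = samples_with_complete_data(recurrent_ids, expression_ids, mutation_ids)
print(usable)
```

Here only `S-OV-2` appears in both datasets, so it would be the only recurrent sample worth retaining under this rule.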
Currently (in Cell 11 of 2.TCGA-process.ipynb), we retain only Primary Solid Tumor and Primary Blood Derived Cancer - Peripheral Blood. In #44 it was determined that 389 samples (with mutation and gene expression data) were missing clinical annotations. It is likely that many of these samples were removed from the clinical matrix by Cell 11. We should consider adding Metastatic to Cell 11.
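The proposed change amounts to widening the set of retained sample types. The following is a rough plain-Python sketch under stated assumptions, not the contents of Cell 11: the `sample_type` field, the helper `filter_by_sample_type`, and the toy records are hypothetical, and the notebook itself may well express this filter differently.

```python
# Sample types currently retained, per the description of Cell 11.
RETAINED_TYPES = {
    "Primary Solid Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}


def filter_by_sample_type(records, retained_types):
    """Keep only the records whose sample type is in the retained set."""
    return [r for r in records if r["sample_type"] in retained_types]


# Hypothetical toy records, for illustration only.
samples = [
    {"sample_id": "A", "sample_type": "Primary Solid Tumor"},
    {"sample_id": "B", "sample_type": "Metastatic"},
    {"sample_id": "C", "sample_type": "Solid Tissue Normal"},
    {"sample_id": "D", "sample_type": "Primary Blood Derived Cancer - Peripheral Blood"},
]

current = filter_by_sample_type(samples, RETAINED_TYPES)
proposed = filter_by_sample_type(samples, RETAINED_TYPES | {"Metastatic"})

print([r["sample_id"] for r in current])
print([r["sample_id"] for r in proposed])
```

Keeping the retained types in one named set makes it a one-line change to add or drop a category later.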