broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons

need to cross-check samples for consistency #83

Open noblem opened 6 years ago

noblem commented 6 years ago

Today, while exploring the UCEC CPTAC3 genomic data, I noticed that the GDCtools-generated sample reports list fewer clinical samples than the number of samples with at least 1 annotation in the loadfiles (177 clinical samples vs. 180 lines in the loadfile). It turns out that some samples have molecular data but no clinical data. For example, the UCEC patient case C3L-00084 was "stopped" (in CPTAC terminology), which AFAIK is equivalent to being redacted, but its molecular data were not removed from the DCC.

GDCtools can easily flag this situation and raise awareness.
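A minimal sketch of what such a cross-check might look like, not GDCtools code. The function name `samples_missing_clinical` is hypothetical, and it assumes sample IDs follow the `<program>-<cohort>-<case>-<sample type>` layout seen in the loadfile (e.g. `CPTAC3-UCEC-C3L-00084-TP`) and that the clinical case IDs are available as plain strings such as `C3L-00084`.

```python
def samples_missing_clinical(loadfile_sample_ids, clinical_case_ids):
    """Return the loadfile sample IDs whose case has no clinical record.

    Hypothetical helper: assumes IDs look like
    <program>-<cohort>-<case>-<sample type>, where the case part
    may itself contain hyphens (e.g. C3L-00084).
    """
    clinical = set(clinical_case_ids)
    missing = []
    for sample_id in loadfile_sample_ids:
        parts = sample_id.split("-")
        # Drop the program and cohort prefix and the sample-type suffix.
        case_id = "-".join(parts[2:-1])
        if case_id not in clinical:
            missing.append(sample_id)
    return missing
```

Any non-empty result could be written to the sample report as a warning, so that cases with molecular data but no clinical data are visible instead of silently changing the counts.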

noblem commented 6 years ago

For completeness, all 3 UCEC samples that exhibit this issue (C3L-00084 plus 2 others) are listed below:

cut -f1,7 loadfiles/google/CPTAC3/latest/CPTAC3.Sample.loadfile.txt | grep NULL
CPTAC3-UCEC-C3L-00084-TP        gs://broad-institute-gdac/GDAC_FC_NULL
CPTAC3-UCEC-C3L-00930-TP        gs://broad-institute-gdac/GDAC_FC_NULL
CPTAC3-UCEC-C3L-01284-TP        gs://broad-institute-gdac/GDAC_FC_NULL
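The shell pipeline above could be reproduced in Python roughly as follows, which may be convenient if the check is ever folded into GDCtools itself. This is only a sketch: `rows_with_null` is a hypothetical name, and the code assumes the loadfile is tab-delimited and that `NULL` appearing in column 1 or 7 is the marker of interest, exactly as in the `cut | grep` command.

```python
import csv


def rows_with_null(lines, columns=(0, 6), marker="NULL"):
    """Mimic `cut -f1,7 | grep NULL` on an iterable of tab-delimited lines.

    Hypothetical helper: keeps only the requested (0-based) columns and
    returns the rows in which any kept field contains the marker.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = []
    for row in reader:
        picked = [row[i] for i in columns if i < len(row)]
        if any(marker in field for field in picked):
            hits.append(picked)
    return hits
```

An open file handle can be passed directly as `lines`, for example `with open(path) as fh: rows_with_null(fh)`.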