broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons

loadfiles need to contain unadulterated case id #88

Open noblem opened 5 years ago

noblem commented 5 years ago

At present, the loadfiles generated by GDCtools do not give the case_id or submitter_id associated with each row in the file. Granted, those can be inferred pretty easily for TCGA samples, but not everyone will know the right way to do so (think about newcomers who don't know TCGA history). Nor is this kind of "guessing" a robust strategy, because the inference could in principle be different for each new data program at the GDC (we should assume it will be, and guard against that in the code). Therefore we should include a way of instantly and unambiguously mapping each row in a loadfile back to the identifiers it came with from the GDC: either the case_id proper (which is a UUID), or the submitter_id (which is akin to the TCGA participant barcode), or both?
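To make the request concrete, here is a rough sketch of what carrying both identifiers through to the loadfile might look like. It is not taken from the GDCtools code: the function names, the column names (`sample_id`, `case_id`, `case_submitter_id`, `file_path`) and the row layout are all hypothetical. The only thing assumed about the GDC is that each file's metadata has a `cases` list whose entries include `case_id` and `submitter_id`.

```python
import csv
import io


def case_identifiers(file_meta):
    """Return (case_id, submitter_id) exactly as given in the metadata.

    Nothing is inferred from barcodes or file names. `file_meta` is assumed
    to be a dict with a `cases` list, each entry holding `case_id` and
    `submitter_id` (an assumption about the metadata shape).
    """
    cases = file_meta.get("cases") or []
    if len(cases) != 1:
        # Refuse to guess when the mapping back to a case is ambiguous.
        raise ValueError("expected exactly one case, found %d" % len(cases))
    case = cases[0]
    return case["case_id"], case["submitter_id"]


def write_loadfile(rows, out):
    """Write a tab-separated loadfile that includes both case identifiers.

    `rows` is an iterable of (sample_id, file_meta, file_path) tuples.
    The column names below are hypothetical, not the current GDCtools layout.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample_id", "case_id", "case_submitter_id", "file_path"])
    for sample_id, file_meta, file_path in rows:
        case_id, submitter_id = case_identifiers(file_meta)
        writer.writerow([sample_id, case_id, submitter_id, file_path])


if __name__ == "__main__":
    # Made-up placeholder values, not real GDC identifiers.
    meta = {
        "cases": [
            {
                "case_id": "00000000-0000-0000-0000-000000000001",
                "submitter_id": "PROG-AB-0001",
            }
        ]
    }
    buf = io.StringIO()
    write_loadfile([("PROG-AB-0001-01", meta, "/data/example.txt")], buf)
    print(buf.getvalue(), end="")
```

Emitting both columns would sidestep the "either or both?" question at the cost of one extra field per row, and raising on anything other than exactly one case keeps the code from silently guessing for data programs whose conventions differ from TCGA's.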