broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons

sample counts table needs to include project in cohort name #54

Closed noblem closed 6 years ago

noblem commented 6 years ago

Presently the sample counts files such as

/xchip/gdac_data/gdc/dice/TCGA/metadata/sample_counts.2017_08_09.tsv

do not indicate which project they originate from. It would be helpful for the cohort names (first column) in each of these files to reflect the project as well (e.g. COAD-TP becomes TCGA-COAD-TP). This is how the cohorts are loaded into our workspaces, operated upon by our tasks, and reported in our dashboards, and it also globally disambiguates each cohort from all others.
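
For illustration, here is a minimal sketch of the requested renaming. It is not the gdctools implementation. It assumes the counts file is tab-separated, that its first row is a header, and that every other row starts with a cohort name. The function name `prefix_cohorts` and the sample column names are hypothetical.

```python
import csv
import io


def prefix_cohorts(tsv_text, project):
    """Return tsv_text with the project prepended to each cohort name.

    Assumptions (not confirmed by the gdctools source): the first row is a
    header, and the first column of every other row is a cohort name.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    prefix = project + "-"
    for i, row in enumerate(reader):
        # Leave the header untouched; skip names that already carry the prefix.
        if i > 0 and row and not row[0].startswith(prefix):
            row[0] = prefix + row[0]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-column table, for demonstration only.
example = "Cohort\tBCR\nCOAD-TP\t10\nTCGA-READ-TP\t5\n"
print(prefix_cohorts(example, "TCGA"))
```

The prefix check keeps the function safe to run twice on the same file. If the real files contain a summary row (such as a totals line), that row would need to be excluded from the renaming.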

noblem commented 6 years ago

David, I gave this to you because I vaguely recall you having spelunked in this part of the code recently. But if you're not comfortable doing this, I will.

dheiman commented 6 years ago

I heavily refactored this part of the code (consolidated all counting and count file generation to the dicer, and fixed bugs in aggregate counting). I'm about to test my update.