While adding support for new kinds of CPTAC data, and for the new-format SEG files in GDC data release 11, it's become clear that the current approach to dicing needs improvement, in particular the way the "convert" function is identified & called during dicing:
it's fragile that gdc_dice fails entirely & aborts when it doesn't recognize a data type; instead it should just skip the file and continue to process all the other files that it can
it's also fragile that assumptions are embedded in tcga_id() and related functions, e.g. that certain files can contain only 1 sample/case (which is not true: as we learned in GDC data release 11, Biotab data can contain multiple IDs attached to a given annotation, etc.)
when supporting new data types/formats, the chain of "what needs to be touched, in what order" is not made clear in the code or docs (i.e. the annotation table, the gdc_dice code, the lib/convert functions, etc)
it's possible that we can reduce the # of layers in how these converters are identified & called
the dialect parameter of the convert routines is not passed from gdc_dice to the converter at run time
the dialect could probably be inferred from either the file extension OR by inspecting the content
(e.g. this is how "new" seg files from GDC are identified, because as of data release 11 the "Sample" column is now named "GDC_Aliquot" ... but for everything else the converter is the same, so it's simpler to make the converters smarter than to add entries to the annotations table, a process which is murky and prone to trial and error)
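To make two of the points above concrete (skip unrecognized files instead of aborting, and infer the dialect from file content), here is a minimal sketch. It is not the gdc_dice code: the names infer_seg_dialect, convert_seg, CONVERTERS and dice_all, the dialect labels, and the (name, data_type, text) input shape are all hypothetical. Only the "Sample" and "GDC_Aliquot" column names come from the discussion above.

```python
import logging

# Header column names that distinguish SEG dialects. The column names come
# from the discussion above; the dialect labels are hypothetical.
SEG_ID_COLUMNS = {"Sample": "legacy", "GDC_Aliquot": "gdc_dr11"}


def infer_seg_dialect(header_line):
    """Guess the SEG dialect by looking for a known ID column in the header."""
    columns = header_line.rstrip("\r\n").split("\t")
    for column_name, dialect in SEG_ID_COLUMNS.items():
        if column_name in columns:
            return dialect
    return None  # unrecognized header


def convert_seg(text):
    """Hypothetical converter: rename the ID column to 'Sample', keep the rest."""
    lines = text.splitlines()
    if not lines or infer_seg_dialect(lines[0]) is None:
        raise ValueError("unrecognized SEG header")
    columns = [
        "Sample" if column in SEG_ID_COLUMNS else column
        for column in lines[0].split("\t")
    ]
    return "\n".join(["\t".join(columns)] + lines[1:])


# Hypothetical registry mapping a data type to its converter.
CONVERTERS = {"seg": convert_seg}


def dice_all(files):
    """Convert every file we can; log and skip the ones we cannot.

    `files` is an iterable of (name, data_type, text) tuples.
    Returns (converted, skipped): a dict of name -> output text, and a
    list of the names that were skipped.
    """
    converted, skipped = {}, []
    for name, data_type, text in files:
        converter = CONVERTERS.get(data_type)
        if converter is None:
            logging.warning("skipping %s: unknown data type %r", name, data_type)
            skipped.append(name)
            continue
        try:
            converted[name] = converter(text)
        except ValueError as err:
            logging.warning("skipping %s: %s", name, err)
            skipped.append(name)
    return converted, skipped
```

With this shape the converter itself decides which dialect it is looking at, so no dialect needs to be passed in from the caller, and a file of an unknown type only produces a warning and an entry in the skipped list.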
This is a roundup of concerns, and as we work through them the above list will likely need to be corrected or extended, but for now it's sufficient to have written most of them down in summary form so that they don't fall through the cracks