As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly according to rules that we will need to define later. For example, unsupervised feature construction should not remove gene expression samples that don't have mutation status.
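To make the on-the-fly subsetting idea concrete, here is a minimal sketch in plain Python. The sample IDs, the `select_samples` helper, and the task names are all hypothetical and are not part of the cancer-data repository; the only point is that the complete sample sets are kept and each task applies its own rule when it runs.

```python
# Hypothetical sample IDs; real IDs would come from the exported complete datasets.
expression_samples = {"S1", "S2", "S3", "S4"}  # samples with gene expression
mutation_samples = {"S2", "S3", "S5"}          # samples with mutation status


def select_samples(task):
    """Return the sorted sample IDs a task should use (task names are assumptions)."""
    if task == "unsupervised":
        # Feature construction only needs expression, so keep every expression sample.
        return sorted(expression_samples)
    if task == "supervised":
        # A classifier needs both expression features and a mutation label.
        return sorted(expression_samples & mutation_samples)
    raise ValueError(f"unknown task: {task!r}")


print(select_samples("unsupervised"))  # ['S1', 'S2', 'S3', 'S4']
print(select_samples("supervised"))    # ['S2', 'S3']
```

With this arrangement no sample is discarded at export time; the restriction to samples that have both data types happens only in the tasks that actually need it.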
The motivation of this pull request is that Cognoma is producing the most user-friendly data. We should export the complete datasets to enable many applications. Currently, I'm not planning to upload this data to figshare, although we could (especially once we set up continuous integration). Either way, users will be able to generate it for themselves.
@gwaygenomics has previously brought up the need to export our processed datasets for all observations: https://github.com/cognoma/cancer-data/pull/20#issuecomment-242408331