For studies with a large number of attributes, it would be nice if we could split the data files: provide a few files, each covering a subset of patients and a subset of columns. The importer would then fill the complete sample-attribute matrix in memory and mark the unfilled cells as missing (NA would do, I guess).
Currently, the data file for samples (or patients) needs to contain all the data: all the rows and all the columns. It would be useful to feed the importer multiple data files that each contain a subset of the rows and columns.
Instead of having one file:
id       attribute1  attribute2
sample1  A           B
sample2  C           NA
We could have files:

id       attribute1
sample1  A

id       attribute2
sample1  B

id       attribute1
sample2  C
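The merge described above could be sketched roughly as follows. This is only a minimal illustration, not the importer's actual code: it assumes tab-separated files that each have an `id` column, and the names `merge_attribute_files` and `MISSING` are made up for this example. Malformed rows (too few or too many fields) are not handled.

```python
import csv
import io

MISSING = "NA"  # assumed placeholder for cells that no file provides


def merge_attribute_files(files):
    """Merge partial tab-separated tables into one sample-attribute matrix.

    `files` is an iterable of open text handles, each with an 'id' column.
    Returns (header, matrix) with rows and columns in first-seen order.
    """
    columns = []  # attribute names, in the order they are first seen
    rows = {}     # id -> {attribute: value}; dicts keep insertion order
    for handle in files:
        for record in csv.DictReader(handle, delimiter="\t"):
            sample_id = record.pop("id")
            merged = rows.setdefault(sample_id, {})
            for attribute, value in record.items():
                if attribute not in columns:
                    columns.append(attribute)
                # Two files disagreeing on the same cell is treated as an error.
                if attribute in merged and merged[attribute] != value:
                    raise ValueError(
                        f"conflicting values for {sample_id}/{attribute}: "
                        f"{merged[attribute]!r} vs {value!r}"
                    )
                merged[attribute] = value
    matrix = [
        [sample_id] + [values.get(column, MISSING) for column in columns]
        for sample_id, values in rows.items()
    ]
    return ["id"] + columns, matrix


# The three small files from the example above, as in-memory text.
parts = [
    io.StringIO("id\tattribute1\nsample1\tA\n"),
    io.StringIO("id\tattribute2\nsample1\tB\n"),
    io.StringIO("id\tattribute1\nsample2\tC\n"),
]
header, matrix = merge_attribute_files(parts)
```

With these inputs, `header` and `matrix` reproduce the single-file table shown earlier, including the NA for sample2's attribute2. Raising on conflicting values is one possible design choice; an importer could equally let the last file win and log a warning.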
Advantages:
Working with files with hundreds of columns is complicated:
  diffs are nearly useless, so committing data files to git for version control is not much better than committing a binary blob
  viewing or editing them by hand is tricky, and it can crash shittier tools
Pipelines could produce files covering different attributes that could be directly imported
Samples coming from different sources could stay in different files (useful for GENIE, too)
It would be easy to have separate studies for subcohorts/subsets of data with exactly the same data files
Disadvantages:
Importer gets more complicated, validation error messages get more complicated
A typo in an attribute name means there are now two attributes and a bunch of missing values
Errors that currently get caught with "not all lines have the same number of columns" might pass validation and result in a bunch of missing values instead
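One way the typo problem might be softened is a validation warning for attribute names that look almost identical across files. The sketch below is an assumption about how such a check could work, not part of the existing validator: the function name, the 0.85 cutoff, and the attribute names in the usage lines are all hypothetical. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib
from itertools import combinations


def find_suspicious_attribute_pairs(attribute_names, cutoff=0.85):
    """Return pairs of distinct attribute names that are nearly identical.

    A high similarity ratio hints that one name may be a typo of the other.
    The cutoff is an arbitrary illustrative value, not a tuned threshold.
    """
    suspicious = []
    for first, second in combinations(sorted(set(attribute_names)), 2):
        ratio = difflib.SequenceMatcher(None, first, second).ratio()
        if ratio >= cutoff:
            suspicious.append((first, second))
    return suspicious


# Hypothetical attribute names collected from all the partial files.
names = ["AGE", "SEX", "TUMOR_STAGE", "TUMOUR_STAGE"]
pairs = find_suspicious_attribute_pairs(names)
```

Here `pairs` contains the single pair `("TUMOR_STAGE", "TUMOUR_STAGE")`. This should only ever be a warning: legitimately numbered names such as attribute1 and attribute2 are also very similar and would be flagged.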