Closed: mckinsel closed this 6 years ago
I think this is a great addition as it currently stands, but have a high-level comment about the "formats" table.
Basically, there are two levels of "format" being conflated.
I'm already chafing against the combinatorial blowup of "stuffing database columns into filenames", e.g.:
| Filename | Description |
|---|---|
| ica_bone_marrow_h5.h5 | 10x HDF5 |
| ica_bone_marrow.10x.16m.zarr | 10x HDF5 layout, converted to zarr in 16MB chunks |
| ica_bone_marrow.10x.32m.zarr | 10x HDF5 layout, converted to zarr in 32MB chunks |
| ica_bone_marrow.10x.64m.zarr | 10x HDF5 layout, converted to zarr in 64MB chunks |
| ica_bone_marrow.h5ad | Converted to AnnData's HDF5 format |
| ica_bone_marrow.ad.16m.zarr | AnnData's HDF5 format in zarr w/ 16MB chunks |
| ica_bone_marrow.ad.32m.zarr | AnnData's HDF5 format in zarr w/ 32MB chunks |
| ica_bone_marrow.ad.64m.zarr | AnnData's HDF5 format in zarr w/ 64MB chunks |
These all describe the same data, and can be losslessly converted between one another. (Sorry, I really need to get this stuff into a public bucket and stable code pointer; will do asap, cf. #9)
So we'll have to decide how we want to deal with this combo-explosion.
My guess is that we can hackily stuff this info into filenames like the above for now: our goal is mostly to settle on one or a few vetted formats+params that will be used more broadly, and in those broader settings people won't have to worry about the exponential parameter space.
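To make the "database columns in filenames" hack at least mechanical, here's a small parser sketch. The naming convention and field names (`dataset`, `layout`, `chunk_mb`, `container`) are my own guesses from the table above, not anything settled:

```python
import re

# Hypothetical parser for converted-file names like
# "ica_bone_marrow.10x.16m.zarr": dataset, source layout
# (10x or ad), chunk size in MB, and container format.
PATTERN = re.compile(
    r"^(?P<dataset>[a-z0-9_]+)"
    r"\.(?P<layout>10x|ad)"
    r"\.(?P<chunk_mb>\d+)m"
    r"\.(?P<container>zarr)$"
)

def parse_name(filename: str) -> dict:
    """Recover the metadata stuffed into a filename, or raise ValueError."""
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    fields = m.groupdict()
    fields["chunk_mb"] = int(fields["chunk_mb"])
    return fields
```

If we later move this metadata into a real database, a parser like this also gives us a migration path: walk the bucket once and backfill rows from the names.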
OTOH, if there's appetite to take a principled approach to provenance and put all this metadata in a database that we route data accesses through (presumably a focus in larger HCA-land), I'm interested in helping with that too!
Either way, in my mind that table looks roughly like {loom, anndata, 10x} x {zarr, hdf5, n5, parquet} x {a few best guesses at additional parameters for each one, like you have here}.
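That cross-product is easy to enumerate, which also shows how fast it grows; a quick sketch (the axis values are illustrative guesses, per the above):

```python
from itertools import product

# Illustrative axes for {layout} x {container} x {params};
# the chunk sizes are the guesses from the table above.
layouts = ["loom", "anndata", "10x"]
containers = ["zarr", "hdf5", "n5", "parquet"]
chunk_mb = [16, 32, 64]

combos = [
    {"layout": lay, "container": con, "chunk_mb": mb}
    for lay, con, mb in product(layouts, containers, chunk_mb)
]
# 3 layouts x 4 containers x 3 chunk sizes = 36 variants of the same data
```

Even with one extra parameter axis per container this doubles or triples, which is the argument for vetting just a few combinations rather than publishing the whole grid.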
Describe the test data, how the tests work, and how to contribute.