HumanCellAtlas / table-testing

requirements, examples, and tests for expression matrix file formats
MIT License

Add documentation #8

Closed mckinsel closed 6 years ago

mckinsel commented 6 years ago

Describe the test data, how the tests work, and how to contribute.

ryan-williams commented 6 years ago

I think this is a great addition as it currently stands, but have a high-level comment about the "formats" table.

Basically, there are two levels of "format" that are conflated:

I'm starting to chafe against the combinatorial blowup already, "stuffing database columns into filenames", e.g.:

| Filename | Description |
| --- | --- |
| ica_bone_marrow_h5.h5 | 10x HDF5 |
| ica_bone_marrow.10x.16m.zarr | 10x HDF5 layout, converted to zarr in 16MB chunks |
| ica_bone_marrow.10x.32m.zarr | 10x HDF5 layout, converted to zarr in 32MB chunks |
| ica_bone_marrow.10x.64m.zarr | 10x HDF5 layout, converted to zarr in 64MB chunks |
| ica_bone_marrow.h5ad | Converted to AnnData's HDF5 format |
| ica_bone_marrow.ad.16m.zarr | AnnData's HDF5 format in zarr w/ 16MB chunks |
| ica_bone_marrow.ad.32m.zarr | AnnData's HDF5 format in zarr w/ 32MB chunks |
| ica_bone_marrow.ad.64m.zarr | AnnData's HDF5 format in zarr w/ 64MB chunks |

These all describe the same data and can be losslessly converted to one another. (Sorry, I really need to get this stuff into a public bucket and stable code pointer; will do asap, cf. #9)

So we'll have to decide how we want to deal with this combo-explosion.

My guess is that we can hackily stuff this info into filenames like the above for now, because our goal is mostly to settle on one or a few vetted format+parameter combinations that will be used more broadly; in those broader settings ppl won't have to worry about the combinatorial parameter space.
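
To make the "database columns in filenames" idea concrete, here is a minimal sketch of pulling those columns back out of a name. It assumes a hypothetical `<dataset>.<layout>.<N>m.<container>` convention inferred from the zarr examples in this comment; `MatrixFile` and `parse_filename` are illustrative names, not anything that exists in this repo.

```python
import re
from typing import NamedTuple, Optional


class MatrixFile(NamedTuple):
    """Columns that the filename is standing in for (hypothetical schema)."""
    dataset: str
    layout: str      # e.g. "10x" or "ad"
    chunk_mb: int    # chunk size in MB
    container: str   # e.g. "zarr"


# Assumed convention: <dataset>.<layout>.<N>m.<container>
_PATTERN = re.compile(
    r"^(?P<dataset>[^.]+)\.(?P<layout>[^.]+)\.(?P<chunk>\d+)m\.(?P<container>[^.]+)$"
)


def parse_filename(name: str) -> Optional[MatrixFile]:
    """Return the encoded fields, or None if the name does not follow the convention."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    return MatrixFile(m["dataset"], m["layout"], int(m["chunk"]), m["container"])


print(parse_filename("ica_bone_marrow.10x.16m.zarr"))
print(parse_filename("ica_bone_marrow.h5ad"))  # does not match the convention
```

Names like `ica_bone_marrow.h5ad` fall outside the pattern and come back as `None`, which is part of why this only works as a stopgap.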

OTOH, if there's appetite to take a principled approach to provenance and put all this metadata in [a database that we route accesses to the data through] (presumably this is a focus in larger HCA-land), I'm interested in helping with that too!

Either way, in my mind that table looks roughly like {loom, anndata, 10x} x {zarr, hdf5, n5, parquet} x {a few best guesses at additional parameters for each one, like you have here}
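
A rough sketch of that cross product, to show how quickly it grows. The layout and container names come from this comment; the chunk sizes are a placeholder for the "additional parameters" axis, and applying the same values to every combination is a simplifying assumption (real parameters would likely differ per container).

```python
from itertools import product

LAYOUTS = ["loom", "anndata", "10x"]
CONTAINERS = ["zarr", "hdf5", "n5", "parquet"]
# Assumption: one shared parameter axis (chunk size in MB) for every combination.
CHUNK_MB = [16, 32, 64]


def format_grid(layouts=LAYOUTS, containers=CONTAINERS, chunk_mb=CHUNK_MB):
    """Enumerate every layout x container x parameter combination as a row."""
    return [
        {"layout": layout, "container": container, "chunk_mb": mb}
        for layout, container, mb in product(layouts, containers, chunk_mb)
    ]


grid = format_grid()
print(len(grid))  # 3 layouts * 4 containers * 3 chunk sizes
for row in grid[:3]:
    print(row)
```

Each extra parameter axis multiplies the row count again, which is the blowup that filenames alone handle poorly.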