HumanCellAtlas / table-testing

requirements, examples, and tests for expression matrix file formats
MIT License

Consider separating "formats" concerns into "container" vs "domain" #11

Open mckinsel opened 6 years ago

mckinsel commented 6 years ago

See the comment from @ryan-williams here:


I think this is a great addition as it currently stands, but have a high-level comment about the "formats" table.

Basically, there are two levels of "format" that are conflated:

I'm starting to chafe against the combinatorial blowup already, "stuffing database columns into filenames", e.g.:

| Filename | Description |
| --- | --- |
| ica_bone_marrow_h5.h5 | 10X HDF5 |
| ica_bone_marrow.10x.16m.zarr | 10x HDF5 layout, converted to zarr in 16MB chunks |
| ica_bone_marrow.10x.32m.zarr | 10x HDF5 layout, converted to zarr in 32MB chunks |
| ica_bone_marrow.10x.64m.zarr | 10x HDF5 layout, converted to zarr in 64MB chunks |
| ica_bone_marrow.h5ad | Converted to AnnData's HDF5 format |
| ica_bone_marrow.ad.16m.zarr | AnnData's HDF5 format in zarr w/ 16MB chunks |
| ica_bone_marrow.ad.32m.zarr | AnnData's HDF5 format in zarr w/ 32MB chunks |
| ica_bone_marrow.ad.64m.zarr | AnnData's HDF5 format in zarr w/ 64MB chunks |

These all describe the same data, and can be losslessly converted between one another. (Sorry, I really need to get this stuff into a public bucket and stable code pointer; will do asap, cf. #9)

So we'll have to decide how we want to deal with this combo-explosion.

My guess is that we can hackily stuff this info into filenames like the above for now, because our goal is mostly to settle on one or a few types of vetted formats+params that will be used more broadly, and in those broader settings ppl won't have to worry about the exponential parameter-space.
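For what it's worth, the "columns stuffed into filenames" can be pulled back out mechanically. Below is a minimal sketch, assuming the naming pattern of the files listed above; the `MatrixFile` record and `parse_matrix_filename` helper are hypothetical names for illustration, not part of this repo.

```python
import re
from typing import NamedTuple, Optional


class MatrixFile(NamedTuple):
    dataset: str             # e.g. "ica_bone_marrow"
    schema: str              # "domain" level: "10x" or "anndata"
    container: str           # "container" level: "hdf5" or "zarr"
    chunk_mb: Optional[int]  # chunk size in MB, when the filename records one


# Matches names like "ica_bone_marrow.10x.16m.zarr" or "ica_bone_marrow.ad.64m.zarr".
ZARR_RE = re.compile(r"^(?P<dataset>.+)\.(?P<schema>10x|ad)\.(?P<chunk>\d+)m\.zarr$")


def parse_matrix_filename(filename: str) -> MatrixFile:
    """Recover (dataset, schema, container, chunk size) from one of the filenames above."""
    m = ZARR_RE.match(filename)
    if m:
        schema = "anndata" if m["schema"] == "ad" else "10x"
        return MatrixFile(m["dataset"], schema, "zarr", int(m["chunk"]))
    if filename.endswith(".h5ad"):
        return MatrixFile(filename[: -len(".h5ad")], "anndata", "hdf5", None)
    if filename.endswith("_h5.h5"):
        # Assumption: the "_h5.h5" suffix marks the original 10x HDF5 file.
        return MatrixFile(filename[: -len("_h5.h5")], "10x", "hdf5", None)
    raise ValueError(f"unrecognized filename pattern: {filename}")


print(parse_matrix_filename("ica_bone_marrow.ad.32m.zarr"))
```

Every new parameter means another filename segment and another parsing rule, which is roughly the point at which a real metadata table starts to look more attractive.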

OTOH, if there's appetite to take a principled approach to provenance and put all this metadata in [a database that we route accesses to the data through] (presumably this is a focus in larger HCA-land), I'm interested in helping with that too!

Either way, in my mind that table looks roughly like {loom, anndata, 10x} x {zarr, hdf5, n5, parquet} x {a few best guesses at additional parameters for each one, like you have here}
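To get a feel for the size of that cross product, here is a rough sketch that enumerates it. The schema and container names come from the line above; using the three chunk sizes from the filename list as the only extra parameter, and applying them uniformly to every container, is an assumption made purely for illustration.

```python
from itertools import product

schemas = ["loom", "anndata", "10x"]             # "domain" level
containers = ["zarr", "hdf5", "n5", "parquet"]   # "container" level
chunk_mb = [16, 32, 64]                          # assumed extra parameter (chunk size in MB)

# One record per (schema, container, chunk size) combination.
variants = [
    {"schema": s, "container": c, "chunk_mb": k}
    for s, c, k in product(schemas, containers, chunk_mb)
]

print(len(variants))  # 3 * 4 * 3 = 36
print(variants[0])
```

Even with a single extra parameter this is already 36 variants of the same data, so each additional parameter axis multiplies the count again.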

ryan-williams commented 6 years ago

Thanks for filing this @mckinsel.

One note: on the starfish call yesterday we decided the right term for the second level is probably "schema".

Trying to make the analogy to a database more explicit, fwiw:

| HDF5/Zarr | Database | Notes |
| --- | --- | --- |
| Group | Database | Contains several Datasets/Tables, each keyed by a string name |
| Dataset | Table | Contains an N-D array: N Dimensions/Columns, and many Entries / Rows |
| Dimension | Column | Has a name and data type |
| Entry | Row | A tuple with one value for each Dimension/Column |

I suppose even "schema" is a bit overloaded, because it usually refers to a table's dimensions' names+types, whereas the "domain format" level discussed here encompasses "what tables exist in a database, and what are their schemas?"
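One way to pin down that broader sense of "schema" is to write it as data: a mapping from dataset name to that dataset's dimensions. The sketch below is only an illustration of the idea; the class names, the `conforms` helper, and the toy dataset and dimension names are all hypothetical and do not describe the real loom, AnnData, or 10x layouts.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Dimension:
    name: str    # column name in the database analogy
    dtype: str   # column data type


@dataclass(frozen=True)
class Dataset:
    dimensions: Tuple[Dimension, ...]  # the table's schema in the narrow sense


# The "domain format" level: which datasets exist in a group, and each one's schema.
DomainSchema = Dict[str, Dataset]

# Toy schema with made-up names, for illustration only.
TOY_SCHEMA: DomainSchema = {
    "expression": Dataset((Dimension("cell", "int64"), Dimension("gene", "int64"))),
    "gene_names": Dataset((Dimension("gene", "str"),)),
}


def conforms(group: Dict[str, Tuple[Tuple[str, str], ...]], schema: DomainSchema) -> bool:
    """True if the group holds exactly the schema's datasets with matching (name, dtype) dimensions."""
    if group.keys() != schema.keys():
        return False
    return all(
        tuple((d.name, d.dtype) for d in schema[name].dimensions) == tuple(dims)
        for name, dims in group.items()
    )


# A group described independently of its container (HDF5, zarr, ...).
found = {
    "expression": (("cell", "int64"), ("gene", "int64")),
    "gene_names": (("gene", "str"),),
}
print(conforms(found, TOY_SCHEMA))
```

Nothing in the check depends on whether the group lives in HDF5 or zarr, which is the container/domain separation this issue is asking for.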