Open mckinsel opened 6 years ago
Thanks for filing this @mckinsel.
One note: on the starfish call yesterday we decided the right term for the second level is probably "schema".
Trying to make the analogy to a database more explicit, fwiw:
| HDF5/Zarr | Database | Notes |
|---|---|---|
| Group | Database | Contains several Datasets/Tables, each keyed by a string name |
| Dataset | Table | Contains an N-D array: N Dimensions/Columns, and many Entries/Rows |
| Dimension | Column | Has a name and data type |
| Entry | Row | A tuple with one value for each Dimension/Column |
I suppose even "schema" is a bit overloaded, because it usually refers to a table's dimensions' names+types, whereas the "domain format" level discussed here encompasses "what tables exist in a database, and what are their schemas?"
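To make the two senses of "schema" concrete, here is a minimal sketch in plain Python. The names (`TableSchema`, `DomainFormat`, `conforms`) and the example tables are hypothetical, made up for illustration; they are not part of HDF5, Zarr, or any existing tool discussed here.

```python
from dataclasses import dataclass

# All names below are hypothetical, for illustration only.

@dataclass
class TableSchema:
    """Narrow sense of "schema": one table's column names and types."""
    columns: tuple  # tuple of (name, python_type) pairs


@dataclass
class DomainFormat:
    """Broad sense: which tables exist in the database, and each one's schema."""
    tables: dict  # table name -> TableSchema


def conforms(group: dict, fmt: DomainFormat) -> bool:
    """Check that a group (table name -> list of row tuples) matches a domain format."""
    # The set of tables must match exactly.
    if set(group) != set(fmt.tables):
        return False
    for name, rows in group.items():
        types = [col_type for _, col_type in fmt.tables[name].columns]
        for row in rows:
            # Each row needs one value per column, of the declared type.
            if len(row) != len(types):
                return False
            if not all(isinstance(v, t) for v, t in zip(row, types)):
                return False
    return True


# Made-up example: a "database" with two tables.
example_format = DomainFormat(tables={
    "genes": TableSchema(columns=(("gene_id", str), ("length", int))),
    "cells": TableSchema(columns=(("barcode", str), ("n_counts", int))),
})

example_group = {
    "genes": [("g1", 1200), ("g2", 800)],
    "cells": [("AAAC", 5000)],
}

print(conforms(example_group, example_format))
```

Here `TableSchema` is the usual database meaning, while `DomainFormat` is the "what tables exist, and what are their schemas?" level that this thread is trying to name.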
See the comment from @ryan-williams here:
I think this is a great addition as it currently stands, but have a high-level comment about the "formats" table.
Basically, there are two levels of "format" that are conflated:
I'm already starting to chafe against the combinatorial blowup of "stuffing database columns into filenames", e.g.:
These all describe the same data, and can be losslessly converted between one another. (Sorry, I really need to get this stuff into a public bucket and stable code pointer; will do asap, cf. #9)
So we'll have to decide how we want to deal with this combo-explosion.
My guess is that we can hackily stuff this info into filenames like the above for now, because our goal is mostly to settle on one or a few types of vetted formats+params that will be used more broadly, and in those broader settings ppl won't have to worry about the exponential parameter-space.
OTOH, if there's appetite to take a principled approach to provenance and put all this metadata in [a database that we route accesses to the data through] (presumably this is a focus in larger HCA-land), I'm interested in helping with that too!
Either way, in my mind that table looks roughly like {loom, anndata, 10x} x {zarr, hdf5, n5, parquet} x {a few best guesses at additional parameters for each one, like you have here}.
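As a rough sketch of how quickly that product grows, and of what "stuffing the columns into filenames" amounts to, something like the following works. The schema and container lists come from the comment above; the extra parameters (`compression`, `chunks`), their values, the `matrix` stem, and the filename layout are placeholders I made up, not the actual benchmark settings.

```python
from itertools import product

# Taken from the comment above.
schemas = ["loom", "anndata", "10x"]
containers = ["zarr", "hdf5", "n5", "parquet"]

# Hypothetical extra parameters: stand-ins, not the real benchmark settings.
params = {
    "compression": ["gzip", "none"],
    "chunks": ["1000x1000", "5000x5000"],
}


def enumerate_variants(schemas, containers, params):
    """Return one dict per point in the schema x container x params product."""
    keys = list(params)
    variants = []
    for schema, container, *values in product(
        schemas, containers, *(params[k] for k in keys)
    ):
        variants.append(
            {"schema": schema, "container": container, **dict(zip(keys, values))}
        )
    return variants


def variant_filename(variant, stem="matrix"):
    """Encode every 'column' of a variant into a filename (hypothetical layout)."""
    extras = [
        f"{k}-{v}" for k, v in variant.items() if k not in ("schema", "container")
    ]
    return ".".join([stem, variant["schema"], *extras, variant["container"]])


variants = enumerate_variants(schemas, containers, params)
print(len(variants))  # 3 * 4 * 2 * 2 = 48
print(variant_filename(variants[0]))
```

Each extra parameter multiplies the number of files, which is why a real metadata table (one row per variant, one column per parameter) starts to look more attractive than filenames as soon as more than a couple of parameters vary.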