biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

inconsistent metadata representation between JSON and HDF5 #594

Closed jairideout closed 6 years ago

jairideout commented 9 years ago

Metadata is handled differently depending on the underlying file format (JSON or HDF5).

This is related to a previous issue (#585) and fix (#589). The original issue occurred in QIIME's rarefaction unit tests (https://github.com/biocore/qiime/issues/1918).

Example:

Create an in-memory table with metadata as a list of empty dictionaries. Write this table as JSON and HDF5. Read the two tables back into memory and compare to the original table. The JSON table is equal to the in-memory table, but the HDF5 table is not because the metadata differ (None vs. a list of defaultdicts):

In [1]: from biom.table import Table

In [2]: import numpy as np

In [3]: t = Table(np.array([[2,1,0],[0,5,0],[0,3,0],[1,2,0]]), list('bacd'), list('YXZ'), observation_metadata=[{}, {}, {}, {}], sample_metadata=[{}, {}, {}])

In [4]: with open('json.biom', 'w') as f:
   ...:     t.to_json('me', f)
   ...:

In [5]: from biom.util import biom_open

In [6]: with biom_open('hdf5.biom', 'w') as f:
   ...:     t.to_hdf5(f, 'me', True)
   ...:

In [7]: from biom import load_table

In [8]: json_table = load_table('json.biom')

In [9]: hdf5_table = load_table('hdf5.biom')

In [10]: json_table.descriptive_equality(t)
Out[10]: 'Tables appear equal'

In [11]: hdf5_table.descriptive_equality(t)
Out[11]: 'Observation metadata are not the same'

cc @josenavas @gregcaporaso @Jorge-C

wasade commented 9 years ago

Ping on this. We're doing a release for #599 relatively soon, so it would be good to lump in other bug fixes if they're attainable in short order.

wasade commented 8 years ago

I'm really not sure what the best solution is here. I think the JSON formatter is actually the incorrect one, since the dicts are empty and there is no metadata to store. We don't have a way to represent this in HDF5 because the metadata are stored as datasets named by their keys, and here no keys exist.
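
To make the HDF5 limitation concrete, here is a simplified model, not biom's actual writer or reader. It assumes only what is stated above: per-axis metadata are stored as one column per key. The helper names `to_columns` and `from_columns` are hypothetical.

```python
def to_columns(metadata):
    """Turn a list of per-ID dicts into {key: [value per ID]}.

    Hypothetical stand-in for a key-named storage layout.
    """
    if metadata is None:
        return {}
    keys = set()
    for entry in metadata:
        keys.update(entry)
    return {k: [entry.get(k) for entry in metadata] for k in sorted(keys)}


def from_columns(columns, n_ids):
    """Rebuild per-ID dicts; with no columns there is nothing to rebuild."""
    if not columns:
        return None
    return [{k: vals[i] for k, vals in columns.items()} for i in range(n_ids)]


empty = [{}, {}, {}, {}]
# No keys means no columns, so the round trip cannot tell
# "list of empty dicts" apart from "no metadata".
print(from_columns(to_columns(empty), len(empty)))
```

Under this model, any layout keyed by metadata name collapses a list of empty dicts to "no metadata" on a round trip, which matches the mismatch reported in the example.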

I see two options here: either add a check to the JSON formatter for this edge case, or add a check to the constraints on table metadata so that, if all entries are empty, we set the metadata on that axis to None. The latter is kind of nice, but without immutability I don't know if we can actually enforce it.
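
A minimal sketch of the second option, assuming the check would run wherever axis metadata are assigned. The function name `normalize_axis_metadata` is hypothetical and is not part of biom's API.

```python
def normalize_axis_metadata(metadata):
    """Collapse metadata that carry no information to None.

    Hypothetical helper: returns None when `metadata` is None or when
    every per-ID entry is empty (or None); otherwise returns it unchanged.
    """
    if metadata is None:
        return None
    if all(not entry for entry in metadata):
        return None
    return metadata


print(normalize_axis_metadata([{}, {}, {}]))             # None
print(normalize_axis_metadata([{'taxonomy': 'x'}, {}]))  # unchanged
```

This only normalizes at the point where it is called. As noted above, without immutability a caller could still mutate the metadata in place afterwards and bypass the check.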