biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

to_hdf5 cannot handle None's in metadata #609

Open wasade opened 9 years ago

wasade commented 9 years ago

This table:

> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
 defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5959b0>, {}))

This will cause to_hdf5 to raise an exception, although the table is valid for to_json.

ElDeveloper commented 9 years ago

:+1: Just found this in a table I'm working on; it looks like it will have to stay stored in 1.0 forever :worried:

wasade commented 9 years ago

:(

ElDeveloper commented 9 years ago

In the meantime I've replaced the nulls with empty strings (%s/null/""/g) ... not ideal, but it works.

On (Mar-30-15|18:04), Daniel McDonald wrote:

:(



Jorge-C commented 9 years ago

Unfortunately this is a big question: how to serialize missing data.

We can't serialize None (a Python object) through hdf5, so we need to choose a value that represents missing data and round-trips safely. nan could work for float fields, but not for integer or string fields. Another option that comes to mind is to save a second array that marks whether each corresponding value is missing (à la masked arrays from numpy).
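
A minimal sketch of the mask idea in plain Python, not biom-format code: the values are stored with a placeholder wherever data are missing, and a parallel list of booleans records which entries were missing. The names encode_with_mask and decode_with_mask and the fill values are hypothetical, chosen only for illustration.

```python
def encode_with_mask(values, fill):
    """Split values into (filled, mask); mask[i] is True where values[i] was None."""
    mask = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, mask


def decode_with_mask(filled, mask):
    """Invert encode_with_mask: restore None wherever the mask is set."""
    return [None if m else v for v, m in zip(filled, mask)]


# Float field from the example table, with one missing pH value.
ph = [7.0, 8.0, 7.0, None]
filled, mask = encode_with_mask(ph, fill=0.0)
assert decode_with_mask(filled, mask) == ph

# The same scheme covers integer and string fields, where nan is not an option.
depth = [3, None, 5]
site = ["gut", "skin", None]
assert decode_with_mask(*encode_with_mask(depth, fill=0)) == depth
assert decode_with_mask(*encode_with_mask(site, fill="")) == site
```

Both lists hold only plain floats, integers, strings, or booleans, so neither one contains a None. The cost is a second array per metadata field.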

Jorge-C commented 9 years ago

Actually, I don't think we're using masked arrays in any of our projects, maybe we should look deeper into them.

wasade commented 7 years ago

We need better enforcement surrounding metadata. One option is a reserved word indicating a null entry for HDF5, but it would need to be defined in the spec itself, which would trigger a change to format version 2.1.1, which is not ideal. The masked arrays would really need to trigger a 2.2.0 format, as that would define a separate dataset.

None of the paths are fun. I think the best direction is to detect nulls like this at write time, so that the data in the original example are implicitly transformed to:

> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
 defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
 defaultdict(<function <lambda> at 0x10d5959b0>, {u'pH': None}))

...which I believe would fly for the format and spec.
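
A rough sketch of that write-time normalization in plain Python, under the assumption that metadata is a tuple of dict-like objects as shown by t.metadata(). The helper name fill_missing_keys is hypothetical and is not part of the biom-format API. It only makes the missing entries explicit; how the resulting None is then written to HDF5 is a separate step.

```python
from collections import defaultdict


def fill_missing_keys(metadata):
    """Return a tuple of plain dicts that share one key set; absent keys map to None."""
    all_keys = set()
    for md in metadata:
        all_keys.update(md)
    # dict.get never triggers a defaultdict's factory, so the inputs are not mutated.
    return tuple({k: md.get(k) for k in sorted(all_keys)} for md in metadata)


metadata = (
    defaultdict(lambda: None, {"pH": 7.0}),
    defaultdict(lambda: None, {"pH": 8.0}),
    defaultdict(lambda: None, {"pH": 7.0}),
    defaultdict(lambda: None, {}),
)
normalized = fill_missing_keys(metadata)
```

After this step every entry carries the same keys, so a writer can handle each metadata field as one column of equal length.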

josenavas commented 7 years ago

I think that works, @wasade, and it shouldn't need a new format/spec.

wasade commented 7 years ago

Deferring to 2.2, as this is also lumped into the grand ol' refactor of the formatters and parsers. It would also be nice to defer the type detection to pandas, as this could get nasty fast.