Open wasade opened 9 years ago
:+1: just found this in a table I'm working on, it is forever to be stored in 1.0 :worried:
:(
In the meantime I've %s/null/""/g
... not ideal but works.
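A slightly more targeted version of that substitution, as a rough sketch only: it assumes the table is a BIOM 1.0 JSON document with `rows` and `columns` lists whose entries carry a `metadata` object, and the helper name `blank_null_metadata` is made up for illustration (it is not part of biom-format). Unlike a blanket `null` replacement, it only touches null values inside a metadata object and leaves an entirely absent (`null`) metadata entry alone.

```python
import json


def blank_null_metadata(table, replacement=""):
    """Replace None metadata values on rows and columns with `replacement`.

    Hypothetical helper, not a biom-format function.
    """
    for axis in ("rows", "columns"):
        for entry in table.get(axis, []):
            metadata = entry.get("metadata")
            # An entry may carry no metadata at all (null); leave that untouched.
            if not isinstance(metadata, dict):
                continue
            for key, value in metadata.items():
                if value is None:
                    metadata[key] = replacement
    return table


# Made-up fragment of a BIOM 1.0 JSON table; only the fields used here are shown.
raw = """{"rows": [{"id": "O1", "metadata": null}],
 "columns": [{"id": "S1", "metadata": {"pH": 7.0}},
             {"id": "S2", "metadata": {"pH": null}}]}"""

cleaned = blank_null_metadata(json.loads(raw))
print(json.dumps(cleaned["columns"]))
```

Same caveat as the one-liner: the empty string is a stand-in, so the "missing" information is lost once the table is written.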
On (Mar-30-15|18:04), Daniel McDonald wrote:
:(
Unfortunately this is a big question: how to serialize missing data. We can't serialize None (a Python object) through hdf5, so we need to choose a value that represents missing data in a way that round trips safely. nan could potentially work for float fields, but not for integer- or string-fields. Another option that comes to mind is to save another array that marks whether the corresponding value is missing or not (à la masked arrays from numpy).
Actually, I don't think we're using masked arrays in any of our projects, maybe we should look deeper into them.
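To make the "second array" idea concrete, here is a minimal pure-Python sketch. It is not how biom-format stores anything; `split_missing` and `merge_missing` are hypothetical names, and the fill values are arbitrary placeholders. The point is only that a values array plus a boolean mask round trips for float, integer, and string fields alike, whereas nan covers floats only.

```python
import math


def split_missing(values, fill):
    """Return (data, mask): data holds `fill` where a value was None,
    and mask is True in exactly those positions."""
    mask = [v is None for v in values]
    data = [fill if m else v for v, m in zip(values, mask)]
    return data, mask


def merge_missing(data, mask):
    """Inverse of split_missing: restore None wherever the mask is set."""
    return [None if m else v for v, m in zip(data, mask)]


ph = [7.0, 8.0, 7.0, None]       # float field
depth = [3, None, 5, 9]          # integer field: nan is not an option here
site = ["a", None, "c", "d"]     # string field: same problem

for column, fill in ((ph, 0.0), (depth, 0), (site, "")):
    data, mask = split_missing(column, fill)
    # Both pieces are plain homogeneous lists, so each could be stored as its own dataset.
    assert merge_missing(data, mask) == column

# nan by itself only helps for floats, and it must be detected with isnan()
# because nan never compares equal to itself.
assert math.isnan(float("nan")) and float("nan") != float("nan")
```

numpy's `numpy.ma` module packages the same data-plus-mask pairing; the cost is that the mask has to be stored alongside the values as a separate piece of data.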
We need better enforcement surrounding metadata... a reserved word for indicating a null entry for HDF5, but it would need to be defined in the spec itself, which would trigger a change to format version 2.1.1, which is not ideal. The masked arrays would really need to trigger a 2.2.0 format as that would be defining a separate dataset.
All paths are not fun -- I think the best direction is to, at write, detect nulls like this such that in the original example, the data are implicitly transformed to:
> t.metadata()
Out[5]:
(defaultdict(<function <lambda> at 0x10d595848>, {u'pH': 7.0}),
defaultdict(<function <lambda> at 0x10d5958c0>, {u'pH': 8.0}),
defaultdict(<function <lambda> at 0x10d595938>, {u'pH': 7.0}),
defaultdict(<function <lambda> at 0x10d5959b0>, {u'pH': None}))
...which I believe would fly for the format and spec.
I think that works @wasade and shouldn't need a new format/spec
Deferring to 2.2 as this is also lumped into the grand ol' refactor of the formatters and parsers. It would be nice to defer the type detection as well back to pandas as this could get nasty fast.
This table:
Will cause to_hdf5 to except. It is valid for to_json.