biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

TypeError: the JSON object must be str, bytes or bytearray, not File #965

Open thomasstjerne opened 2 months ago

thomasstjerne commented 2 months ago

Hi, I am trying to convert an HDF5 BIOM 2.1 file using biom convert -i data.biom -o table.from_biom.txt --to-tsv

I checked that the file validates:

(.venv) me@12345 biom-test % biom validate-table -i data.biom                           

The input file is a valid BIOM-formatted file.

But I get the following error: TypeError: the JSON object must be str, bytes or bytearray, not File

Any ideas? Full stack trace below:

(.venv) me@12345 biom-test % biom convert -i data.biom -o table.from_biom.txt --to-tsv
Traceback (most recent call last):
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/biom/parse.py", line 668, in load_table
    table = parse_biom_table(fp)
            ^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/biom/parse.py", line 422, in parse_biom_table
    t = Table.from_json(json.loads(file_obj),
                        ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.2_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not File

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/me/biom-test/.venv/bin/biom", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/biom/cli/table_converter.py", line 113, in convert
    table = load_table(input_fp)
            ^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/biom-test/.venv/lib/python3.12/site-packages/biom/parse.py", line 670, in load_table
    raise TypeError("%s does not appear to be a BIOM file!" % f)
TypeError: data.biom does not appear to be a BIOM file!

wasade commented 2 months ago

Thanks @thomasstjerne! That's odd. Is there any chance the file could be uploaded? Just as a heads-up, I'm out of the office today so I may be delayed, but I'll try to look as quickly as possible.

thomasstjerne commented 2 months ago

Thanks for the quick reply @wasade . I have made the file accessible here: https://labs.gbif.org/~tsjeppesen/data.biom

thomasstjerne commented 2 months ago

I have debugged a bit further myself, and it seems that the problem is a dataset, sample/group-metadata/default_values, which stores JSON strings. The dataset has the data_type attribute set to json. This should be valid, I guess?

wasade commented 2 months ago

Thank you, @thomasstjerne. How was this file created? It looks like the default_values dataset is empty and has no shape:

>>> x = f['sample/group-metadata/default_values']
>>> x
<HDF5 dataset "default_values": shape (), type "|S22">
>>> x.size
1
>>>

The failure itself is occurring here as the dataset is empty, and that edge case isn't properly being tested for.

The validator allows this because it only looks at the structure of the file and does not examine the contents. The contents appear malformed, as there doesn't appear to be any information associated with default_values. That said, I do think it would be far better for biom to be informative about the nature of the issue, as the current error is misleading.
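The h5py output above shows shape (), which is what h5py reports for a scalar dataset (a truly empty dataset reports a shape of None instead). As a rough illustration of the more informative check mentioned here, the sketch below lists scalar datasets under the group-metadata groups before a file is handed to biom. The helper name find_scalar_group_metadata is hypothetical and is not part of biom; the dataset paths in the demo simply mirror the ones named in this thread.

```python
import h5py
import numpy as np


def find_scalar_group_metadata(h5file):
    """List paths of scalar (shape ()) datasets under */group-metadata.

    Hypothetical helper for illustration; not part of biom.
    """
    scalar_paths = []
    for axis in ("observation", "sample"):
        group = h5file.get(f"{axis}/group-metadata")
        if not isinstance(group, h5py.Group):
            continue  # this axis has no group-metadata group
        for name, item in group.items():
            if isinstance(item, h5py.Dataset) and item.shape == ():
                scalar_paths.append(f"{axis}/group-metadata/{name}")
    return scalar_paths


# Demo on an in-memory file (never written to disk).
with h5py.File("check_demo.h5", "w", driver="core", backing_store=False) as f:
    # Scalar dataset, like the one in the file from this issue.
    f.create_dataset("sample/group-metadata/default_values",
                     data=np.array(b'{"target_gene":"ITS2"}'))
    # One-element dataset with shape (1,); the newick string is made up.
    f.create_dataset("observation/group-metadata/phylogeny",
                     data=np.array([b"(a,b);"]))
    scalar_paths = find_scalar_group_metadata(f)

print(scalar_paths)
```

Only the scalar dataset is reported; the shape (1,) dataset passes the check.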

thomasstjerne commented 2 months ago

Thank you @wasade. It was created using https://github.com/usnistgov/h5wasm for a Node.js web application for metabarcoding data. When I inspect the file using https://myhdf5.hdfgroup.org/ I see a single string {"target_gene":"ITS2"}. I was trying to follow the example in the docs where it says:

One example of such group metadata dataset is observation/group-metadata/phylogeny, with the attribute observation/group-metadata/phylogeny.attrs['data_type'] = "newick", which stores a single string with the newick format of the phylogenetic tree for the observations.

I wonder if there is an example file somewhere with such a phylogeny that I could look into?

Thanks again for your time.

wasade commented 2 months ago

Hi @thomasstjerne,

Thanks! I see the data now, it's a bit awkward to access with h5py:

>>> import h5py
>>> f = h5py.File('data.biom')
>>> f['sample/group-metadata/default_values'][()]
b'{"target_gene":"ITS2"}'

We currently assume that the group-metadata are set up as non-scalar data. In this case, I think it would be a dataset with a shape of (1,). I don't have a great example though as, to be honest, I'm not aware of active uses of these components of the format. It does seem the format description is less than precise on this. The parsing logic would need to be adjusted to account for the scalar case. Is that essential for your needs?
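To make the scalar versus shape (1,) distinction concrete, here is a minimal h5py sketch that writes the same JSON string both ways and reads each back. It uses an in-memory file and made-up dataset names ("scalar", "one_d"); it only illustrates the two HDF5 layouts and does not exercise biom's parser.

```python
import json

import h5py
import numpy as np

payload = b'{"target_gene":"ITS2"}'

# In-memory HDF5 file (nothing is written to disk); the file name is arbitrary.
with h5py.File("scalar_demo.h5", "w", driver="core", backing_store=False) as f:
    # Scalar dataset: shape (), the layout found in the file from this issue.
    scalar = f.create_dataset("scalar", data=np.array(payload))

    # One-element dataset: shape (1,), the layout the parser is said to assume.
    one_d = f.create_dataset("one_d", data=np.array([payload]))

    scalar_shape, one_d_shape = scalar.shape, one_d.shape
    scalar_value = scalar[()]  # scalar datasets are read with an empty tuple
    one_d_value = one_d[0]     # 1-D datasets can be indexed by position

print(scalar_shape, one_d_shape)
print(json.loads(scalar_value), json.loads(one_d_value))
```

Both layouts hold the same bytes; only the dataset shape, and therefore the indexing needed to read the value, differs.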

wasade commented 2 months ago

Our tests for group-metadata currently assume a dataset with a shape. An example can be found here. However, round-tripping through to_hdf5 with this dataset is not working. I will try to resolve this before the next micro release, but the window is going to be tight.

Is it possible to share the JavaScript code that uses h5wasm to create the table?

>>> import h5py
>>> import biom
>>> t = biom.load_table('biom/tests/test_data/test_grp_metadata.biom')
>>> f = h5py.File('asd.biom', 'w')
>>> t.to_hdf5(f, 'asd')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dtmcdonald/ResearchWork/software/biom-format/biom/table.py", line 4627, in to_hdf5
    datatype, val = value
ValueError: too many values to unpack (expected 2)
>>>

thomasstjerne commented 2 months ago

Thanks again @wasade. I have got it working using a dataset with shape 1, and it is not at all essential for me to use a scalar. The JavaScript code is available here https://github.com/gbif/edna-tool-backend/blob/main/converters/hdf5.js - but it is part of a larger code base that works as a RESTful web server. There is a separate UI project working on top of it. Beware that I have committed the fix for using a shape 1 dataset rather than the scalar, so you would need to go a commit or two back in the history.

I am happy to put together an isolated code snippet for just the HDF5 conversion if that would be helpful for you?
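For files that were already written with a scalar dataset, the shape 1 workaround described above could in principle also be applied after the fact with h5py. The sketch below is an assumption-laden illustration: promote_scalar_to_1d is a hypothetical helper (not part of biom or h5wasm), the demo runs on an in-memory file, and it has not been verified against biom's parser.

```python
import h5py
import numpy as np


def promote_scalar_to_1d(h5file, path):
    """Replace the scalar dataset at `path` with a shape (1,) copy.

    Hypothetical helper for illustration. Returns True if the dataset was
    rewritten, False if it was not a scalar and was left untouched.
    """
    old = h5file[path]
    if old.shape != ():
        return False
    value, dtype = old[()], old.dtype
    attrs = dict(old.attrs)  # copy attributes before the dataset is unlinked
    del h5file[path]
    new = h5file.create_dataset(path, data=np.array([value], dtype=dtype))
    for key, attr_value in attrs.items():
        new.attrs[key] = attr_value
    return True


# Demo on an in-memory file (never written to disk).
with h5py.File("promote_demo.h5", "w", driver="core", backing_store=False) as f:
    path = "sample/group-metadata/default_values"
    ds = f.create_dataset(path, data=np.array(b'{"target_gene":"ITS2"}'))
    ds.attrs["data_type"] = "json"

    changed = promote_scalar_to_1d(f, path)
    new_shape = f[path].shape
    new_value = f[path][0]
    data_type = f[path].attrs["data_type"]

print(changed, new_shape, new_value, data_type)
```

Attribute values are copied as h5py returns them, so their on-disk HDF5 types may differ slightly from the original file; working on a copy of the file would be prudent.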

wasade commented 2 months ago

@thomasstjerne, that is super cool, thank you for sharing! It seems like it may be a basis for a general-purpose biom Table object in JavaScript too!

Since there is a workaround here, I'm going to keep this issue open and defer resolution to a future release.