biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org
Other
89 stars 95 forks source link

Required attribute table type not checked consistently #903

Open peterjc opened 1 year ago

peterjc commented 1 year ago

Reproducible example

Example code based on https://biom-format.org/documentation/table_objects.html#examples

import numpy as np
from biom.table import Table

data = np.arange(40).reshape(10, 4)
sample_ids = ["S%d" % i for i in range(4)]
observ_ids = ["O%d" % i for i in range(10)]
table = Table(
    data,
    observ_ids,
    sample_ids,
    # observ_metadata,
    # sample_metadata,
    # table_id='Example Table'
)

from biom.util import biom_open

with biom_open("example-hdf5.biom", "w") as handle:
    table.to_hdf5(handle, generated_by="BIOM Pycode", compress=True)

with open("example-json.biom", "w") as handle:
    handle.write(table.to_json(generated_by="BIOM Pycode"))

Follow this by command line validation of the output:

$ biom validate-table -i example-hdf5.biom 
$ biom validate-table -i example-json.biom 

Actual behaviour:

Python script runs without error (bad).

HDF5 file passess validation (bad):

$ biom validate-table -i example-hdf5.biom 

The input file is a valid BIOM-formatted file.

JSON file fails validation (good):

$ biom validate-table -i example-json.biom 
Unknown table type, however that is likely okay.
The input file is not a valid BIOM-formatted file.

Expected behaviour:

Runtime error during Table.__init__ since defaults include type=None and validate=True by default, and for all BIOM formats to date, type is a required top level attribute.

https://biom-format.org/documentation/format_versions/biom-1.0.html https://biom-format.org/documentation/format_versions/biom-2.0.html https://biom-format.org/documentation/format_versions/biom-2.1.html

Furthermore, using the example BIOM files as generated without the table type, both the JSON and the HDF5 ought to fail consistently.

wasade commented 1 year ago

Hm, this is going to be a delicate one. If type=None raises, it at a minimum triggers a minor release as this is breaking behavior for the current API and it will create headaches for many users of the library. I'm inclined to, if None, to default to "OTU table" and raise a warning with a deprecation notice that this behavior will be unsupported in the future.

That is correct the validation should be consistent -- good catch.

peterjc commented 1 year ago

I agree that's a good plan - add a deprecation warning in a revision release of the tool, and upgrade to an exception in a minor release.

At least fixing the HDF5 validation to be stricter shouldn't be so complicated 😃

I found this bug because (following the examples), I'd not set the table type myself. In my case "OUT table" is the best match from the options defined.

wasade commented 1 year ago

Ah yes the classic "OUT table", which coincidentally is a common interpretation by Word :)

gregcaporaso commented 1 year ago

In some offline discussion with @wasade, he suggested that maybe we could grandfather type=None in as allowable by the format. I'm in favor of that as it hasn't caused any problems to-date (that I'm aware of). Changing the behavior would cause other tools to need updating (e.g., QIIME 2, which doesn't set this value - see below), and would make old biom-formatted tables not work with new versions of the software.

In [1]: import qiime2
In [2]: import biom
In [3]: t = qiime2.Artifact.load('table.qza').view(biom.Table)
In [5]: print(t.type)
None
wasade commented 1 year ago

Thank you, @gregcaporaso!

peterjc commented 6 months ago

Also using biom.Table.from_tsv(...) does not have a type= argument to set the table type.