biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

Metadata order not preserved via biom export-metadata #907

Closed peterjc closed 1 year ago

peterjc commented 1 year ago

This script produces a simple test case with 4 samples and 10 observations. Every observation has the metadata keys heads, shoulders, knees and toes (non-alphabetical order, as per the song), and every sample has the metadata keys lions, tigers and bears (oh my; again, non-alphabetical order):

import numpy as np
from biom.table import Table

data = np.arange(40).reshape(10, 4)
sample_ids = ["S%d" % i for i in range(4)]
observ_ids = ["O%d" % i for i in range(10)]
table = Table(
    data,
    observ_ids,
    sample_ids,
    observation_metadata=[
        {"heads": oid + "i", "shoulders": oid + "ii", "knees": oid + "iii", "toes": oid + "iv"}
        for oid in observ_ids
    ],
    sample_metadata=[
        {"lions": sid + "i", "tigers": sid + "ii", "bears": sid + "iii"}
        for sid in sample_ids
    ],
    # table_id='Example Table',
    type="OTU Table"
)

from biom.util import biom_open

with biom_open("example-hdf5.biom", "w") as handle:
    table.to_hdf5(handle, generated_by="BIOM Pycode", compress=True)

with open("example-json.biom", "w") as handle:
    handle.write(table.to_json(generated_by="BIOM Pycode"))

The exported metadata does not respect the original order:

$ biom --version
biom, version 2.1.14
$ biom export-metadata -i example-json.biom --sample-metadata-fp /dev/stdout --observation-metadata-fp /dev/stdout
    bears   lions   tigers
S0  S0iii   S0i S0ii
S1  S1iii   S1i S1ii
S2  S2iii   S2i S2ii
S3  S3iii   S3i S3ii
    heads   knees   shoulders   toes
O0  O0i O0iii   O0ii    O0iv
O1  O1i O1iii   O1ii    O1iv
O2  O2i O2iii   O2ii    O2iv
O3  O3i O3iii   O3ii    O3iv
O4  O4i O4iii   O4ii    O4iv
O5  O5i O5iii   O5ii    O5iv
O6  O6i O6iii   O6ii    O6iv
O7  O7i O7iii   O7ii    O7iv
O8  O8i O8iii   O8ii    O8iv
O9  O9i O9iii   O9ii    O9iv

On the bright side, the table itself seems to respect the order, as does the raw JSON output (line breaks added by hand):

{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","matrix_type": "sparse","generated_by": "BIOM Pycode","date": "2023-03-13T10:36:04.063255","type": "OTU Table","matrix_element_type": "float","shape": [10, 4],
"data": [[0,1,1.0],[0,2,2.0],[0,3,3.0],[1,0,4.0],[1,1,5.0],[1,2,6.0],[1,3,7.0],[2,0,8.0],[2,1,9.0],[2,2,10.0],[2,3,11.0],[3,0,12.0],[3,1,13.0],[3,2,14.0],[3,3,15.0],[4,0,16.0],[4,1,17.0],[4,2,18.0],[4,3,19.0],[5,0,20.0],[5,1,21.0],[5,2,22.0],[5,3,23.0],[6,0,24.0],[6,1,25.0],[6,2,26.0],[6,3,27.0],[7,0,28.0],[7,1,29.0],[7,2,30.0],[7,3,31.0],[8,0,32.0],[8,1,33.0],[8,2,34.0],[8,3,35.0],[9,0,36.0],[9,1,37.0],[9,2,38.0],[9,3,39.0]],
"rows": [{"id": "O0", "metadata": {"heads": "O0i", "shoulders": "O0ii", "knees": "O0iii", "toes": "O0iv"}},{"id": "O1", "metadata": {"heads": "O1i", "shoulders": "O1ii", "knees": "O1iii", "toes": "O1iv"}},{"id": "O2", "metadata": {"heads": "O2i", "shoulders": "O2ii", "knees": "O2iii", "toes": "O2iv"}},{"id": "O3", "metadata": {"heads": "O3i", "shoulders": "O3ii", "knees": "O3iii", "toes": "O3iv"}},{"id": "O4", "metadata": {"heads": "O4i", "shoulders": "O4ii", "knees": "O4iii", "toes": "O4iv"}},{"id": "O5", "metadata": {"heads": "O5i", "shoulders": "O5ii", "knees": "O5iii", "toes": "O5iv"}},{"id": "O6", "metadata": {"heads": "O6i", "shoulders": "O6ii", "knees": "O6iii", "toes": "O6iv"}},{"id": "O7", "metadata": {"heads": "O7i", "shoulders": "O7ii", "knees": "O7iii", "toes": "O7iv"}},{"id": "O8", "metadata": {"heads": "O8i", "shoulders": "O8ii", "knees": "O8iii", "toes": "O8iv"}},{"id": "O9", "metadata": {"heads": "O9i", "shoulders": "O9ii", "knees": "O9iii", "toes": "O9iv"}}],
"columns": [{"id": "S0", "metadata": {"lions": "S0i", "tigers": "S0ii", "bears": "S0iii"}},{"id": "S1", "metadata": {"lions": "S1i", "tigers": "S1ii", "bears": "S1iii"}},{"id": "S2", "metadata": {"lions": "S2i", "tigers": "S2ii", "bears": "S2iii"}},{"id": "S3", "metadata": {"lions": "S3i", "tigers": "S3ii", "bears": "S3iii"}}]}
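The key order in the JSON can also be checked programmatically rather than by eye. The following is a minimal sketch using only the standard library; the `excerpt` string is a hand-copied fragment of the first row and first column from the JSON above, not the full file, and it relies on Python 3.7+ dictionaries keeping insertion order.

```python
import json

# Hand-copied excerpt of the first "rows" and "columns" entries shown above.
excerpt = """
{"rows": [{"id": "O0", "metadata": {"heads": "O0i", "shoulders": "O0ii", "knees": "O0iii", "toes": "O0iv"}}],
 "columns": [{"id": "S0", "metadata": {"lions": "S0i", "tigers": "S0ii", "bears": "S0iii"}}]}
"""

doc = json.loads(excerpt)

# json.loads builds plain dicts, which keep keys in document order.
row_keys = list(doc["rows"][0]["metadata"])
col_keys = list(doc["columns"][0]["metadata"])

print(row_keys)  # ['heads', 'shoulders', 'knees', 'toes']
print(col_keys)  # ['lions', 'tigers', 'bears']
```

To run the same check on the real file, the excerpt could be replaced with `json.load(open("example-json.biom"))`.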

I'm not familiar enough with HDF5 to check that directly.

I think the bug is in the Table method metadata_to_dataframe, which uses sorted(...).
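To illustrate the suspected cause, here is a minimal sketch of how sorting the metadata keys loses the original column order, next to one order-preserving alternative. This is not the biom source code: `columns_sorted` and `columns_first_seen` are hypothetical helper names, and the metadata is modelled as a plain list of dicts.

```python
def columns_sorted(metadata):
    """Collect column names alphabetically (matches the output reported above)."""
    return sorted({key for md in metadata for key in md})


def columns_first_seen(metadata):
    """Collect column names in the order they are first encountered."""
    seen = {}  # dicts keep insertion order in Python 3.7+
    for md in metadata:
        for key in md:
            seen.setdefault(key, None)
    return list(seen)


# Same sample metadata as in the test case above.
sample_metadata = [
    {"lions": f"S{i}i", "tigers": f"S{i}ii", "bears": f"S{i}iii"}
    for i in range(4)
]

print(columns_sorted(sample_metadata))      # ['bears', 'lions', 'tigers']
print(columns_first_seen(sample_metadata))  # ['lions', 'tigers', 'bears']
```

The first-seen approach also copes with entries whose keys differ, since any key that only appears in later entries is appended at the end.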

peterjc commented 1 year ago

Desired output (e.g. with #910):

$ biom export-metadata -i example-json.biom --sample-metadata-fp /dev/stdout --observation-metadata-fp /dev/stdout
    lions   tigers  bears
S0  S0i S0ii    S0iii
S1  S1i S1ii    S1iii
S2  S2i S2ii    S2iii
S3  S3i S3ii    S3iii
    heads   shoulders   knees   toes
O0  O0i O0ii    O0iii   O0iv
O1  O1i O1ii    O1iii   O1iv
O2  O2i O2ii    O2iii   O2iv
O3  O3i O3ii    O3iii   O3iv
O4  O4i O4ii    O4iii   O4iv
O5  O5i O5ii    O5iii   O5iv
O6  O6i O6ii    O6iii   O6iv
O7  O7i O7ii    O7iii   O7iv
O8  O8i O8ii    O8iii   O8iv
O9  O9i O9ii    O9iii   O9iv