Open pbuttigieg opened 5 years ago
Relevant documentation from the BIOM spec:
BIOM seems to accept a basic [T,C]SV-formatted table for its metadata slot. I haven't seen any other serialisation options.
The CLI accepts TSV only, but it seems that without passing any other parameters (--float-fields, --sc-separated) all metadata values are parsed as strings (even if they are arrays). So I'm not sure if TSV is the best option, as you would need metadata about the metadata to be able to properly add it to a BIOM file. I'm assuming that metadata will look very different between different use cases. JSON would solve this problem, but then you would need to wrap biom add-metadata or add JSON support to the CLI (relevant code sections here and here).
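To make the typing problem concrete, here is a minimal sketch using only the Python standard library. The sample ID, field names, and values are made up for illustration, and the biom CLI itself is not called. It only shows that a plain TSV read yields strings for every field, while the same record in JSON keeps its number and array types.

```python
import csv
import io
import json

# Hypothetical one-sample metadata, once as TSV and once as JSON.
TSV_TEXT = "#SampleID\tdepth_m\ttaxa\nS1\t12.5\tBacteria;Proteobacteria\n"
JSON_TEXT = '{"S1": {"depth_m": 12.5, "taxa": ["Bacteria", "Proteobacteria"]}}'


def read_tsv_metadata(text):
    """Read TSV into {sample_id: {field: value}}; every value stays a string."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    id_column = reader.fieldnames[0]  # first column holds the sample ID
    return {
        row[id_column]: {k: v for k, v in row.items() if k != id_column}
        for row in reader
    }


def read_json_metadata(text):
    """Read JSON into the same shape; numbers and arrays keep their types."""
    return json.loads(text)


tsv_md = read_tsv_metadata(TSV_TEXT)
json_md = read_json_metadata(JSON_TEXT)

print(type(tsv_md["S1"]["depth_m"]).__name__)   # str
print(type(json_md["S1"]["depth_m"]).__name__)  # float
print(type(json_md["S1"]["taxa"]).__name__)     # list
```

With TSV, something outside the file has to say that depth_m is a float and that taxa is a semicolon-separated list, which is the "metadata about the metadata" mentioned above.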
Thanks @pieterprovoost
It does seem that this slot in the BIOM format is a bit weak. @jdeck88 @cuttlefishh - would you know if the BIOM developers would be able to add some JSON support here?
xref to: https://github.com/GLOMICON/asvBiomXchange/issues/2#issuecomment-526321930
BIOM format was originally JSON. They switched to HDF5 ~2013 because of the significant reductions in file sizes with HDF5 due to the way it handles null values, which are common in sparse datasets like ASV observation tables. From this perspective, I would guess that adding JSON support is not likely. Would the whole table have to be in JSON format? In that case, one would lose the advantages of HDF5.
I keep my metadata separate from my BIOM tables, for a few reasons:
Luke
File formats used by QIIME 2 (an update from my previous post):
Observation table: BIOM file encoded as QIIME 2 archive (.qza)
Sample metadata: Tab-separated values text file (.tsv)
Feature metadata (taxonomy): Tab-separated values text file (.tsv), with columns "Feature ID", "Taxon", "Confidence", encoded as QIIME 2 archive (.qza)
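As a rough illustration of the feature metadata layout, the sketch below reads a taxonomy table with the three column names listed above. It assumes the table has already been exported from the .qza archive to plain TSV. The two rows and the semicolon-separated rank strings are invented, and QIIME 2 itself is not used.

```python
import csv
import io

# Hypothetical two-row taxonomy table with the column names listed above.
TAXONOMY_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\tk__Bacteria; p__Proteobacteria\t0.98\n"
    "asv2\tk__Archaea\t0.75\n"
)


def read_taxonomy(text):
    """Return {feature_id: {"taxon": [ranks...], "confidence": float}}."""
    result = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        result[row["Feature ID"]] = {
            # Assumes ranks are separated by ";" in the Taxon column.
            "taxon": [rank.strip() for rank in row["Taxon"].split(";")],
            "confidence": float(row["Confidence"]),
        }
    return result


taxonomy = read_taxonomy(TAXONOMY_TSV)
print(taxonomy["asv1"]["taxon"])
```

Note that the column types (list for Taxon, float for Confidence) are hard-coded in the reader here, since the TSV does not carry them.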
Thanks @cuttlefishh
We should ping the devs to see if they have any interest / capacity in updating their metadata handling
Machine readability is key here, so something like JSON or a very structured / strictly controlled CSV would be an asset.
If the HDF5 savings are so significant, then I suppose it's best to stick with that for the OTU tables.
Hi Daniel (@wasade), would you be able to comment on BIOM-format's metadata functionality, specifically whether it could support multiple data types, possibly through JSON? Please see the thread above for reference. Thanks! Luke
+1 for keeping the metadata in a separate file. I would prefer using comma-separated values (CSV) format, which is more of a standard than tab-separated values, but that is a minor distinction. Column headers defined using a community standard vocabulary (e.g. Darwin Core terms) will make the files easy to use and interpret. I'm not as familiar with ASV formats, but HDF5 sounds very useful for keeping file sizes low.
John
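One way to enforce community-vocabulary column headers is to check a CSV header row against a list of permitted terms. The following is a minimal sketch under assumptions: the allowed set is a small hand-picked subset of Darwin Core term names, and the function name and sample rows are hypothetical.

```python
import csv
import io

# Assumed subset of Darwin Core term names; a real profile would list more.
ALLOWED_TERMS = {
    "occurrenceID",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
    "scientificName",
}


def unknown_headers(csv_text, allowed=ALLOWED_TERMS):
    """Return the column headers that are not in the allowed vocabulary."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row is the header
    return [name for name in header if name not in allowed]


good = "occurrenceID,eventDate,decimalLatitude\nocc1,2019-08-30,45.5\n"
bad = "occurrenceID,date,lat\nocc1,2019-08-30,45.5\n"

print(unknown_headers(good))  # []
print(unknown_headers(bad))   # ['date', 'lat']
```

A check like this only covers the header names. Value types and units would still need their own rules.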
What about encapsulating this project into redbiom (https://github.com/biocore/redbiom)? I think that effort covers many of the stated aims of the Xchange from what I'm seeing on the readme. redbiom is (a) permissive in the metadata you can index (b) allows for metadata extraction as TSV or CSV (c) naturally represents the taxonomy (d) the table data can be extracted as BIOM v2.1.0 (HDF5) and (e) allows for representing preparation specific processing. It's highly scalable as well -- we can index hundreds of thousands of samples from Qiita in about 30GB.
This looks like a good idea. To build the workflow around redbiom it would need to have a solid API... is that readily available?
The description of redbiom is published here.
A QIIME 2 oriented tutorial is here highlighting interaction on the command line.
The Python API is not formally described outside of the code at the moment, but all API methods conform to numpydoc standards (e.g., redbiom.fetch.data_from_samples), so rendering with Sphinx as we do for BIOM-Format and scikit-bio would be relatively straightforward. The command line interface is actually just a thin wrapper around the API methods.
Interaction with the database is performed indirectly through a RESTful API provided by webdis. The API is just the Redis database commands though. The Redis commands and key structures are part of the documentation as well for the methods which interact with Redis (e.g., in the private method redbiom.fetch._biom_from_samples). The data model for Redis is laid out as well in the readme. What this allows for, though it does not exist yet and would be awesome, is a JavaScript layer of interaction with the RESTful API.
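To give a feel for what talking to webdis looks like, here is a minimal sketch of building a request URL and unpacking a reply. webdis maps a Redis command and its arguments onto URL path segments and returns JSON keyed by the command name. The helper names, the key "example:samples", and the canned reply are hypothetical; they are not redbiom's actual key structure, and no network call is made.

```python
import json
from urllib.parse import quote


def webdis_url(base_url, command, *args):
    """Build a webdis-style URL: /COMMAND/arg1/arg2/... with encoded args."""
    parts = [command.upper()] + [quote(str(arg), safe="") for arg in args]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def parse_webdis_reply(body, command):
    """Extract the payload from a JSON reply keyed by the command name."""
    return json.loads(body)[command.upper()]


# Hypothetical key name; 7379 is the default webdis port.
url = webdis_url("http://127.0.0.1:7379", "smembers", "example:samples")

# Canned reply standing in for what a server might send back.
reply = parse_webdis_reply('{"SMEMBERS": ["S1", "S2"]}', "smembers")

print(url)
print(reply)
```

Because the interface is plain HTTP plus JSON, the same two steps could be done from a browser with fetch, which is why a JavaScript layer seems feasible.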
I'm eager to grow redbiom too and happy to work with the team here to expand functionality, documentation, etc. where necessary.
And, just a minor aside, if you're going to the project page right now it will indicate the build is presently failing on py36 (not py35 or py27). This is due to an output order sensitivity which is on my list of things to address.
Picking up from #2, the metadata either bundled in the BIOM file or accessed via an IRI should have a standardised format.
We'll use this issue to define that format (likely JSON, if BIOM doesn't restrict us).
@jdeck88