GLOMICON / asvBiomXchange

A repository to develop an exchange format for molecular biodiversity data

Define metadata exchange format #8

Open pbuttigieg opened 4 years ago

pbuttigieg commented 4 years ago

Picking up from #2, the metadata either bundled in the BIOM file or accessed via an IRI should have a standardised format.

We'll use this issue to define that format (likely JSON, if BIOM doesn't restrict us).

@jdeck88

pbuttigieg commented 4 years ago

Relevant documentation from the BIOM spec:

http://biom-format.org/documentation/adding_metadata.html

pbuttigieg commented 4 years ago

BIOM seems to accept a basic [T,C]SV formatted table for their metadata slot. Haven't seen any other serialisation options.

pieterprovoost commented 4 years ago

The CLI accepts TSV only, but it seems that without passing any other parameters (--float-fields, --sc-separated) all metadata values are parsed as strings (even if they are arrays). So I'm not sure if TSV is the best option as you would need metadata about the metadata to be able to properly add it to a BIOM file. I'm assuming that metadata will look very different between different use cases. JSON would solve this problem, but then you would need to wrap biom add-metadata or add JSON support to the CLI (relevant code sections here and here).
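The type-loss problem described above can be demonstrated with nothing but the standard library (the sample IDs and values below are made up for illustration):

```python
import csv
import io
import json

# A metadata record serialised as TSV: every value comes back as a
# string, so numeric fields and array-valued fields need out-of-band
# type information ("metadata about the metadata").
tsv = "sample_id\tdepth\ttaxonomy\nS1\t12.5\tk__Bacteria;p__Proteobacteria\n"
row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(type(row["depth"]).__name__)      # prints "str", not "float"

# The same record as JSON keeps native types with no extra flags:
record = json.loads(
    '{"sample_id": "S1", "depth": 12.5,'
    ' "taxonomy": ["k__Bacteria", "p__Proteobacteria"]}'
)
print(type(record["depth"]).__name__)   # prints "float"
```

This is the gap that flags like `--float-fields` and `--sc-separated` paper over on the CLI side: with TSV, the types have to be declared separately from the data.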

pbuttigieg commented 4 years ago

Thanks @pieterprovoost

It does seem that this slot in the BIOM format is a bit weak. @jdeck88 @cuttlefishh - would you know if the BIOM developers would be able to add some JSON support here?

xref to: https://github.com/GLOMICON/asvBiomXchange/issues/2#issuecomment-526321930

cuttlefishh commented 4 years ago

BIOM format was originally JSON. They switched to HDF5 around 2013 because of the significant reduction in file size: HDF5 handles the null values that are common in sparse datasets like ASV observation tables much more compactly. From this perspective, I would guess that adding JSON support is unlikely. Would the whole table have to be in JSON format? In that case, one would lose the advantages of HDF5.

I keep my metadata separate from my BIOM tables, for a few reasons:

  1. BIOM doesn't seem well-suited to handling metadata, as noted by @pieterprovoost regarding the way the metadata are parsed. Treating everything as strings is the safest way to preserve metadata because it avoids inferring data types, but it reduces the functionality of the metadata.
  2. I like to keep my metadata as a text file (TSV) so I can edit it easily and look at it with tabview or Excel. Unlike ASV data, metadata is not sparse and it doesn't create large file sizes, so the advantages of putting it into a compressed format seem minimal except that this would give you all the project data (observation data + metadata) in the same file.
  3. I analyze my ASV data with QIIME 2, and QIIME 2 prefers metadata to be in a separate file (TSV).

Luke

cuttlefishh commented 4 years ago

File formats used by QIIME 2 (an update from my previous post):

- Observation table: BIOM file encoded as a QIIME 2 archive (.qza)
- Sample metadata: tab-separated values text file (.tsv)
- Feature metadata (taxonomy): tab-separated values text file (.tsv) with columns "Feature ID", "Taxon", "Confidence", encoded as a QIIME 2 archive (.qza)
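As a sketch, a feature-metadata (taxonomy) file with those three columns could be produced like this (the feature IDs, taxon strings, and confidence values are invented for illustration):

```python
import csv
import io

# Hypothetical feature metadata rows, for illustration only.
rows = [
    {"Feature ID": "a1b2c3", "Taxon": "k__Bacteria; p__Proteobacteria",
     "Confidence": "0.97"},
    {"Feature ID": "d4e5f6", "Taxon": "k__Archaea",
     "Confidence": "0.88"},
]

# Write a TSV with the column headers QIIME 2 expects for taxonomy.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Feature ID", "Taxon", "Confidence"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```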

pbuttigieg commented 4 years ago

Thanks @cuttlefishh

We should ping the devs to see if they have any interest in (and capacity for) updating their metadata handling.

Machine readability is key here, so something like JSON or a very structured / strictly controlled CSV would be an asset.

If the HDF5 savings are so significant, then I suppose it's best to stick with that for the OTU tables.

cuttlefishh commented 4 years ago

Hi Daniel (@wasade), would you be able to comment on BIOM-format's metadata functionality, specifically whether it could support multiple data types, possibly through JSON? Please see the thread above for reference. Thanks! Luke

jdeck88 commented 4 years ago

+1 for keeping the metadata in a separate file. I would prefer comma-separated values (CSV), which is more of a standard than tab-separated values, but that's a minor distinction. Column headers defined as community-standard vocabulary (e.g. Darwin Core terms) will make the files easy to use and interpret. I'm not as familiar with ASV formats, but HDF5 sounds very useful for keeping file sizes low.

John


wasade commented 4 years ago

What about encapsulating this project into redbiom? I think that effort covers many of the stated aims of the Xchange from what I'm seeing in the readme. redbiom (a) is permissive about the metadata you can index, (b) allows metadata extraction as TSV or CSV, (c) naturally represents the taxonomy, (d) can export table data as BIOM v2.1.0 (HDF5), and (e) allows for representing preparation-specific processing. It's highly scalable as well -- we can index hundreds of thousands of samples from Qiita in about 30GB.

jdeck88 commented 4 years ago

This looks like a good idea. To build the workflow around redbiom it would need to have a solid API... is that readily available?


wasade commented 4 years ago

The description of redbiom is published here.

A QIIME 2 oriented tutorial is here highlighting interaction on the command line.

The Python API is not formally described outside of the code at the moment, but all API methods conform to numpydoc standards (e.g., redbiom.fetch.data_from_samples), so rendering with Sphinx, as we do for BIOM-Format and scikit-bio, would be relatively straightforward. The command line interface is actually just a thin wrapper around the API methods.

Interaction with the database is performed indirectly through a RESTful API provided by webdis. The API is just the Redis database commands, though. The Redis commands and key structures are part of the documentation for the methods that interact with Redis (e.g., in the private method redbiom.fetch._biom_from_samples). The data model for Redis is laid out in the readme as well. What this would allow for (it doesn't exist yet, but it would be awesome) is a JavaScript layer of interaction with the RESTful API.
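To illustrate that mapping, webdis exposes each Redis command as a URL whose path segments are the command tokens. The sketch below only builds the URL string; the host, port, and key name are hypothetical, and the real key structures are the ones documented in the redbiom readme:

```python
def webdis_url(host, *command):
    """Build a webdis-style URL from Redis command tokens.

    Each path segment is one token of the Redis command, e.g.
    ("GET", "some:key") -> http://host/GET/some:key.
    Illustrative only; host and key names here are made up.
    """
    return "http://" + host + "/" + "/".join(str(tok) for tok in command)

# A Redis SMEMBERS on a hypothetical sample-index set:
print(webdis_url("localhost:7379", "SMEMBERS", "metadata:samples"))
# prints "http://localhost:7379/SMEMBERS/metadata:samples"
```

A browser-side JavaScript client would issue exactly these HTTP requests, which is what would make the JavaScript layer mentioned above feasible.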

I'm eager to grow redbiom too and happy to work with the team here to expand functionality, documentation, etc where necessary.

And, just a minor aside: if you go to the project page right now, it will indicate the build is presently failing on py36 (not py35 or py27). This is due to an output-order sensitivity which is on my list of things to address.