bigbio / quantms.io

The proteomics quantification format, extending mzTab for large scale datasets.

Embedding metadata in Parquet files? #4

Open wfondrie opened 1 year ago

wfondrie commented 1 year ago

Hi folks 👋 - I was curious whether you've considered embedding metadata in the Parquet file schemas. The format allows adding arbitrary key-value pairs at both the table and column levels. Here is a small Python example 👇

First we create a toy arrow table with no metadata:

import pyarrow as pa
import pyarrow.parquet as pq

# Create a toy dataset:
no_meta_table = pa.table(dict(
    characters=["Luke", "Han", "Leia", "Ben"],
    is_jedi=[False, False, False, True],
))

# By default, we have no metadata here:
assert no_meta_table.schema.metadata is None

Then we can update the schema to add metadata (note that this can also be done during table creation):

metadata = {
    "movie": "A New Hope", 
    "episode": "4", 
    "year": "1977",
}

meta_table = no_meta_table.replace_schema_metadata(metadata)

# We now have metadata here:
print(meta_table.schema.metadata)
# {b'movie': b'A New Hope', b'episode': b'4', b'year': b'1977'}

This is a shallow copy that shares data, but not schema metadata, with the original table:

# We still have no metadata in the original table:
assert no_meta_table.schema.metadata is None

And we can persist this metadata in parquet files:

# Still no metadata after write->read:
pq.write_table(no_meta_table, "no_meta.parquet")
parsed_no_meta_table = pq.read_table("no_meta.parquet")
assert parsed_no_meta_table.schema.metadata is None

# Persisted metadata after write-read:
pq.write_table(meta_table, "meta.parquet")
parsed_meta_table = pq.read_table("meta.parquet")
print(parsed_meta_table.schema.metadata)
# {b'movie': b'A New Hope', b'episode': b'4', b'year': b'1977'}

My thought is that this may be a good way to sparingly embed metadata, such as what kind of table it is. What do you think? The biggest downside I see for now is that the feature is not well known or documented for Parquet, and I'm not sure how well it is supported across Arrow APIs.


lazear commented 1 year ago

I like the idea. I've been hacking around on a parquet-based replacement/supplement to mzML, and I think the metadata route is how I would go for storing some of the "less important" cvParams/metadata or things that are globally applied for a file.

As you mentioned, support for various parquet features can be somewhat scattered across the ecosystem. Polars, for instance, doesn't support the MAP (dictionary of key->value) column type.