Open wfondrie opened 1 year ago
I like the idea. I've been hacking around on a parquet-based replacement/supplement to mzML, and I think the metadata route is how I would go for storing some of the "less important" cvParams/metadata or things that are globally applied for a file.
As you mentioned, support for various parquet features can be somewhat scattered across the ecosystem. Polars, for instance, doesn't support the MAP (dictionary of key->value) column type.
Hi folks 👋 - I was curious we've considered embedding metadata in the Parquet file schemas. The format does allow for adding arbitrary key value pairs at both the table and column levels. Here is a small python example 👇
First we create a toy arrow table with no metadata:
Then we can update the schema to add metadata (note that this can also be done during table creation):
This is a shallow copy, share data, but not the schema metadata with the other copy:
And we can persist this metadata in parquet files:
My thought is that this may be a good way to embed metadata, such as what kind of table it is, sparingly. What do you think? The biggest downside I see for now is that the feature is not very well known/documented for Parquet and I'm not sure how well supported it is across Arrow APIs.