audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/

Add support for PARQUET file tables #433

Closed hagenw closed 1 week ago

hagenw commented 3 weeks ago

In https://github.com/audeering/audformat/pull/419 we introduced storing tables as PARQUET files in audformat. They differ in two important aspects from tables stored in CSV files, which means we need to adjust the code to handle them.

As stated in https://github.com/audeering/audformat/pull/419, the hash of those files can be read with:

```python
import pyarrow.parquet as parquet

hash = parquet.read_schema(path).metadata[b"hash"].decode()
```
hagenw commented 1 week ago

One question that popped up here was whether we need to take partitioning of the parquet files into account.

I'm not 100% sure how partitioning would influence the possibility to stream the data from backends, but the easiest solution seems to be to store a single parquet file on the server and to use partitioning only when storing the files in the cache. At the moment we store pickle files in the cache, but for large data that does not fit into memory we will need to change this. A first solution is to rely on streaming and not store the tables in the cache at all; a second solution could be to store the data as partitions in the cache, which will then hopefully still fit into memory.

Storing the partitions only in the cache might also solve the issue of deciding which partitions to store if a dataset provides >1000 possible ones, as we could store the data in the cache when a user requests a particular partition.