audeering / audb

Manage audio and video databases
https://audeering.github.io/audb/

Add support for PARQUET file tables #433

Closed hagenw closed 1 week ago

hagenw commented 3 weeks ago

In https://github.com/audeering/audformat/pull/419 we introduced storing tables as PARQUET files in audformat. They differ in two important aspects from tables stored in CSV files, which means we need to adjust the code to handle them.

As stated in https://github.com/audeering/audformat/pull/419, the hash of those files can be read with:

```python
import pyarrow.parquet as parquet

hash = parquet.read_schema(path).metadata[b"hash"].decode()
```
hagenw commented 1 week ago

One question that popped up here was whether we need to take partitioning of the parquet files into account.

I'm not 100% sure how partitioning would influence the possibility to stream the data from backends, but the easiest solution seems to be to store a single parquet file on the server and to use partitioning only when storing the files in the cache. At the moment we store pickle files in the cache, but for large data that does not fit into memory we will need to change this. A first solution is to rely on streaming and not store the tables in the cache at all; a second solution could be to store the data as partitions in the cache, which will then hopefully still fit into memory.

Storing the partitions only in the cache might also solve the issue of deciding which partitions to store if a dataset provides >1000 possible ones, as we could store the data in the cache when a user requests a particular partition.