audeering / audbcards

Data cards for audio datasets
https://audeering.github.io/audbcards/

Add possibility to preview tables #59

Closed hagenw closed 2 months ago

hagenw commented 7 months ago

It might be of interest to allow an interactive preview of tables on the datacard.

For example, one solution could be to pre-load the first 10 rows of every table and add them to the static web page. Another solution might be to provide an interface for selecting a table to preview, so that the first 10 rows of the table are read (and maybe downloaded) only when requested.
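The first option (pre-loading rows at build time) could look roughly like the following. This is only a minimal sketch using the Python standard library: the helper names `preview_rows` and `render_preview` and the example table content are hypothetical and not part of audbcards, and it assumes the table is already available as CSV text.

```python
import csv
import html
import io
from itertools import islice


def preview_rows(csv_text, n=10):
    """Return the header and the first n data rows of a CSV string."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return header, list(islice(reader, n))


def render_preview(header, rows):
    """Render header and rows as a minimal, escaped HTML table."""
    head = "".join(f"<th>{html.escape(name)}</th>" for name in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Hypothetical table content, for illustration only
csv_text = "file,speaker\n" + "".join(
    f"audio/{i:03d}.wav,s{i}\n" for i in range(25)
)
header, rows = preview_rows(csv_text)
table_html = render_preview(header, rows)
```

The resulting `table_html` string could then be embedded in the static page when the datacard is built, so no request is needed when the page is viewed.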

hagenw commented 7 months ago

For the preview of tables, it would also be interesting to see whether we could benefit from storing the tables in a different format (e.g. parquet) in the repository. For example, it might then be possible to stream just the first 10 rows from the repository when requested, instead of downloading the whole table.

hagenw commented 3 months ago

Good news: when tables are stored as PARQUET files on the backend, we can preview them without downloading the whole file.

The following example demonstrates this with a dependency table from our internal server, as we don't yet have a real table published on the server (copied from https://github.com/audeering/audformat/issues/376#issuecomment-2182711399):

import aiohttp
import fsspec
import pyarrow.parquet as parquet

import audbackend

host = "https://artifactory.audeering.com/artifactory"
auth = audbackend.backend.Artifactory.get_authentication(host)
repository = "data-public-local"

# Prepare fsspec https file-system to communicate with Artifactory
fs = fsspec.filesystem("https", auth=aiohttp.BasicAuth(auth[0], auth[1]))

# Preview dependency table of casual-conversations-v2 dataset
dataset = "casual-conversations-v2"
version = "1.0.0"
url = f"{host}/{repository}/{dataset}/db/{version}/db-{version}.parquet"
file = parquet.ParquetFile(url, filesystem=fs)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())

which returns

                                      file                               archive  bit_depth  channels  ... removed  sampling_rate type  version
0                      db.disabilities.csv                          disabilities          0         0  ...       0              0    0    1.0.0
1                             db.files.csv                                 files          0         0  ...       0              0    0    1.0.0
2               db.physical-adornments.csv                   physical-adornments          0         0  ...       0              0    0    1.0.0
3               db.physical-attributes.csv                   physical-attributes          0         0  ...       0              0    0    1.0.0
4                         db.recording.csv                             recording          0         0  ...       0              0    0    1.0.0
5                         db.skin-tone.csv                             skin-tone          0         0  ...       0              0    0    1.0.0
6                           db.speaker.csv                               speaker          0         0  ...       0              0    0    1.0.0
7  audio/0000_portuguese_nonscripted_1.wav  f76b3d4a-a172-63ee-22f2-fb2255d692ee         16         1  ...       0          48000    1    1.0.0
8  audio/0000_portuguese_nonscripted_2.wav  81db070f-69a1-ab92-a365-ca95ac36c893         16         1  ...       0          48000    1    1.0.0
9  audio/0000_portuguese_nonscripted_3.wav  d4572eb1-d458-7717-2145-a7861208b8da         16         1  ...       0          48000    1    1.0.0

[10 rows x 11 columns]

This means it should now be much easier to integrate a fast table preview feature, at least for tables we store as PARQUET. For the CSV tables it might be slightly more complicated, as those are stored inside a ZIP file, and we would need to read the first 10 rows of the CSV file from within the ZIP file without downloading the whole archive. I think this should also be possible, but I don't know how yet.
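One possible direction for the CSV case, as a minimal sketch only: Python's `zipfile` module works on any seekable file object and decompresses a member lazily, so reading the first rows does not require extracting the whole member. The helper name `head_of_csv_in_zip`, the member name `db.emotion.csv`, and the table content below are hypothetical, and the archive is built in memory for illustration. Whether this avoids downloading the whole archive from the backend would depend on passing a remote file object that supports range requests, which is an untested assumption here.

```python
import csv
import io
import zipfile
from itertools import islice


def head_of_csv_in_zip(archive, member, n=10):
    """Return the header and first n data rows of a CSV stored in a ZIP."""
    with zipfile.ZipFile(archive) as zf:
        # zf.open() returns a stream that decompresses only what is read
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader, [])
            return header, list(islice(reader, n))


# Build a small in-memory archive with a hypothetical table
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    lines = ["file,emotion"]
    lines += [f"audio/{i:03d}.wav,neutral" for i in range(1000)]
    zf.writestr("db.emotion.csv", "\n".join(lines) + "\n")
buffer.seek(0)

header, rows = head_of_csv_in_zip(buffer, "db.emotion.csv")
```

Here `archive` can be a path or any seekable binary file object, so the in-memory buffer could in principle be swapped for a remote file opened through a file-system layer, assuming that layer supports seeking.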

/cc @ChristianGeng