Closed hagenw closed 2 months ago
For the preview of tables, it would also be interesting to see whether we could benefit from storing the tables in a different format (e.g. parquet) in the repository. For example, it might then be possible to stream just the first 10 lines from the repo when requested, instead of downloading the whole table.
Good news: when storing tables as PARQUET files on the backend, we can preview them without downloading the whole file.
The following example demonstrates this with a dependency table from our internal server, as we don't have a real table published on the server yet (copied from https://github.com/audeering/audformat/issues/376#issuecomment-2182711399):
```python
import aiohttp
import fsspec
import pyarrow.parquet as parquet

import audbackend

host = "https://artifactory.audeering.com/artifactory"
auth = audbackend.backend.Artifactory.get_authentication(host)
repository = "data-public-local"

# Prepare fsspec https file-system to communicate with Artifactory
fs = fsspec.filesystem("https", auth=aiohttp.BasicAuth(auth[0], auth[1]))

# Preview dependency table of casual-conversations-v2 dataset
dataset = "casual-conversations-v2"
version = "1.0.0"
url = f"{host}/{repository}/{dataset}/db/{version}/db-{version}.parquet"
file = parquet.ParquetFile(url, filesystem=fs)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())
```
This returns:
```
                                      file                               archive  bit_depth  channels  ...  removed  sampling_rate  type  version
0                      db.disabilities.csv                          disabilities          0         0  ...        0              0     0    1.0.0
1                             db.files.csv                                 files          0         0  ...        0              0     0    1.0.0
2               db.physical-adornments.csv                   physical-adornments          0         0  ...        0              0     0    1.0.0
3               db.physical-attributes.csv                   physical-attributes          0         0  ...        0              0     0    1.0.0
4                         db.recording.csv                             recording          0         0  ...        0              0     0    1.0.0
5                         db.skin-tone.csv                             skin-tone          0         0  ...        0              0     0    1.0.0
6                           db.speaker.csv                               speaker          0         0  ...        0              0     0    1.0.0
7  audio/0000_portuguese_nonscripted_1.wav  f76b3d4a-a172-63ee-22f2-fb2255d692ee         16         1  ...        0          48000     1    1.0.0
8  audio/0000_portuguese_nonscripted_2.wav  81db070f-69a1-ab92-a365-ca95ac36c893         16         1  ...        0          48000     1    1.0.0
9  audio/0000_portuguese_nonscripted_3.wav  d4572eb1-d458-7717-2145-a7861208b8da         16         1  ...        0          48000     1    1.0.0

[10 rows x 11 columns]
```
This means it should now be much easier to integrate a fast table preview feature, at least for tables we store as PARQUET. For CSV tables it might be slightly more complicated, as those are stored inside a ZIP file and we would need to read the first 10 rows of the CSV from within the archive. I think that should also be possible, but I don't know how yet.
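One possible direction for the CSV case, as a rough sketch and not something tested against Artifactory: `zipfile.ZipFile` accepts any seekable binary file-like object, reads the archive index from the end of the file, and decompresses a member as a stream. If the remote file is opened through fsspec and the server supports HTTP range requests (an assumption), only parts of the archive should need to be fetched. The helper name `preview_csv_in_zip` and the in-memory archive below are hypothetical and only stand in for a real table.

```python
import csv
import io
import itertools
import zipfile


def preview_csv_in_zip(fileobj, member, nrows=10):
    """Return the header and the first `nrows` data rows of a CSV inside a ZIP.

    `fileobj` can be any seekable binary file-like object,
    e.g. an in-memory buffer or (assumption) a file opened with fsspec.
    """
    with zipfile.ZipFile(fileobj) as archive:
        with archive.open(member) as raw:
            # Decode the compressed stream lazily and stop after `nrows` rows
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader)
            rows = list(itertools.islice(reader, nrows))
    return header, rows


# Hypothetical local stand-in for a ZIP archive stored on the backend
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    lines = ["file,speaker"]
    lines += [f"audio/{n:04d}.wav,{n % 7}" for n in range(1000)]
    archive.writestr("db.files.csv", "\n".join(lines) + "\n")
buffer.seek(0)

header, rows = preview_csv_in_zip(buffer, "db.files.csv")
print(header)
for row in rows:
    print(row)
```

For a remote archive, the idea would be to pass the object returned by `fs.open(url)` from the PARQUET example instead of `buffer`. How many bytes are actually transferred would depend on the server and on fsspec's caching settings, so this would need to be measured.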
/cc @ChristianGeng
It might be of interest to allow an interactive preview of tables on the datacard.
E.g. one solution could be to pre-load the first 10 lines of every table and add them to the static web page. Another solution might be to provide an interface for selecting a table to preview, so that the first 10 lines of the table are only read (and maybe downloaded) when requested.
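The first option (pre-loading the rows into the static page) could look roughly like the sketch below. It assumes the preview is already available as a header plus a list of rows and that plain HTML is an acceptable target for the datacard. The function name `render_preview_table` is hypothetical, and the sample rows are taken from the dependency table output shown in this thread.

```python
import html


def render_preview_table(header, rows):
    """Render a table preview as a static HTML <table> snippet."""
    # Escape all values so that file names cannot break the markup
    head = "".join(f"<th>{html.escape(str(name))}</th>" for name in header)
    body = "\n".join(
        "  <tr>"
        + "".join(f"<td>{html.escape(str(value))}</td>" for value in row)
        + "</tr>"
        for row in rows
    )
    return f"<table>\n  <tr>{head}</tr>\n{body}\n</table>"


snippet = render_preview_table(
    ["file", "archive"],
    [
        ["db.files.csv", "files"],
        ["db.speaker.csv", "speaker"],
    ],
)
print(snippet)
```

The second option would instead fetch the rows only when a table is selected, which keeps the static page small but requires the page to reach the backend at view time.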