Dataset verification is slow when a dataset contains lots of small files, especially on network storage such as NFS drives.
To reproduce
Download a dataset, then request it again so that it is served from the cache:
from clearml import Dataset
d = Dataset.get(dataset_id="abcdefg")
# Populate cache, verification happens here and is slow
d.get_local_copy()
# Verification on a pre-downloaded/cached dataset is also slow
d.get_local_copy()
Expected behaviour
Verification (i.e. file size checking) should be able to run in parallel. In theory this helps on certain storage types, especially NFS drives backed by multiple copies of the stored data (e.g. Ceph, GlusterFS, or in my case, GCP Filestore).
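As a rough illustration of the idea, here is a minimal sketch of parallel file size checking using only the Python standard library. It is not ClearML's actual verification code: the function name `find_size_mismatches`, the `expected_sizes` mapping, and the default of 16 worker threads are all assumptions made for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_size_mismatches(root, expected_sizes, max_workers=16):
    """Return the relative paths whose on-disk size differs from the expected one.

    `expected_sizes` maps a path relative to `root` to its expected size in bytes.
    Files that are missing or unreadable are reported as mismatches.
    This is a hypothetical helper, not part of the ClearML API.
    """
    root = Path(root)

    def check(item):
        rel_path, expected = item
        try:
            actual = os.stat(root / rel_path).st_size
        except OSError:
            return rel_path  # missing or unreadable file
        return rel_path if actual != expected else None

    # stat() calls are I/O bound, so threads can overlap the per-file latency
    # that dominates on network file systems.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(check, expected_sizes.items())
        return sorted(path for path in results if path is not None)
```

Threads are used rather than processes because each check spends its time waiting on a `stat` call, not on computation. Whether this actually speeds things up depends on the storage backend, so the worker count would need to be tuned or made configurable.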
Environment
Related Discussion
Slack thread: https://odin-vision.slack.com/archives/C055MNE258R/p1696591022780369