allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Bug/Enhancement: Slow dataset verification #1130

Closed by charlienewey-odin 12 months ago

charlienewey-odin commented 12 months ago

Describe the bug

Dataset verification is slow for datasets that contain many small files. This is especially noticeable on network filesystems such as NFS.

To reproduce

Get a local copy of a dataset, then get it again. Both calls are slow, even though the second one is served from the cache.

from clearml import Dataset

d = Dataset.get(dataset_id="abcdefg")

# First call populates the local cache; verification runs here and is slow
d.get_local_copy()

# Second call uses the already-cached dataset, but verification runs again and is still slow
d.get_local_copy()

Expected behaviour

Verification (i.e. checking file sizes) could in principle run in parallel. This should help on certain storage types, especially NFS drives backed by multiple copies of the stored data (e.g. Ceph, GlusterFS, or in my case, GCP Filestore).
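
To illustrate the idea, here is a minimal sketch of a parallel file size check using a thread pool. It is not ClearML's implementation: the helper names (`_size_matches`, `find_size_mismatches`), the `expected_sizes` mapping of relative path to size in bytes, and the `max_workers` default are all hypothetical and only meant to show the shape of the approach.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _size_matches(root, rel_path, expected_size):
    """Return True if root/rel_path exists and has the expected size in bytes."""
    try:
        return os.stat(Path(root) / rel_path).st_size == expected_size
    except OSError:
        # Missing or unreadable files count as a failed check.
        return False


def find_size_mismatches(root, expected_sizes, max_workers=16):
    """Check file sizes concurrently (hypothetical helper, not a ClearML API).

    expected_sizes maps relative path -> expected size in bytes.
    Returns a sorted list of relative paths that are missing or differ in size.
    """
    items = list(expected_sizes.items())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        results = pool.map(lambda item: _size_matches(root, item[0], item[1]), items)
        return sorted(rel for (rel, _), ok in zip(items, results) if not ok)
```

Threads seem like a reasonable fit here because each `os.stat` call mostly waits on the filesystem rather than using the CPU, so many lookups can be in flight at once on a high-latency network drive. The best `max_workers` value would likely depend on the storage backend.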

Environment

Related Discussion

Slack thread: https://odin-vision.slack.com/archives/C055MNE258R/p1696591022780369