allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.42k stars 643 forks

Local data sync into clearml-data #1246

Open nikiniki1 opened 2 months ago

nikiniki1 commented 2 months ago

Hi! I'm going to use clearml data like this:

  1. I have a dataset of roughly 700 GB. When I work on a problem, I select a subsample of it to use as train/test data, and I feed only a txt file with the paths (data_path) of that subsample.
  2. With clearml I have to initialize a dataset (dataset = Dataset()) and then call dataset.sync_folder(). But used this way, clearml chunks my data and uploads it to file storage, so I end up with duplicates of the data.
  3. I don't want clearml to duplicate the data; I just want it to monitor the shared folder that holds all the data and show only the paths of the selected files. How can I solve this problem?

ainoam commented 2 months ago

@nikiniki1 Dataset.sync_folder is intended to do exactly that: synchronize data between two locations. If your use case uses a single location, I think Dataset.add_external_files is what you need.
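
As a rough sketch of that suggestion, the snippet below reads the txt file of subsample paths and registers them as external file:// links instead of uploading copies. The helper names (read_subsample_uris, register_subsample) and the one-path-per-line file layout are assumptions made for illustration, and passing a list to source_url assumes a reasonably recent clearml release; older versions may need one call per link.

```python
from pathlib import Path


def read_subsample_uris(list_file):
    """Turn a text file of paths (one per line, assumed layout) into file:// URIs."""
    uris = []
    for line in Path(list_file).read_text().splitlines():
        line = line.strip()
        if line:  # skip blank lines
            uris.append(Path(line).expanduser().resolve().as_uri())
    return uris


def register_subsample(list_file, dataset_name, dataset_project):
    """Hypothetical helper: create a dataset version that only links to the files."""
    # Imported here so the helper above also works without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    # Register links to the files in the shared folder; nothing is copied.
    dataset.add_external_files(source_url=read_subsample_uris(list_file))
    dataset.upload()    # stores the dataset state; the linked files stay in place
    dataset.finalize()  # close this dataset version
    return dataset.id
```

Calling register_subsample("data_path.txt", "subsample-a", "my_project") (placeholder names) should produce a dataset version that lists only the selected paths, as long as every machine that consumes it can reach the shared folder at the same location.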

Does this help?