allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.43k stars 643 forks

Datasets with local storage & changing output_uri #1217

Closed nfzd closed 4 months ago

nfzd commented 4 months ago

Proposal Summary

Make the base output_uri of Dataset artifacts configurable somehow if the storage location changes.

One possibility would be to add a kwarg to Dataset.finalize() and Dataset.get() which can rename the artifact URIs. I could send a PR for that if you agree this is a good solution?

Motivation

We need to store our datasets on a network drive. We also have Linux workers and users with Windows.

The network drive is mounted at some location, say, /mnt/data, on the workers. This path cannot be used on Windows, where it will be something like Z:\. (We tried some hacks with network paths on Windows, but did not find a working solution.)

Windows users should be able to both create datasets and use them locally. The Linux agents also need to be able to load them.

Proposal

The clean solution IMHO would be to store the path that the agents will use on the server. This would require something like:

  1. On Windows, create the dataset, use output_uri='Z:\'
  2. Run Dataset.finalize() with an extended version which can rename Z:\ to /mnt/data
  3. Agents: will work just fine.
  4. Loading the dataset on Windows: run Dataset.get() with an extended version which can rename /mnt/data back to Z:\

The extension would in both cases be something like a kwarg

def finalize(
    ...
    output_uri_renamer: Optional[Callable] = None,
    ...
)

which (if passed) can rename the artifact paths before saving in finalize() and before loading in get(). You would call it, in our case, with:

dataset.finalize(
    output_uri_renamer=lambda path: path.replace("\\", "/").replace("Z:", "/mnt/data")
)
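
For illustration, the two renaming directions described above (Z:\ to /mnt/data before saving, and back again before loading on Windows) could be sketched as a pair of plain functions. This is only a sketch: `windows_to_linux` and `linux_to_windows` are hypothetical names, the drive letter and mount point are the example values from this issue, and `output_uri_renamer` is the proposed kwarg, not an existing ClearML parameter.

```python
def windows_to_linux(path: str, drive: str = "Z:", mount: str = "/mnt/data") -> str:
    """Map a Windows path on the network drive to the agents' mount point."""
    normalized = path.replace("\\", "/")
    if not normalized.lower().startswith(drive.lower()):
        return path  # not on the network drive, leave untouched
    rest = normalized[len(drive):].lstrip("/")
    return f"{mount}/{rest}" if rest else mount


def linux_to_windows(path: str, drive: str = "Z:", mount: str = "/mnt/data") -> str:
    """Inverse mapping, for loading the dataset on Windows."""
    # Require a full path-component match so /mnt/database is not rewritten.
    if path != mount and not path.startswith(mount + "/"):
        return path
    rest = path[len(mount):].lstrip("/")
    return drive + "\\" + rest.replace("/", "\\")
```

With the proposed kwarg, `windows_to_linux` would be passed to finalize() and `linux_to_windows` to get() on Windows machines; the Linux agents would need no renamer at all.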

Related: #747

ainoam commented 4 months ago

@nfzd I think this is exactly the scenario for which path substitution was introduced, is it not?

It should simply be configured for each consumer for which the original registered URL is inadequate.
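
For readers unfamiliar with the idea, per-consumer path substitution can be sketched roughly as follows. This is not ClearML's implementation, only an illustration of the concept: each consumer keeps its own list of (registered prefix, local prefix) pairs, and any registered URL starting with a known prefix is rewritten before it is opened. The function name `substitute_prefix` and the rule values are hypothetical. To the best of my knowledge the actual setting lives in clearml.conf under `sdk.storage.path_substitution`, with `registered_prefix` and `local_prefix` entries, but check the ClearML configuration docs for the exact form.

```python
from typing import List, Tuple

# Hypothetical rules for a Windows consumer: (registered_prefix, local_prefix).
WINDOWS_RULES: List[Tuple[str, str]] = [("/mnt/data", "Z:")]


def substitute_prefix(url: str, rules: List[Tuple[str, str]]) -> str:
    """Rewrite a registered URL using the first matching prefix rule."""
    for registered, local in rules:
        base = registered.rstrip("/")
        # Match whole path components only, so /mnt/database is not affected.
        if url == base or url.startswith(base + "/"):
            return local.rstrip("/") + url[len(base):]
    return url  # no rule applies, use the registered URL as-is
```

In this model the Linux agents would have an empty rule list and use the registered /mnt/data paths directly, while Windows users would configure the single rule shown above.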

nfzd commented 4 months ago

@ainoam Ah, nice. I was not aware of that.