allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.67k stars 653 forks source link

Relative path is incorrect when using `Dataset.add_external_files` for local file without the `file://` prefix #1313

Open geometrikal opened 2 months ago

geometrikal commented 2 months ago

Describe the bug

Dataset.add_external_file will allow adding external files from a source url without the file:// prefix, e.g. /path/to/my/file.csv. It will create the link correctly in the metadat.json by prepending file:// however it will remove characters from the file name when creating the relative link.

To reproduce

dataset = Dataset.create(
        dataset_project="tes",
        dataset_name="test"
)
dataset.add_external_files("/mnt/f/0123456789.txt")
dataset.finalize(auto_upload=True)
print(dataset.list_files())
# ['789.txt']

Expected behaviour

Either throw an exception or correctly create the relative path

Environment

eugen-ajechiloae-clearml commented 2 months ago

Hi @geometrikal ! Thanks for reporting this. We will get back to you once this is fixed.

d-vignesh commented 2 months ago

Hi @eugen-ajechiloae-clearml i would like to work on this. Can i pick it up. I am new to the repo, can you please help with some guidance on what has to be done here.

eugen-ajechiloae-clearml commented 2 months ago

Hi @d-vignesh ! Yes, all contributions are welcome, and no-one picked this one up as far as I'm aware. It looks like when adding a file using add_external_files, the file name will be stripped of the first len("file://") characters. I I believe that we don't really support paths that are not prefixed by file:// at all, so, in this function: https://github.com/allegroai/clearml/blob/b1e39e68973182d66b43edc502b58d955b5bc779/clearml/datasets/dataset.py#L3322 I would add a check: if the file is a local path without file:// then get the absolute path and add the file:// prefix. This should be tested using absolute paths, relative paths, on both Windows and Linux/MacOS.

d-vignesh commented 2 months ago

Hi @eugen-ajechiloae-clearml can be please clarify on below doubt, when a source_url without file:// prefix is provided,

  1. the code defaults to file:// as base url and hence the metadata is having the expected name value
  2. during constructing the relative path
    link = remote_object.get("name")
    relative_path = link[len(source_url):]
    if not relative_path:
     relative_path = source_url.split("/")[-1]

    for example, if the source_url is /Users/docs/file1.txt, the link value will be file://Users/docs/file1.txt. Hence the relative path will be assigned as le1.txt in line 2. Instead if we use relative_path = source_url.split("/")[-1] always we should be getting the correct relative path. What scenario is the line relative_path = link[len(source_url):] handling in this case?

    image
eugen-ajechiloae-clearml commented 1 month ago

Hi @d-vignesh ! There are 2 cases when adding external files: you can either add a "directory" or a direct path. In case the source URL is the direct path, we fallback to relative_path = source_url.split("/")[-1]

d-vignesh commented 1 month ago

thank you @eugen-ajechiloae-clearml , got it. Please verify my understanding, that user can give both absolute path /User/project/file1.txt or relative path ./file1.txt if in project directory. In relative path case we need to get the absolute path as /User/project/file1.txt . Then finally append the file:// prefix. This modified path will then be used as the source_url.

d-vignesh commented 1 month ago

Assuming the above understanding is right, the below logic is working for all scenarios

source_url = '../data/file2.txt'
parsed = urlparse(source_url)
if parsed.scheme == "":
    abs_path = os.path.abspath(source_url)
    source_url = f"file://{abs_path}"
print(source_url)

Adding this check at the beginning of the add_external_file function seems to resolve the issue. Am i missing any case?

image
d-vignesh commented 1 month ago

Hi @eugen-ajechiloae-clearml , created a PR for the fix https://github.com/allegroai/clearml/pull/1326. Can you please have a look at it.