allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.71k stars 657 forks source link

Bug: PermissionError: [WinError 5] while uploading dataset #1348

Open Octoslav opened 1 week ago

Octoslav commented 1 week ago

Describe the Bug

When uploading a dataset using clearml-data sync on Windows, the following error intermittently occurs: PermissionError: [WinError 5] Access is denied.

Investigation

The error is triggered on the following line in the code:
https://github.com/allegroai/clearml/blob/master/clearml/binding/artifacts.py#L1141C18-L1141C19.

This issue is unstable and difficult to reproduce consistently. However, it becomes a significant problem when working with large datasets. For instance, my team faced this error while attempting to upload a dataset containing 700 000 items. The process took approximately 5 hours to compute all file hashes, only for the WinError 5 to unexpectedly appear.

Steps to Reproduce

The following script simulates the aforementioned line of code and demonstrates the issue:

import os
from tempfile import mkstemp
from tqdm import trange, tqdm
from shutil import move

for i in trange(100000):
    fd, local_filename = mkstemp()
    os.close(fd)
    fd, temp_filename = mkstemp()
    os.close(fd)

    try:
        os.replace(local_filename, temp_filename)
    except PermissionError:
        tqdm.write("PermissionError")

    os.remove(temp_filename)

When running this script on multiple Windows machines, I observed about 6 PermissionError exceptions over 100 000 iterations.
Running the script as administrator does not resolve the issue.

Possible Solution

I suggest replacing the use of os.replace(local_filename, temp_filename) with shutil.move(local_filename, temp_filename). Both methods achieve the same functionality, but shutil.move does not raise the PermissionError on Windows.

jkhenning commented 1 week ago

@Octoslav are you saying that os.replace() raised by error by mistake? Or was there a concrete reason for it?

Octoslav commented 1 week ago

@Octoslav are you saying that os.replace() raised by error by mistake? Or was there a concrete reason for it?

As far as I know the race condition can occur: os.close() is executed, but Windows still keeps the file open for a short time in some cases. So os.replace(local_filename, temp_filename) throws an error.

shutil.move is a high level function and can handle such a case.

Octoslav commented 1 week ago

@jkhenning Hi!

Do you have any thoughts? The error is quite annoying for windows users in my team.

jkhenning commented 1 week ago

The solution seems OK to me, I'm just not comfortable with hiding errors in general 😄 Does this only happen in Windows? If so, I would prefer to have this behavior specific to windows, while keeping the previous behavior for other platforms

Octoslav commented 1 week ago

@jkhenning

Does this only happen in Windows?

Didn't managed to reproduced it on linux machines

I'm just not comfortable with hiding errors in general

I didn't hide the error, I just used an another instrument that works better on Windows

jkhenning commented 1 week ago

Any idea on the implications of this change in linux or other platforms?

Octoslav commented 3 days ago

@jkhenning

shutil.move is multi-platformand works well everywhere