Closed: rishabh-ranjan closed this 6 months ago
Hi @rishabh-ranjan. Thanks for opening this Issue.
I understand the issue you're facing and I see how enabling lockfiles could solve your problem. Nonetheless, considering that supporting async runs of Pooch is somewhat out of scope for the project, I'm hesitant to implement lockfiles unless there's no better alternative.
Would it be possible for you to download the given file before distributing the workload in different processes?
For example, instead of having something like:
```python
import pooch
from multiprocessing import Pool

with Pool() as p:
    file_path = pooch.retrieve(
        url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
        known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
        path="my_dir",
    )
    # Perform tasks with the file
    ...
```
Could you rewrite it to something like this?
```python
import pooch
from multiprocessing import Pool

file_path = pooch.retrieve(
    url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
    path="my_dir",
)

with Pool() as p:
    # Perform tasks with the file
    ...
```
Please don't take my comment the wrong way. I'm open to discussing this and implementing lockfiles if there's a very compelling reason to do so. Looking forward to your reply.
Hi @santisoler, thanks for the reply. Downloading the file beforehand is definitely a valid solution. However, I feel that having lockfiles would make things simpler and more robust, especially given that pooch abstracts out most of the other data downloading/caching logic by itself.
In my use case, I am actually not using Python multiprocessing, but launching multiple scripts in parallel, for example to do a hyperparameter sweep:
```bash
for lr in 0.1 0.01 0.001
do
    python train.py --dataset some_dataset --lr $lr &
done
```
If I start with the dataset not in cache, then this runs into the issue described above.
I see. Could you provide a minimal example that reproduces the issue you face? Are you getting a particular error? Is the code just taking longer than it should?
> However, I feel that having lockfiles would make things simpler and more robust, especially given that pooch abstracts out most of the other data downloading/caching logic by itself.
I spent some time considering how to implement this in Pooch and it seems more complicated than I initially thought. Pooch downloads the desired file to a temporary file (a different one for each process), checks the hash, and then moves it to the final location (specified by the user through the `path` argument). Storing the lockfile in a temp directory won't solve the issue, since each process would be watching a different tmpdir. So the lockfile would have to be created in the final download location, where every process can see it.

Before downloading or returning the file, Pooch decides which action to take: download the file (if it doesn't exist), update it (if the file exists but the provided hash differs from that of the cached file), or just fetch it (return the path to the cached file). This happens at a higher level (right in the `pooch.retrieve` function and in the `Pooch.fetch` method). Since the decision to download or fetch depends on whether the lockfile has been released, the lockfile would have to be acquired at this higher level.

This introduces some complexities: if a process runs into an error before releasing the lockfile, the lockfile will remain, and any future call to the same `pooch.retrieve` will wait forever for it to be released. Users would then have to remove the file manually. And a lot of errors could happen before the lockfile is released at this higher level, from connection timeouts to other kinds of failures.
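To make that failure mode concrete, here is a minimal sketch of the kind of naive lockfile primitive this would need. This is assumed code, not anything in Pooch; `lockpath` is a hypothetical lock placed next to the downloaded file:

```python
import os
import time

def acquire(lockpath, poll=0.1):
    """Naively acquire a lockfile: O_EXCL guarantees a single creator."""
    while True:
        try:
            fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            # If the lock owner crashed before calling release(), every
            # other process spins here forever until someone deletes the
            # lockfile by hand.
            time.sleep(poll)

def release(lockpath):
    os.remove(lockpath)
```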
Moreover, we should also consider locking files for postprocessors. For example, using an `Unzip` processor to decompress a downloaded .zip file will also run into a race condition if it's run in parallel.
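For illustration, a call like the following would be affected. The `.zip` URL here is a made-up placeholder, but `processor=pooch.Unzip()` is how a postprocessor is attached:

```python
import pooch

# Hypothetical archive URL; Unzip extracts after the download finishes,
# so parallel runs could also race on the extracted files.
paths = pooch.retrieve(
    url="https://example.com/data/archive.zip",  # placeholder URL
    known_hash=None,
    path="my_dir",
    processor=pooch.Unzip(),  # returns the list of extracted file paths
)
```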
Now that I've taken a closer look at this, I think the implementation won't be simple, and it would probably make the codebase less robust, introducing corner cases that could lead to bugs that don't exist at the moment.
So I'm wondering whether precaching the desired file before distributing your workload is actually the simpler and more robust solution. I know it's not 100% pretty, but something like this would do the trick:
url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt"
known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e"
path="my_dir"
python -c "import pooch; pooch.retrieve(url=\"$url\", known_hash=\"$known_hash\", path=\"$path\")"
for lr in 0.1 0.01 0.001
do
python train.py --dataset some_dataset --lr $lr &
done
What do you think?
Thank you for considering this in such detail. After reading through your response, I agree with you that the complexities this introduces are better avoided. For my use case, I am happy to download all required files beforehand.
I don't exactly remember what error I ran into, but my guess was that it was caused by race conditions between different processes downloading the same file. Since we agree that this is better left unimplemented, I am taking the liberty of not investigating further to provide a minimal reproducible example.
From what you described, it seems like pooch should be fine even if multiple processes try to download the same file, because the downloads happen in independent directories, and the move to the final destination would be atomic on most filesystems. So at least there should be no race conditions, unless I am misunderstanding something or there are subtleties of post-processors like `unzip` that I am ignoring.
> I don't exactly remember what error I ran into, but my guess was that it was caused by race conditions between different processes downloading the same file. Since we agree that this is better left unimplemented, I am taking the liberty of not investigating further to provide a minimal reproducible example.
Sounds good. I would just say: feel free to provide one if at any point you run into an error after running downloads in parallel.
> From what you described, it seems like pooch should be fine even if multiple processes try to download the same file, because the downloads happen in independent directories, and the move to the final destination would be atomic on most filesystems. So at least there should be no race conditions, unless I am misunderstanding something or there are subtleties of post-processors like `unzip` that I am ignoring.
A minor correction to my previous comment: the temporary files are downloaded in the same final directory, although each process downloads the content to its own temporary file. The conclusion is the same: the download process doesn't run into a race condition because the downloads go to different files. Nonetheless, the same file is downloaded once per process, which is not desirable. I haven't dug into what happens with the postprocessors yet.
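As a rough sketch of that behavior (my own simplified rendering, not Pooch's actual code):

```python
import os
import tempfile
import urllib.request

def download_atomic(url, final_path):
    # Each process writes to its own unique temporary file in the final
    # directory, so concurrent runs never clobber each other mid-write.
    dirname = os.path.dirname(final_path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as response:
        out.write(response.read())
    # Atomic replace (on the same filesystem), so readers see either the
    # old file or the complete new one, never a partial download.
    os.replace(tmp, final_path)
```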
I also came up with another solution to your problem that makes use of lockfiles: you can acquire a lockfile before calling the `pooch.retrieve` function inside your script. For example:
```python
# file: download.py
import pooch
import lockfile

# Acquire a lock shared by all processes before retrieving the file
lock = lockfile.LockFile(path="foo.lock")
with lock:
    file_path = pooch.retrieve(
        url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
        known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
        path="my_dir",
    )
```
> [!NOTE]
> Requires `lockfile` to be installed.
If you run this script in parallel with:
```bash
for _ in 1 2 3
do
    python download.py &
done
```
You'll see only one `Downloading data from ...` message, meaning only one process downloads the file, while the rest just wait for it to finish before running the `pooch.retrieve` function. Since the file will exist by then, the other processes will just fetch it.
With this solution you can keep your bash script pretty (no need to precache the file through that Python one-liner) and avoid downloading the same file repeatedly. Hope it helps!
Thanks, locking at the user-level outside of pooch sounds like a great idea!
One more note: `lockfile` is a deprecated project; use `filelock` instead: https://py-filelock.readthedocs.io
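For example, the `download.py` script above could be ported to `filelock` like this (same illustrative `foo.lock` name as in the earlier sketch):

```python
# file: download.py (using filelock instead of the deprecated lockfile)
import pooch
from filelock import FileLock

with FileLock("foo.lock"):
    file_path = pooch.retrieve(
        url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
        known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
        path="my_dir",
    )
```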
Description of the desired feature:
Imagine a scenario where we have two processes `P1` and `P2` running, which both call `pooch.retrieve("<some_url>", known_hash=None)`. This URL is not available in the cache. In my tests, it looks like this will lead to a race condition with two processes downloading to the same location.

Feature request: file locking to ensure that only one process downloads the file while the other processes wait for the download to complete and simply return it.
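A minimal sketch of that scenario (the `<some_url>` placeholder is kept from the report; substitute a real file URL to run it):

```python
from multiprocessing import Process

import pooch

def fetch():
    # Both processes race to download the same uncached URL
    pooch.retrieve("<some_url>", known_hash=None)

if __name__ == "__main__":
    processes = [Process(target=fetch) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```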
In my use case I am running many workers in parallel, and every worker downloads the same file to work on it. I want to ensure this doesn't lead to bugs when the file is not already downloaded and multiple processes request it.
Are you willing to help implement and maintain this feature?
I am not an expert on this stuff, but if it comes down to it, sure I can help maintain.