NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

FileExistsError due to race condition when creating builtin dataset directory #347

Closed lrebscher closed 4 years ago

lrebscher commented 4 years ago

Description

The creation of the dataset directory is not thread-safe and is subject to a race condition.

This can sometimes result in the following uncaught exception: FileExistsError: [Errno 17] File exists: '/root/.surprise_data/'.

The affected method is shown below. It first checks whether the folder exists and creates it only if it does not. Because the check and the creation are two separate steps, another thread can create the directory in between, and os.makedirs(folder) then fails because the directory already exists.

File "/usr/local/lib/python3.7/site-packages/surprise/builtin_datasets.py", line 23, in get_dataset_dir

def get_dataset_dir():
    '''Return folder where downloaded datasets and other data are stored.
    Default folder is ~/.surprise_data/, but it can also be set by the
    environment variable ``SURPRISE_DATA_FOLDER``.
    '''

    folder = os.environ.get('SURPRISE_DATA_FOLDER', os.path.expanduser('~') +
                            '/.surprise_data/')
    if not os.path.exists(folder):
        os.makedirs(folder)

    return folder

This problem and two possible solutions for it are described in https://stackoverflow.com/a/42545343
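For illustration, here is a minimal sketch of one common way to close the gap between the check and the creation. It assumes Python 3.2 or later, where os.makedirs accepts exist_ok. It is only a sketch of the idea, not necessarily the change that ends up in the library.

```python
import os


def get_dataset_dir():
    '''Return folder where downloaded datasets and other data are stored.

    Sketch of a race-free variant (an assumption, not the library's code).
    '''
    folder = os.environ.get('SURPRISE_DATA_FOLDER',
                            os.path.expanduser('~') + '/.surprise_data/')
    # exist_ok=True makes this a no-op when the directory already exists,
    # so no separate os.path.exists() check is needed.
    os.makedirs(folder, exist_ok=True)
    return folder
```

On interpreters without exist_ok, the alternative is to call os.makedirs(folder) inside a try block and ignore the error only when the path turns out to be an existing directory.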

This error has been observed when using the library in an application served by gunicorn with multiple gthreads.

Steps/Code to Reproduce

TODO: provide minimal code example.

Expected Results

If the builtin dataset directory already exists, its creation is skipped, or the resulting FileExistsError is caught and ignored.

Actual Results

In a setup with multiple threads the library might throw a FileExistsError due to a race condition.
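The interleaving behind this error can be forced deterministically. The sketch below is not the library's code: racy_makedirs and demo_race are hypothetical helpers, and the threading.Barrier is added only to make both threads finish the existence check before either one creates the directory.

```python
import os
import tempfile
import threading


def racy_makedirs(folder, barrier, errors):
    """Check-then-create, with a barrier forcing the unlucky interleaving."""
    if not os.path.exists(folder):
        # Both threads wait here, so both have already seen the folder
        # as missing before either one creates it.
        barrier.wait()
        try:
            os.makedirs(folder)
        except FileExistsError as exc:
            # Record the error instead of letting the thread die silently.
            errors.append(exc)


def demo_race():
    """Run two threads and return the FileExistsError instances raised."""
    errors = []
    barrier = threading.Barrier(2)
    with tempfile.TemporaryDirectory() as base:
        folder = os.path.join(base, 'surprise_data')
        threads = [
            threading.Thread(target=racy_makedirs,
                             args=(folder, barrier, errors))
            for _ in range(2)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return errors
```

Calling demo_race() should return a list with a single FileExistsError: one thread creates the directory and the other fails on the same path. Without the barrier, the same outcome depends on scheduling and only shows up occasionally.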

Versions

In [1]: import platform; print(platform.platform())
Darwin-19.4.0-x86_64-i386-64bit

In [2]: import sys; print("Python", sys.version)
Python 3.7.7 (default, Mar 10 2020, 15:43:33)
[Clang 11.0.0 (clang-1100.0.33.17)]

In [3]: import surprise; print("surprise", surprise.__version__)
surprise 1.1.0

lrebscher commented 4 years ago

I will provide a minimal example in the next few days. However, since it is a race condition, it won't reproduce on every run.

I'm happy to contribute to this library by fixing this bug once it is accepted. Thank you!

NicolasHug commented 4 years ago

thanks for the report, sounds legit. I'm happy to review a PR!

NicolasHug commented 4 years ago

Should be fixed by #359

Thanks!