BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0

[BUG] Log Directory creation causes error (unless it exists already) #1560

Open jglaser opened 3 years ago

jglaser commented 3 years ago

Describe the bug

Running on 90 workers, I get the following error:

Could not create directory: /gpfs/alpine/proj-shared/gen119/bsql_shared/logs_ucx_1060213[Errno 17] File exists: '/gpfs/alpine/proj-shared/gen119/bsql_shared/logs_ucx_1060213'
distributed.worker - WARNING - Compute Failed
Function:  initialize_server_directory
args:      ('/gpfs/alpine/proj-shared/gen119/bsql_shared/logs_ucx_1060213', True)
kwargs:    {}
Exception: FileExistsError(17, 'File exists')

The directory /gpfs/alpine/proj-shared/gen119/bsql_shared/logs_ucx_1060213 did not exist prior to launching the job.

Steps/Code to reproduce bug

Launch BlazingSQL on enough workers to trigger the race condition. Set LOG to the directory above (make sure it doesn't exist yet), then set the following environment variables:

export BLAZING_LOGGING_DIRECTORY=${LOG}
export BLAZING_LOCAL_LOGGING_DIRECTORY=${LOG}
export BSQL_BLAZING_LOGGING_DIRECTORY=${LOG}
export BSQL_BLAZING_LOCAL_LOGGING_DIRECTORY=${LOG}
export ENABLE_COMMS_LOGS=False
export BSQL_ENABLE_COMMS_LOGS=False
export BSQL_ENABLE_TASK_LOGS=True
export BSQL_ENABLE_OTHER_ENGINE_LOGS=True
export RMM_DEBUG_LOG_FILE=${LOG}/rmm_log.txt
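The check-then-create race can also be reproduced without a cluster. The following is a minimal sketch, not BlazingSQL code: threads stand in for Dask workers, the function names racy_mkdir and run_race are hypothetical, and the threading.Barrier is an artificial device that forces the interleaving which only happens by chance on a real job.

```python
import os
import tempfile
import threading


def racy_mkdir(dir_path, barrier, errors):
    """Check-then-create, the same pattern as the suspected code."""
    if not os.path.exists(dir_path):
        # Hold each thread here until all of them have passed the check.
        # On a cluster this overlap happens by chance; the barrier forces it.
        barrier.wait()
        try:
            os.mkdir(dir_path)
        except FileExistsError as error:
            # Every thread except the first one to reach mkdir ends up here.
            errors.append(error)


def run_race(n_workers=8):
    """Run n_workers threads against one missing directory; return the failure count."""
    target = os.path.join(tempfile.mkdtemp(), "logs")
    barrier = threading.Barrier(n_workers)
    errors = []
    threads = [
        threading.Thread(target=racy_mkdir, args=(target, barrier, errors))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(errors)


if __name__ == "__main__":
    print(f"{run_race(8)} of 8 workers hit FileExistsError")
```

Because every thread passes the existence check before any of them calls os.mkdir, exactly one creation succeeds and the rest raise FileExistsError, which matches the error reported above.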

Expected behavior

The directory should be silently created if it doesn't exist yet.

Environment details

Please run and paste the output of the print_env.sh script here, to gather any other relevant environment details

Additional context

Suspected source of the issue

In pyblazing/apiv2/context.py:

def initialize_server_directory(dir_path, is_dask):
    if not os.path.exists(dir_path):
        try:
            os.mkdir(dir_path)
        except OSError as error:
            get_blazing_logger(is_dask).error(
                f"Could not create directory: {dir_path}" + str(error)
            )
            raise
        return True
    else:
        return True

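This should call os.mkdir unconditionally, intercept the FileExistsError, and then silently return. The os.path.exists check is what causes the race condition: several workers can see the directory as missing at the same moment, and every worker except the first one then fails in os.mkdir.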
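One way the suggested change could look is sketched below. This is an illustration under stated assumptions, not the project's actual patch: the standard logging module stands in for get_blazing_logger(is_dask), and the extra os.path.isdir check (so that a regular file at the same path still raises) is an addition that the issue does not ask for.

```python
import logging
import os

# Assumption: a plain stdlib logger stands in for get_blazing_logger(is_dask).
logger = logging.getLogger("blazing_sketch")


def initialize_server_directory(dir_path, is_dask):
    """Create dir_path, tolerating another worker creating it first.

    is_dask is kept only for signature parity with the original function;
    this sketch does not use it.
    """
    try:
        # No prior os.path.exists check: mkdir either creates the directory
        # or fails atomically, so there is no window for a race.
        os.mkdir(dir_path)
    except FileExistsError:
        # Another worker won the race. That is fine as long as the path
        # really is a directory and not a regular file.
        if not os.path.isdir(dir_path):
            logger.error(f"Path exists but is not a directory: {dir_path}")
            raise
    except OSError as error:
        # Any other failure (permissions, missing parent, ...) is still fatal.
        logger.error(f"Could not create directory: {dir_path} {error}")
        raise
    return True
```

An alternative with similar effect is os.makedirs(dir_path, exist_ok=True), which also creates missing parent directories; whether that extra behavior is wanted here is a design choice for the maintainers.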