Closed — Habeeb556 closed this 2 months ago
To resolve the issue where asynchronous (ASYNC) queries get stuck in the PENDING status, ensure the following:

Cache Configuration: Verify that the cache backends are properly configured and non-null. The `init_app` method in the `AsyncQueryManager` class raises an exception if the cache backends are not configured correctly:
```python
def init_app(self, app: Flask) -> None:
    config = app.config
    if (
        config["CACHE_CONFIG"]["CACHE_TYPE"] == "null"
        or config["DATA_CACHE_CONFIG"]["CACHE_TYPE"] == "null"
    ):
        raise Exception(
            """
            Cache backends (CACHE_CONFIG, DATA_CACHE_CONFIG) must be configured
            and non-null in order to enable async queries
            """
        )
```
Global Async Queries Configuration: Ensure that the `GLOBAL_ASYNC_QUERIES` feature is enabled and that the transport method for async queries is correctly configured. The transport method can be either `polling` or `ws` (WebSocket). Here is an example configuration:
```js
const config = {
  GLOBAL_ASYNC_QUERIES_TRANSPORT: 'polling', // or 'ws'
  GLOBAL_ASYNC_QUERIES_POLLING_DELAY: 50, // polling delay in milliseconds
  GLOBAL_ASYNC_QUERIES_WEBSOCKET_URL: 'ws://127.0.0.1:8080/', // websocket URL if using 'ws' transport
};
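On the backend, the equivalent settings live in `superset_config.py` as plain Python. A minimal sketch (the setting names follow Superset's `config.py`; the values are examples to adjust for your deployment):

```python
# superset_config.py -- minimal sketch of the async-query settings
# (values are illustrative; adjust to your deployment)
FEATURE_FLAGS = {
    "GLOBAL_ASYNC_QUERIES": True,
}

GLOBAL_ASYNC_QUERIES_TRANSPORT = "polling"  # or "ws"
GLOBAL_ASYNC_QUERIES_POLLING_DELAY = 500  # milliseconds between status polls
GLOBAL_ASYNC_QUERIES_WEBSOCKET_URL = "ws://127.0.0.1:8080/"  # used only with "ws"
# Shared secret for signing async-query JWTs; must be at least 32 bytes
GLOBAL_ASYNC_QUERIES_JWT_SECRET = "change-me-to-a-32-byte-secret!!!"
```

If the JWT secret is shorter than 32 bytes, Superset refuses to enable the feature, which is another way async queries can silently fail to work.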
Celery Workers: Ensure that the Celery workers are properly configured and running. The `execute` method in the `ASynchronousSqlJsonExecutor` class handles the task submission to the Celery workers:
```python
class ASynchronousSqlJsonExecutor(SqlJsonExecutorBase):
    def execute(
        self,
        execution_context: SqlJsonExecutionContext,
        rendered_query: str,
        log_params: dict[str, Any] | None,
    ) -> SqlJsonExecutionStatus:
        query_id = execution_context.query.id
        logger.info("Query %i: Running query on a Celery worker", query_id)
        try:
            task = self._get_sql_results_task.delay(
                query_id,
                rendered_query,
                return_results=False,
                store_results=not execution_context.select_as_cta,
                username=get_username(),
                start_time=now_as_float(),
                expand_data=execution_context.expand_data,
                log_params=log_params,
            )
            try:
                task.forget()
            except NotImplementedError:
                logger.warning(
                    "Unable to forget Celery task as backend "
                    "does not support this operation"
                )
        except Exception as ex:
            logger.exception("Query %i: %s", query_id, str(ex))
            message = __("Failed to start remote query on a worker.")
            error = SupersetError(
                message=message,
                error_type=SupersetErrorType.ASYNC_WORKERS_ERROR,
                level=ErrorLevel.ERROR,
            )
            error_payload = dataclasses.asdict(error)
            query = execution_context.query
            query.set_extra_json_key("errors", [error_payload])
            query.status = QueryStatus.FAILED
            query.error_message = message
            raise SupersetErrorException(error) from ex
        self._query_dao.update_saved_query_exec_info(query_id)
        return SqlJsonExecutionStatus.QUERY_IS_RUNNING
```
Ensure that the Celery broker and backend configurations are correct and that the workers can communicate with the main application [1][2][3].
To continue talking to Dosu, mention @dosu.
@dosu
Yes, configured as below:
```python
# Redis Caching
CACHE_NO_NULL_WARNING = True
CACHE_CONFIG = {
    'CACHE_TYPE': 'redis',
    'CACHE_DEFAULT_TIMEOUT': 86400,
    'CACHE_KEY_PREFIX': 'superset_results',
    'CACHE_REDIS_HOST': 'localhost',
    'CACHE_REDIS_PORT': 6379,
    'CACHE_REDIS_DB': 1,
    'CACHE_REDIS_URL': 'redis://localhost:6379/1'
}
DATA_CACHE_CONFIG = {
    'CACHE_TYPE': 'redis',
    'CACHE_DEFAULT_TIMEOUT': 60 * 60 * 24,  # 1 day default (in secs)
    'CACHE_KEY_PREFIX': 'superset_results',
    'CACHE_REDIS_URL': 'redis://localhost:6379/0',
}

# Long running query handling using Celery workers
class CeleryConfig(object):
    BROKER_URL = 'redis://localhost:6379/0'
    CELERY_IMPORTS = (
        'superset.sql_lab',
        'superset.tasks',
    )
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
    CELERYD_LOG_LEVEL = 'DEBUG'
    CELERYD_PREFETCH_MULTIPLIER = 10
    CELERY_ACKS_LATE = True
    CELERY_ANNOTATIONS = {
        'sql_lab.get_sql_results': {
            'rate_limit': '100/s',
        },
        'email_reports.send': {
            'rate_limit': '1/s',
            'time_limit': 120,
            'soft_time_limit': 150,
            'ignore_result': True,
        },
    }
    # CELERYBEAT_SCHEDULE = {
    #     'email_reports.schedule_hourly': {
    #         'task': 'email_reports.schedule_hourly',
    #         'schedule': crontab(minute=1, hour='*'),
    #     },
    # }

CELERY_CONFIG = CeleryConfig

# Persisting results from running query handling using Celery workers
from cachelib.redis import RedisCache
RESULTS_BACKEND = RedisCache(host='localhost', port=6379, key_prefix='superset_results')

# Disable MessagePack and PyArrow for results serialization
RESULTS_BACKEND_USE_MSGPACK = False
```
Your configuration for Redis caching and Celery workers looks correct. Here are a few additional steps to ensure everything is set up properly:
Verify Redis Server: Ensure that the Redis server is running and accessible. You can check its status by running:

```shell
redis-cli ping
```

The response should be `PONG`.
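If `redis-cli` is not available where the workers run, a quick TCP-level check from the same Python environment serves the same purpose (host and port below are the defaults from the configuration in this thread; a connection succeeding does not prove Redis auth works, only that the port is reachable):

```python
import socket

def can_reach_redis(host="localhost", port=6379, timeout=2.0):
    """Return True if a TCP connection to the Redis port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

if __name__ == "__main__":
    print("Redis reachable:", can_reach_redis())
```

Run this on the same host (and inside the same virtualenv) as the Celery workers, since the workers may see a different network than your shell.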
Check Celery Worker Logs: Look for any errors or warnings in the Celery worker logs. These logs can provide insights into why the queries might be getting stuck in the PENDING status. You can start a Celery worker with the following command and monitor the logs:
```shell
celery --app=superset.tasks.celery_app:app worker --loglevel=info
```
Celery Beat: Ensure that Celery Beat is running if you have scheduled tasks. Start it with:
```shell
celery --app=superset.tasks.celery_app:app beat --loglevel=info
```
Async Query Execution: Ensure that the `ASynchronousSqlJsonExecutor` class is properly handling the execution of async queries. Its `execute` method should start the query on a Celery worker and handle any exceptions that occur [1].
Frontend Handling: Ensure that the frontend is correctly handling the async events. The `asyncEvent.ts` file in the Superset frontend middleware manages async events, including the polling and WebSocket connections that track async query statuses [2].
Known Issues: There is a known issue in the `superset-helm-chart-0.12.11` version that could affect the execution of asynchronous queries. It has been addressed in a fix, so ensure you are using the latest version or have applied the necessary patches [3].
By following these steps, you should be able to identify and resolve the issue causing asynchronous queries to get stuck in the PENDING status.
@Habeeb556 just curious what wound up working, but glad it's resolved! :D
@rusackas, I forgot to restart the service after upgrading the configurations. ^^
@dosu I encountered this error with a pending status in Celery:

```
error sending request for url (https://github.com/mozilla/geckodriver/releases/latest): operation timed out
Traceback (most recent call last):
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/screenshots.py", line 170, in compute_and_cache
    payload = self.get_screenshot(user=user, window_size=window_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/screenshots.py", line 93, in get_screenshot
    self.screenshot = driver.get_screenshot(self.url, self.element, user)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/webdriver.py", line 354, in get_screenshot
    driver = self.auth(user)
             ^^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/webdriver.py", line 276, in auth
    driver = self.create()
             ^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/webdriver.py", line 273, in create
    return driver_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/firefox/webdriver.py", line 195, in __init__
    self.service.path = DriverFinder.get_path(self.service, options)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/driver_finder.py", line 43, in get_path
    raise err
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/driver_finder.py", line 40, in get_path
    path = shutil.which(service.path) or SeleniumManager().driver_location(options)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/selenium_manager.py", line 91, in driver_location
    result = self.run(args)
             ^^^^^^^^^^^^^^
  File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/selenium_manager.py", line 112, in run
    raise SeleniumManagerException(f"Selenium Manager failed for: {command}.\n{result}{stderr}")
selenium.common.exceptions.SeleniumManagerException: Message: Selenium Manager failed for: /swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/linux/selenium-manager --browser firefox --output json.
error sending request for url (https://github.com/mozilla/geckodriver/releases/latest): operation timed out
```
To resolve the timeout error when fetching the latest geckodriver release from GitHub, you can increase the timeout duration for the `wget` command. Modify the `wget` command in your Dockerfile to include the `--timeout` option with a higher value.

Here is the modified section of the Dockerfile:
```dockerfile
# Install GeckoDriver WebDriver
ARG GECKODRIVER_VERSION=v0.34.0 \
    FIREFOX_VERSION=125.0.3
RUN apt-get update -qq \
    && apt-get install -yqq --no-install-recommends wget bzip2 \
    && wget --timeout=60 -q https://github.com/mozilla/geckodriver/releases/download/${GECKODRIVER_VERSION}/geckodriver-${GECKODRIVER_VERSION}-linux64.tar.gz -O - | tar xfz - -C /usr/local/bin \
    # Install Firefox
    && wget --timeout=60 -q https://download-installer.cdn.mozilla.net/pub/firefox/releases/${FIREFOX_VERSION}/linux-x86_64/en-US/firefox-${FIREFOX_VERSION}.tar.bz2 -O - | tar xfj - -C /opt \
    && ln -s /opt/firefox/firefox /usr/local/bin/firefox \
    && apt-get autoremove -yqq --purge wget bzip2 && rm -rf /var/[log,tmp]/* /tmp/* /var/lib/apt/lists/*
```
This sets the timeout for `wget` to 60 seconds, which should help mitigate timeout issues. Adjust the timeout value as needed based on your network conditions [1].
@dosu I am not using a docker
To resolve the timeout issue when fetching the latest geckodriver release from GitHub via Celery, you can try increasing the screenshot timeout settings in your Superset configuration. Specifically, adjust the `SCREENSHOT_LOCATE_WAIT` and `SCREENSHOT_LOAD_WAIT` settings in your `superset_config.py` file.

Here is an example configuration:
```python
SCREENSHOT_LOCATE_WAIT = 200   # increase the wait time for locating elements
SCREENSHOT_LOAD_WAIT = 1200    # increase the wait time for loading the page
```
These settings give the WebDriver more time to locate elements and load the page, which may help with timeout issues [1].
@dosu still the same issue with the below:

```
[WARNING/ForkPoolWorker-3] Applicable driver not found; attempting to install with Selenium Manager (Beta)
```
++ @dosu I also found the below:

```
[WARNING/ForkPoolWorker-3] Unable to obtain driver using Selenium Manager: Selenium Manager failed for: /swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/linux/selenium-manager --browser firefox --output json.
error sending request for url (https://github.com/mozilla/geckodriver/releases/latest): operation timed out
Failed at generating thumbnail Message: Selenium Manager failed for: /swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/selenium/webdriver/common/linux/selenium-manager --browser firefox --output json.
```
To resolve the issue with Selenium Manager failing to obtain the geckodriver, you can manually download and install the geckodriver and ensure that it is accessible in your system's PATH. Here are the steps:
1. Download Geckodriver: Download the appropriate version of geckodriver for your system from the official geckodriver releases page.

   ```shell
   wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
   ```

2. Extract the Geckodriver: Extract the downloaded tar file.

   ```shell
   tar -xvzf geckodriver-v0.34.0-linux64.tar.gz
   ```

3. Move Geckodriver to a Directory in PATH: Move the extracted geckodriver to a directory that is in your system's PATH, such as `/usr/local/bin`.

   ```shell
   sudo mv geckodriver /usr/local/bin/
   ```

4. Verify Installation: Verify that geckodriver is installed and accessible.

   ```shell
   geckodriver --version
   ```
5. Update Superset Configuration: Ensure that your Superset configuration is set to use the geckodriver. Update your `superset_config.py` to specify the path to the geckodriver if necessary.

   ```python
   from selenium import webdriver

   options = webdriver.FirefoxOptions()
   options.headless = True  # run in headless mode if needed
   # Note: executable_path is deprecated in Selenium 4 and removed in 4.10+;
   # newer versions pass the path via selenium.webdriver.firefox.service.Service
   driver = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver', options=options)
   ```
By manually installing the geckodriver and ensuring it is in your PATH, you should be able to avoid the timeout issue with Selenium Manager and successfully generate thumbnails using Celery in Superset [1].
@dosu how can I resolve this cache issue in Celery:

```
Failed at getting from cache: bdf9c4ac399103eaffbfa8b9322d6afc
INFO:superset.utils.screenshots:Failed at getting from cache: bdf9c4ac399103eaffbfa8b9322d6afc
```
To resolve the caching issue in Celery, ensure that your Redis configuration is correctly set up in `superset_config.py`. Here is a sample configuration that uses Redis for both Celery and caching:
```python
import os

from celery.schedules import crontab  # needed for beat_schedule below

REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_CELERY_DB = os.getenv("REDIS_CELERY_DB", "0")
REDIS_RESULTS_DB = os.getenv("REDIS_RESULTS_DB", "1")
REDIS_CACHE_DB = os.getenv("REDIS_CACHE_DB", "4")

CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CACHE_DB}",
}
DATA_CACHE_CONFIG = CACHE_CONFIG

class CeleryConfig:
    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
    imports = ("superset.sql_lab", "superset.tasks.thumbnails")
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}"
    worker_prefetch_multiplier = 1
    task_acks_late = False
    beat_schedule = {
        "reports.scheduler": {
            "task": "reports.scheduler",
            "schedule": crontab(minute="*", hour="*"),
        },
        "reports.prune_log": {
            "task": "reports.prune_log",
            "schedule": crontab(minute=10, hour=0),
        },
    }

CELERY_CONFIG = CeleryConfig
```
Ensure that the environment variables `REDIS_HOST`, `REDIS_PORT`, `REDIS_CELERY_DB`, `REDIS_RESULTS_DB`, and `REDIS_CACHE_DB` are correctly set in your environment. This configuration uses Redis for both the Celery result backend and the cache backend, which should resolve caching issues related to Celery [1][2].
Additionally, make sure you have the necessary dependencies installed for Redis caching:

```shell
pip install redis
```
If the issue persists, check the Celery worker logs for any specific error messages related to caching. These logs can provide more details about the problem [3].
NICE! RESOLVED
```python
import os

REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_CELERY_DB = os.getenv("REDIS_CELERY_DB", "2")
REDIS_RESULTS_DB = os.getenv("REDIS_RESULTS_DB", "3")
REDIS_CACHE_DB = os.getenv("REDIS_CACHE_DB", "4")

# Redis Caching
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 900,
    "CACHE_KEY_PREFIX": "superset_results",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CACHE_DB}",
}
DATA_CACHE_CONFIG = CACHE_CONFIG

# Long running query handling using Celery workers
class CeleryConfig(object):
    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
    imports = ("superset.sql_lab", "superset.tasks.thumbnails")
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}"
    CELERYD_LOG_LEVEL = 'DEBUG'
    CONCURRENCY = 1

CELERY_CONFIG = CeleryConfig

# Persisting results from running query handling using Celery workers
from cachelib.redis import RedisCache
RESULTS_BACKEND = RedisCache(host='localhost', port=6379, key_prefix='superset_results')
```
@dosu I got this error in the log:

```
PermissionError: [Errno 13] Permission denied: 'geckodriver.log'
```
To resolve the `PermissionError: [Errno 13] Permission denied: 'geckodriver.log'`, you need to ensure that the user running the WebDriver has permission to write to the directory where `geckodriver.log` is being created.

You can set the environment variable `GECKODRIVER_LOG` to a path where the user has write permissions. Here is an example of how to do this in your Python script:
```python
import os

from selenium import webdriver

# Point the geckodriver log at a writable location
os.environ['GECKODRIVER_LOG'] = '/path/to/logs/geckodriver.log'

options = webdriver.FirefoxOptions()
options.headless = True  # run in headless mode if needed
driver = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver', options=options)
```
Make sure to replace `/path/to/logs/` with a directory path where the user has write permissions. This ensures the `geckodriver.log` file can be created without any permission issues.
Resolved with this:
```python
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True  # run in headless mode if needed
driver = webdriver.Firefox(
    executable_path='/usr/local/bin/geckodriver',
    options=options,
    log_path='/swloc/log/geckodriver.log',
)
```
But I still have an issue with this error in the Celery log:

```
selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
```
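This last error is different from the earlier geckodriver problems: Selenium found the driver but cannot find the Firefox browser binary itself. A minimal check of where (if anywhere) Firefox is installed — a sketch only, and the candidate paths are assumptions based on common install locations such as the `/opt/firefox` path used in the Dockerfile earlier in this thread:

```python
import shutil

def find_firefox_binary(candidates=("firefox", "/opt/firefox/firefox", "/usr/bin/firefox")):
    """Return the first usable Firefox binary found, or None."""
    for candidate in candidates:
        path = shutil.which(candidate)  # also accepts absolute paths
        if path:
            return path
    return None

if __name__ == "__main__":
    print("Firefox binary:", find_firefox_binary())
```

If a binary exists but is not on the worker's `PATH`, either symlink it into a `PATH` directory (e.g. `sudo ln -s /opt/firefox/firefox /usr/local/bin/firefox`) or point Selenium at it explicitly via `options.binary_location` before creating the driver. Note the check must run as the same user as the Celery workers, since their `PATH` may differ from your login shell's.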
Bug description
I encountered the following issue: all ASYNC queries are PENDING.
The log shows that the queries are passed to the workers but do not return.
How to reproduce the bug
Enable asynchronous query execution for the database. When there is a heavy load, the queries get stuck in the PENDING status.
Screenshots/recordings
Superset version
4.0.1
Python version
3.11
Node version
I don't know
Browser
Chrome
Additional context
No response
Checklist