apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

Thumbnails work for Dashboards, not for charts #29298

Open brantian opened 2 weeks ago

brantian commented 2 weeks ago

Bug description

Hello!

I was able to get the Thumbnails feature working, but for some reason it works only for Dashboard thumbnails, not for Charts. I'm using Redis for the cache.

Based on the logs for the Web App and the Celery Worker, the two seem to generate different cache keys. When the thumbnail is requested, I can see that the screenshot is taken and cached successfully, but the cache_key the Celery worker uses to save the image in Redis is different from the cache_key the Web App uses to retrieve the screenshot:

When the thumbnail is requested by calling /api/v1/chart/93/thumbnail/8a61c658366e2896618441f3f43d34ec/, this is what I see in the Web App logs:

DEFAULT 2024-06-18T22:53:35.437998Z def cache_chart_thumbnail(current_user, chart_id, force=0, window_size=1, thumb_size=2):
DEFAULT 2024-06-18T22:53:35.437985Z 2024-06-18 22:53:35,438:DEBUG:celery.utils.functional:
DEFAULT 2024-06-18T22:53:35.437714Z 2024-06-18 22:53:35,438:INFO:superset.charts.api:Triggering thumbnail compute (chart id: 93) ASYNC
DEFAULT 2024-06-18T22:53:35.437690Z Triggering thumbnail compute (chart id: 93) ASYNC
DEFAULT 2024-06-18T22:53:35.437582Z 2024-06-18 22:53:35,437:INFO:superset.utils.screenshots:Failed at getting from cache: 9085680bf97354826f9065a4ffe54d4e
DEFAULT 2024-06-18T22:53:35.437565Z Failed at getting from cache: 9085680bf97354826f9065a4ffe54d4e
DEFAULT 2024-06-18T22:53:35.430995Z 2024-06-18 22:53:35,431:INFO:superset.utils.screenshots:Attempting to get from cache: 9085680bf97354826f9065a4ffe54d4e
DEFAULT 2024-06-18T22:53:35.430910Z Attempting to get from cache: 9085680bf97354826f9065a4ffe54d4e
INFO 2024-06-18T22:53:35.325710Z https://demo.report.supersettest.net/api/v1/chart/93/thumbnail/8a61c658366e2896618441f3f43d34ec/

And the Celery Worker logs:

DEFAULT 2024-06-18T22:54:00.850626319Z [2024-06-18 22:54:00,850: INFO/ForkPoolWorker-1] Done caching thumbnail
DEFAULT 2024-06-18T22:54:00.850514593Z Done caching thumbnail
DEFAULT 2024-06-18T22:54:00.849492264Z [2024-06-18 22:54:00,849: INFO/ForkPoolWorker-1] Caching thumbnail: aee5d4760af5e3d013fc8b1b125a739a
DEFAULT 2024-06-18T22:54:00.849440036Z Caching thumbnail: aee5d4760af5e3d013fc8b1b125a739a
DEFAULT 2024-06-18T22:53:40.765951Z 2024-06-18T22:53:40.764469127Z container exec_die e7c4c95247c811b18dc3ff911a37cb65e4e8a7efd3c198cfd966169d9e997532 (base=3.10-slim-bookworm, build_actor=mistercrunch, build_trigger=4.0.0, execID=b2e799079a5fc4e4c27931d96f4f8fbc393bed7cf6df2d52392886fcc7ebe481, exitCode=7, image=us-central1-docker.pkg.dev/supersettest-root/supersettest-docker/supersettest-superset-worker:0.0.17, name=klt--rxtz, sha=c35842e9f1c987d5e684969540fea3d8a5d03ad9, target=lean, version=4.0.0)
DEFAULT 2024-06-18T22:53:40.664255Z 2024-06-18T22:53:40.663795495Z container exec_start: /bin/sh -c curl -f "http://localhost:${SUPERSET_PORT}/health" e7c4c95247c811b18dc3ff911a37cb65e4e8a7efd3c198cfd966169d9e997532 (base=3.10-slim-bookworm, build_actor=mistercrunch, build_trigger=4.0.0, execID=b2e799079a5fc4e4c27931d96f4f8fbc393bed7cf6df2d52392886fcc7ebe481, image=us-central1-docker.pkg.dev/supersettest-root/supersettest-docker/supersettest-superset-worker:0.0.17, name=klt--rxtz, sha=c35842e9f1c987d5e684969540fea3d8a5d03ad9, target=lean, version=4.0.0)
DEFAULT 2024-06-18T22:53:40.664255Z 2024-06-18T22:53:40.663727676Z container exec_create: /bin/sh -c curl -f "http://localhost:${SUPERSET_PORT}/health" e7c4c95247c811b18dc3ff911a37cb65e4e8a7efd3c198cfd966169d9e997532 (base=3.10-slim-bookworm, build_actor=mistercrunch, build_trigger=4.0.0, execID=b2e799079a5fc4e4c27931d96f4f8fbc393bed7cf6df2d52392886fcc7ebe481, image=us-central1-docker.pkg.dev/supersettest-root/supersettest-docker/supersettest-superset-worker:0.0.17, name=klt--rxtz, sha=c35842e9f1c987d5e684969540fea3d8a5d03ad9, target=lean, version=4.0.0)
DEFAULT 2024-06-18T22:53:35.662998449Z [2024-06-18 22:53:35,662: INFO/ForkPoolWorker-1] Processing url for thumbnail: aee5d4760af5e3d013fc8b1b125a739a
DEFAULT 2024-06-18T22:53:35.662384695Z Processing url for thumbnail: aee5d4760af5e3d013fc8b1b125a739a
DEFAULT 2024-06-18T22:53:35.656695014Z [2024-06-18 22:53:35,656: INFO/ForkPoolWorker-1] Caching chart: https://demo.report.supersettest.net/superset/slice/93/
DEFAULT 2024-06-18T22:53:35.656493649Z Caching chart: https://demo.report.supersettest.net/superset/slice/93/

Notice that both the Web App and Celery are working with the thumbnail for chart 93, but the thumbnail keys are different. For dashboards, which work successfully, the keys are the same.
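
The mismatch can also be checked mechanically from the two log excerpts. This is a minimal sketch, assuming the log messages keep the wording shown above ("Attempting to get from cache:" on the web side, "Caching thumbnail:" on the worker side). The function names are hypothetical and are not part of Superset:

```python
import re

# Message prefixes copied from the log excerpts above.
WEB_PATTERN = re.compile(r"Attempting to get from cache: ([0-9a-f]{32})")
WORKER_PATTERN = re.compile(r"Caching thumbnail: ([0-9a-f]{32})")


def extract_keys(log_text, pattern):
    """Return the set of distinct cache keys matched by `pattern`."""
    return set(pattern.findall(log_text))


def keys_match(web_log, worker_log):
    """True when the web app looked up at least one key the worker wrote."""
    web_keys = extract_keys(web_log, WEB_PATTERN)
    worker_keys = extract_keys(worker_log, WORKER_PATTERN)
    return bool(web_keys & worker_keys)


web_log = (
    "INFO:superset.utils.screenshots:"
    "Attempting to get from cache: 9085680bf97354826f9065a4ffe54d4e"
)
worker_log = (
    "[2024-06-18 22:54:00,849: INFO/ForkPoolWorker-1] "
    "Caching thumbnail: aee5d4760af5e3d013fc8b1b125a739a"
)

print(keys_match(web_log, worker_log))  # False
```

Running the same check against the dashboard logs should return True if the keys really are identical there.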

How to reproduce the bug

#superset_config.py 

import os
import math
from superset.superset_typing import CacheConfig
from superset.tasks.types import ExecutorType
from flask_appbuilder.security.manager import AUTH_DB, AUTH_OAUTH
from custom_sso_security_manager import CustomSsoSecurityManager
from celery.schedules import crontab
from urllib.parse import urlparse
from datetime import timedelta

logoPath = os.getenv('LOGO_PATH')

LOG_LEVEL = 'DEBUG'

FLASK_APP="superset"
SECRET_KEY = os.getenv('SUPERSET_SECRET_KEY')
APP_NAME = "supersettest"
APP_ICON = logoPath or "/static/assets/images/supersettest-logo.svg"
FAVICONS = [{"href": "/static/assets/images/supersettest-favicon.png"}]

# Postgres Database Connection
DATABASE_USER = os.getenv('DATABASE_USER')
DATABASE_PASSWORD = os.getenv('DATABASE_PASSWORD')
DATABASE_HOST = os.getenv('DATABASE_HOST')
DATABASE_PORT = os.getenv('DATABASE_PORT')
DATABASE_DB_NAME = os.getenv('DATABASE_DB_NAME')

SQLALCHEMY_DATABASE_URI = f"postgresql+psycopg2://{DATABASE_USER}:{DATABASE_PASSWORD}@{DATABASE_HOST}:{DATABASE_PORT}/{DATABASE_DB_NAME}"
SQLALCHEMY_POOL_SIZE = 30
SQLALCHEMY_MAX_OVERFLOW = 30
SQLALCHEMY_POOL_TIMEOUT = 180

# Redis Connection
REDIS_HOST = os.environ.get("REDIS_HOST")
REDIS_PORT = os.environ.get("REDIS_PORT", '6379') 
REDIS_CELERY_DB = os.environ.get("REDIS_CELERY_DB", 2)  
REDIS_RESULTS_DB = os.environ.get("REDIS_RESULTS_DB", 3)  
REDIS_CACHE_DB = os.environ.get("REDIS_CACHE_DB", 4)
REDIS_RATELIMIT_DB = os.environ.get("REDIS_RATELIMIT_DB", 5)

# Security
ENABLE_PROXY_FIX = True

# Flask-WTF flag for CSRF
WTF_CSRF_ENABLED = True

# Add endpoints that need to be exempt from CSRF protection
WTF_CSRF_EXEMPT_LIST = [
    "superset.views.core.log",
    "superset.views.core.explore_json",
    "superset.charts.data.api.data",
]

TALISMAN_ENABLED=True

imgSrc = [ 
    "'self'", 
    "blob:", 
    "data:",
    "https://apachesuperset.gateway.scarf.sh",
    "https://static.scarf.sh/",
    ]

if (logoPath):
    imgSrc.append(urlparse(logoPath).netloc)

TALISMAN_CONFIG = {
    "content_security_policy": {
        "base-uri": ["'self'"],
        "default-src": ["'self'"],
        "img-src": imgSrc,
        "worker-src": ["*"],
        "connect-src": [
            "'self'",
            "https://api.mapbox.com",
            "https://events.mapbox.com",
        ],
        "object-src": "'none'",
        "style-src": [
            "'self'",
            "'unsafe-inline'",
        ],
        "script-src": ["'self'", "'strict-dynamic'"],
    },
    "content_security_policy_nonce_in": ["script-src"],
    "force_https": False,
    "session_cookie_secure": False,
}

FAB_ADD_SECURITY_API = True

# FAB Rate limiting: this is a security feature for preventing DDOS attacks. The
# feature is on by default to make Superset secure by default, but you should
# fine tune the limits to your needs. You can read more about the different
# parameters here: https://flask-limiter.readthedocs.io/en/stable/configuration.html
RATELIMIT_ENABLED = True
RATELIMIT_APPLICATION = "50 per second"
AUTH_RATE_LIMITED = True
AUTH_RATE_LIMIT = "5 per second"
# A storage location conforming to the scheme in storage-scheme. See the limits
# library for allowed values: https://limits.readthedocs.io/en/stable/storage.html
RATELIMIT_STORAGE_URI = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RATELIMIT_DB}"
# A callable that returns the unique identity of the current request.
# RATELIMIT_REQUEST_IDENTIFIER = flask.Request.endpoint

# Authentication
AUTH_TYPE = AUTH_DB if os.getenv('AUTH_TYPE') == 'DB' else AUTH_OAUTH
OAUTH_PROVIDERS = [
{
    'name': 'google',
    'icon': 'fa-google',
    'token_key': 'access_token',
    'remote_app': {
        'api_base_url': 'https://www.googleapis.com/oauth2/v2/',
        'client_kwargs': {
            'scope': 'email profile'
        },
        'request_token_url': None,
        'access_token_url': 'https://accounts.google.com/o/oauth2/token',
        'authorize_url': 'https://accounts.google.com/o/oauth2/auth',
        'client_id': '$GOOGLE_AUTH_CLIENT_ID',
        'client_secret': '$GOOGLE_AUTH_CLIENT_SECRET'
    }
}]

CUSTOM_SECURITY_MANAGER = None if os.getenv('AUTH_TYPE') == 'DB' else CustomSsoSecurityManager

FEATURE_FLAGS = {
    'ENABLE_TEMPLATE_PROCESSING': True,
    'TAGGING_SYSTEM': True,
    'THUMBNAILS': True,
    'THUMBNAILS_SQLA_LISTENERS': True,
    'ALERT_REPORTS': True,
    'ALERT_REPORT_TABS': True,
    'DASHBOARD_RBAC': True,
    'LISTVIEWS_DEFAULT_CARD_VIEW': True,
    'DRILL_BY': True

}

class CeleryConfig(object):
    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
    imports = (
        "superset.sql_lab",
        "superset.tasks.scheduler",
        "superset.tasks.thumbnails",
    )
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}"
    worker_prefetch_multiplier = 1
    worker_concurrency = 2
    task_acks_late = True
    task_annotations = {
        "sql_lab.get_sql_results": {
            "rate_limit": "100/s",
        },
    }
    beat_schedule = {
        "reports.scheduler": {
            "task": "reports.scheduler",
            "schedule": crontab(minute="*", hour="*"),
        },
        "reports.prune_log": {
            "task": "reports.prune_log",
            "schedule": crontab(minute=0, hour=0),
        },
    }

CELERY_CONFIG = CeleryConfig

CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": int(timedelta(minutes=1).total_seconds()),
    "CACHE_KEY_PREFIX": "superset_cache",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CACHE_DB}",
}

DATA_CACHE_CONFIG = {
    **CACHE_CONFIG,
    "CACHE_DEFAULT_TIMEOUT": int(timedelta(seconds=30).total_seconds()),
    "CACHE_KEY_PREFIX": "superset_data_cache",
}

FILTER_STATE_CACHE_CONFIG = {
    "CACHE_TYPE": "SimpleCache",
    "CACHE_THRESHOLD": math.inf,
    "CACHE_DEFAULT_TIMEOUT": int(timedelta(minutes=10).total_seconds()),
}

EXPLORE_FORM_DATA_CACHE_CONFIG = {
    "CACHE_TYPE": "SimpleCache",
    "CACHE_THRESHOLD": math.inf,
    "CACHE_DEFAULT_TIMEOUT": int(timedelta(minutes=10).total_seconds()),
}

THUMBNAIL_CACHE_CONFIG: CacheConfig = {
    'CACHE_TYPE': 'RedisCache',
    'CACHE_DEFAULT_TIMEOUT': 7* 86400, # 7 days
    'CACHE_KEY_PREFIX': 'thumbnail_',
    # 'CACHE_NO_NULL_WARNING': True,
    'CACHE_REDIS_HOST': REDIS_HOST,
    'CACHE_REDIS_PORT': REDIS_PORT,
    'CACHE_REDIS_DB': REDIS_CELERY_DB
}
# Async selenium thumbnail task will use the following user
THUMBNAIL_SELENIUM_USER = os.getenv('ADMIN_USER')
THUMBNAIL_EXECUTE_AS = [ExecutorType.SELENIUM]

WEBDRIVER_TYPE = "firefox"

WEBDRIVER_BASEURL = os.getenv('WEBDRIVER_BASEURL')
WEBDRIVER_OPTION_ARGS = [
        "--force-device-scale-factor=2.0",
        "--high-dpi-support=2.0",
        "--headless",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-extensions",
        ]

SCREENSHOT_LOCATE_WAIT = 1000
SCREENSHOT_LOAD_WAIT = 2000

# Email configuration
SMTP_HOST = "smtp.mailersend.net"
SMTP_PORT = 587 
SMTP_STARTTLS = True
SMTP_SSL_SERVER_AUTH = True
SMTP_SSL = False
SMTP_USER = "$SMTP_USER" 
SMTP_PASSWORD = "$SMTP_PASSWORD" 
SMTP_MAIL_FROM = "noreply@supersettest.com"
EMAIL_REPORTS_SUBJECT_PREFIX = "[supersettest] "
# The base URL for the email report hyperlinks.
WEBDRIVER_BASEURL_USER_FRIENDLY = WEBDRIVER_BASEURL

Screenshots/recordings

No response

Superset version

4.0.1

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

dosubot[bot] commented 2 weeks ago

Hey there, @brantian! I'm a bot here to help you with bugs, questions, and becoming a contributor. I'm here to assist you while you wait for a human maintainer.

The issue you're encountering with different cache keys for chart thumbnails versus dashboard thumbnails likely stems from a discrepancy in how the cache keys are generated and used in the code.

For dashboards, the cache keys are consistent, but for charts, it seems there is a mismatch. This can be due to differences in the key prefix or the way the keys are constructed in the code.
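
To illustrate how such a mismatch can arise, here is a minimal sketch of a key derived by hashing the screenshot's inputs. The function make_cache_key and its fields are hypothetical and are not Superset's actual implementation. The point is only that any input that differs between the two processes (for example, a size argument that one side passes and the other leaves unset) produces a different key:

```python
import hashlib
import json


def make_cache_key(digest, window_size, thumb_size):
    """Hypothetical key: MD5 over a canonical JSON dump of the inputs."""
    payload = {
        "digest": digest,
        "window_size": window_size,
        "thumb_size": thumb_size,
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


# Illustrative values only; not taken from Superset's defaults.
web_key = make_cache_key("abc123", window_size=(800, 600), thumb_size=(400, 300))
worker_key = make_cache_key("abc123", window_size=(800, 600), thumb_size=None)

print(web_key == worker_key)  # False: one differing input changes the key
```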

Here are a few steps to troubleshoot and potentially resolve this issue:

  1. Check Cache Key Prefixes: Ensure that the cache key prefixes for charts and dashboards are consistent. In your superset_config.py, you have:

    THUMBNAIL_CACHE_CONFIG: CacheConfig = {
       'CACHE_TYPE': 'RedisCache',
       'CACHE_DEFAULT_TIMEOUT': 7 * 86400,  # 7 days
       'CACHE_KEY_PREFIX': 'thumbnail_',
       'CACHE_REDIS_HOST': REDIS_HOST,
       'CACHE_REDIS_PORT': REDIS_PORT,
       'CACHE_REDIS_DB': REDIS_CELERY_DB
    }

    Verify that the CACHE_KEY_PREFIX is correctly used in both the web app and Celery worker for chart thumbnails.

  2. Review Thumbnail Cache Code: Check the code in superset/tasks/thumbnails.py where the cache keys are generated and used. Ensure that the same logic is applied for both charts and dashboards.

    For example, in the cache_chart_thumbnail task, ensure the cache key is generated consistently:

    @celery_app.task(name="cache_chart_thumbnail", soft_time_limit=300)
    def cache_chart_thumbnail(current_user, chart_id, force=False, thumb_size=None):
       from superset.models.slice import Slice
    
       if not thumbnail_cache:
           logging.warning("No cache set, refusing to compute")
           return
       chart = Slice.get(chart_id)
       url = get_url_path("Superset.slice", slice_id=chart.id)
    
       logger.info("Caching chart: %s", url)
       _, username = get_executor(
           executor_types=current_app.config["THUMBNAIL_EXECUTE_AS"],
           model=chart,
           current_user=current_user,
       )
       user = security_manager.find_user(username)
       with override_user(user):
           screenshot = ChartScreenshot(url, chart.digest)
           screenshot.compute_and_cache(
               user=user,
               cache=thumbnail_cache,
               force=force,
               thumb_size=thumb_size,
           )

  3. Debug Cache Key Generation: Add debug logs to print out the cache keys being generated in both the web app and Celery worker. This will help you identify where the discrepancy occurs.

  4. Ensure Consistent Cache Configuration: Make sure that the THUMBNAIL_CACHE_CONFIG is correctly applied in both the web app and Celery worker. Sometimes, configuration might not be correctly propagated to all parts of the application.
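
Steps 3 and 4 could be sketched as follows. This is a minimal example, not Superset code: config_fingerprint and log_key_inputs are hypothetical helpers, and the logger name is made up. The idea is to call them from both the web app and the Celery worker and then compare the resulting log lines.

```python
import hashlib
import json
import logging

logger = logging.getLogger("thumbnail_debug")  # hypothetical logger name


def config_fingerprint(cache_config):
    """Return a short, stable hash of a cache config dict.

    Logging this in both processes shows whether they loaded the same
    THUMBNAIL_CACHE_CONFIG without printing the settings themselves.
    """
    encoded = json.dumps(cache_config, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def log_key_inputs(process_name, cache_key, **inputs):
    """Log the cache key next to every input that went into it."""
    details = ", ".join(
        f"{name}={value!r}" for name, value in sorted(inputs.items())
    )
    message = f"[{process_name}] cache_key={cache_key} inputs: {details}"
    logger.debug(message)
    return message


# Example values; the config keys mirror the ones quoted in step 1.
example_config = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_KEY_PREFIX": "thumbnail_",
    "CACHE_DEFAULT_TIMEOUT": 7 * 86400,
}
print(config_fingerprint(example_config))
print(log_key_inputs("web", "9085680bf97354826f9065a4ffe54d4e", chart_id=93))
```

If the fingerprints match but the keys do not, the configuration is probably not the cause, and the inputs listed next to each key show which argument differs.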

By following these steps, you should be able to identify and resolve the issue with inconsistent cache keys for chart thumbnails. If the problem persists, you might need to dive deeper into the specific implementation details of the thumbnail caching logic in the Superset codebase [1][2].
