elastic / apm-agent-python

https://www.elastic.co/guide/en/apm/agent/python/current/index.html
BSD 3-Clause "New" or "Revised" License
409 stars 214 forks source link

Repeating query in django view that failed due to OperationalError (timeout) #2114

Open Pokapiec opened 1 week ago

Pokapiec commented 1 week ago

Describe the bug: When in django view some expensive query gets executed and it exceeds statement_timeout, OperationalError occurs and it makes repeatedly the same query X times (where X is probably the amount of stack traces pointing to code parts attached to QuerySet django class). Basically from what i've seen in the elasticapm codebase, elasticapm.utils.encoding.transform function executes repr internal python func on QuerySet class and under the hood, it refetches the whole query.

In our project we use forked version of elastic-apm library with small change where we add info about current tenant to each transaction in celery.

To Reproduce

  1. Write django view where you execute some ORM SELECT query
  2. Set statement_timeout session variable to some low value like 1ns to simulate OperationalError (SET LOCAL statement_timeout='1ns')
  3. Log all sql statements executed in view in django server logs (see repeating statement)

Environment (please complete the following information)

Additional context

Add any other context about the problem here.

ELASTIC_APM = {
    "ENABLED": USE_ELASTIC_APM,
    "SERVICE_NAME": "tenants",
    "SECRET_TOKEN": config("ELASTIC_APM_SECRET_TOKEN", cast=str),
    "SERVER_URL": config("ELASTIC_APM_SERVER_URL", cast=str),
    "ENVIRONMENT": config("ELASTIC_APM_ENVIRONMENT", cast=str, default="production"),
    "TRANSACTION_SAMPLE_RATE": config("ELASTIC_APM_TRANSACTION_SAMPLE_RATE", cast=float, default="0.01"),
    "PROCESSORS": (
        "core.elasticapm.sanitize_email",
        "core.elasticapm.sanitize_cookie",
        "core.elasticapm.add_transaction_sampling",
        "core.elasticapm.drop_unsampled_spans",
        "core.elasticapm.add_transaction_url_name",
        "core.elasticapm.add_sql_context",
    ),
    "DEBUG": config("ELASTIC_APM_DEBUG", cast=bool, default=False),
    "CAPTURE_HEADERS": False,  # No need for HTTP headers
    "LOG_LEVEL": "off",  # We caputre logs not via APM but via filebeat
    "AUTO_LOG_STACKS": False,  # Exceptions stack traces are kept in sentry
}

aioapns = "3.2" beautifulsoup4 = "~4.12" bleach = "~6.1" boto3 = "~1.34" celery = "5.4.0" chardet = "~5.2" dateparser = "~1.2" Django = "4.2.15" djangorestframework = "~3.15" django-cacheops = "7.0.2" django-countries = "~7.4" django-cors-headers = "~4.4.0" django-filter = "~23.1" django-fsm = "~2.8" django-ipware = "~5.0" django-mptt = "~0.14" django-ordered-model = "~3.7" django-recaptcha = "~3.0" django-storages = "~1.13" dnspython = "~2.6" drf-yasg = "~1.21" easy-thumbnails = "~2.8" factory-boy = "~3.3" fcm-django = "~2.1.0" geoip2 = "~4.1.0" google-api-python-client = "~1.12.8" hashids = "~1.3.1" html2text = "~2020.1.16" IMAPClient = "~2.3" lupa = "2.1" messente-api = "1.5.0" msal = "~1.25" msgpack = "~1.0.0" oauth2client = "~4.1.3" phonenumberslite = "~8.12.3" psycopg2-binary = "~2.9.5" python-decouple = "~3.4" python-facebook-api = "0.18.0" python-json-logger = "2.0.2" python-magic = "~0.4" python-slugify = "~4.0.1" pytz = "~2021.3" pywebpush = "~1.11.0" redis = "4.5.4" sentry-sdk = "1.45.0" ShopifyAPI = "9.0.0" smsapi-client = "2.9.6" tqdm = "~4.66" twilio = "9.2.3" ujson = "~5.8" user-agents = "~2.2.0" google-auth = "2.29.0"

used by third-party apps

Brotli = "~1.0.9" clicksend-client = "^5.0.72" html5lib = "~1.1" lxml = "~5.1.0" Pillow = "~10.3" celery-redbeat = "2.2.0" ipdb = "0.13.4" ipython = "~8.18" libmagic = "~1.0" ndg-httpsclient = "~0.5.1" uWSGI = "2.0.26"

[tool.poetry.group.internal.dependencies]

This is just version of elastic-apm=6.13.1 but with minor patch

elastic-apm = { git = "https://poetry:xxx@some-domain.com/developers/elasticapm_tenants.git", tag = "v6.13.1-our" }

[tool.poetry.dev-dependencies] argh = "0.31.2" django-debug-toolbar = "" django-extensions = "~3.2.0" pyroscope-io = "~0.8" tblib = "" watchdog = "2.1.9" black = "22.12.0" coverage = "~7.0.0" dotmap = "~1.3.8" fakeredis = "~2.22" flake8 = "6.0.0" safety = "~3.2" selenium = "~3.141.0" sphinx = "6.2.1" recommonmark = "0.7.1" nbsphinx = "0.8.2" sphinx-rtd-theme = "0.5.1" sphinx-autobuild = "2021.3.14"


</details>
xrmx commented 1 week ago

Thanks for reporting. I think this is expected given how the Django QuerySet works, they are lazy and evaluated only when used. Said that some special casing for the case may be possible if the QuerySet has not been already evaluated and there is a way to know it is not already cached?

Pokapiec commented 1 week ago

If the query to database was too heavy to be evaluated there is no way it was cached. You can only cache QuerySet that is computable (it's execution time fits in statement timeout).

xrmx commented 1 week ago

If the query to database was too heavy to be evaluated there is no way it was cached. You can only cache QuerySet that is computable (it's execution time fits in statement timeout).

Exactly, so if we don't find it cached it means we should not evaluate it.

Pokapiec commented 1 week ago

Are you saying that there is some kind of safe-guard against evaluating non-cached QuerySets ? because i could not find such condition in library source code. This util function elasticapm.utils.encoding.transform does not consider if object is of type QuerySet and therefore it tries evaluating QuerySet (at this line https://github.com/elastic/apm-agent-python/blob/main/elasticapm/utils/encoding.py#L149) repeatedly for every stack trace regardless of the cache in it.

If it gives more context, the process of repeating SQL queries happens when got_request_exception django signal is being sent.

xrmx commented 1 week ago

Are you saying that there is some kind of safe-guard against evaluating non-cached QuerySets ?

No, I'm saying that it may be possible to add