elastic / apm-agent-python

https://www.elastic.co/guide/en/apm/agent/python/current/index.html
BSD 3-Clause "New" or "Revised" License

Skipping PostgreSQL dollar-quoted string constants extremely slow for queries with many query parameters #1851

Closed — gkoller closed this issue 9 months ago

gkoller commented 1 year ago

Describe the bug:

PostgreSQL has a feature called dollar-quoted string constants. apm-agent-python's instrumentation.packages.dbapi2.scan function has logic to detect these dollar-quoted string constants and skip over them.

However, this code path is also triggered by query parameters, since query parameters also start with a dollar character ($). Every $ makes the scanner treat what follows as a potential dollar quote and search ahead for a closing delimiter, so for queries with a very large number of query parameters (e.g., 3000-4000) this code path is a severe bottleneck, as determined by running the app under PyCharm with yappi. Such a query is handled in just tens of milliseconds (30-40 ms) by PostgreSQL, but apm-agent-python then spends more than 10 seconds (!!!) on it. Almost all of that time is spent in the skip_to function.

Leading to graphs like these:

[screenshot: screenshot_20230612_113424]

What kind of a query takes up to 3000 query parameters? Queries with large VALUES lists.

SELECT *
FROM (VALUES ($1::varchar, $2::varchar, $3::varchar),
             ($4::varchar, $5::varchar, $6::varchar),
             ($7::varchar, $8::varchar, $9::varchar),
             ...
             ($2992::varchar, $2993::varchar, $2994::varchar),
             ($2995::varchar, $2996::varchar, $2997::varchar),
             ($2998::varchar, $2999::varchar, $3000::varchar))

BTW, the above query is a simplified example. In our use case that large VALUES list is part of a much more elaborate CTE (WITH statement).
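For illustration, here is one way such a statement ends up with thousands of placeholders; build_values_query is a hypothetical helper sketched for this issue, not code from the agent or from our application:

```python
# Hypothetical helper (illustration only): build a SELECT over a VALUES
# list using PostgreSQL $n placeholders, one placeholder per value, so
# 1000 rows of 3 columns produce $1 .. $3000.
def build_values_query(rows):
    chunks = []
    for row_idx in range(len(rows)):
        base = row_idx * 3
        chunks.append(
            f"(${base + 1}::varchar, ${base + 2}::varchar, ${base + 3}::varchar)"
        )
    return "SELECT * FROM (VALUES\n" + ",\n".join(chunks) + ")"


sql = build_values_query([("a", "b", "c")] * 1000)
print(sql.count("$"))  # 3000 dollar signs for the scanner to consider
```

Every one of those 3000 dollar signs is a point where the scanner may start searching for a closing quote.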

To Reproduce

  1. Use SQL queries with a large VALUES list consisting of query parameters.


basepi commented 1 year ago

Not a great experience! Thanks for the report, this is something we definitely need to fix.

basepi commented 1 year ago

This has been improved greatly by @alexmojaki in #1905. One remaining change to finish fixing this issue:

The code assumes that anything between 2 dollar signs is potentially a dollar quote. But according to https://www.postgresql.org/docs/15/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING:

The tag, if any, of a dollar-quoted string follows the same rules as an unquoted identifier, except that it cannot contain a dollar sign

From further up:

SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).

Checking that the dollar quote is valid should avoid a lot of pointless searching for a closing quote, which should resolve the use case in the issue. It also seems important for correctness.
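As a sketch of that validity check (not the agent's actual code): a dollar-quote tag must be empty or a valid identifier containing no dollar signs, per the PostgreSQL lexical rules quoted above. In a Python regex, `[^\W\d]` matches letters and underscore but not digits, so a leading digit (i.e., a query parameter like $1) is rejected:

```python
import re

# A valid dollar-quote delimiter is "$" + optional identifier tag + "$",
# where the tag starts with a letter or underscore and contains only
# letters, digits and underscores (never "$").
DOLLAR_QUOTE_RE = re.compile(r"^\$([^\W\d]\w*)?\$")


def is_dollar_quote_start(sql, i):
    """Return True if sql[i:] begins a valid dollar-quote delimiter."""
    return DOLLAR_QUOTE_RE.match(sql[i:]) is not None


print(is_dollar_quote_start("$fish$ literal $fish$", 0))  # True
print(is_dollar_quote_start("$1::varchar", 0))            # False: tag starts with a digit
print(is_dollar_quote_start("$$ literal $$", 0))          # True: empty tag is valid
```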

xrmx commented 9 months ago

Adding a shortcut that skips the search for the closing $ token when the character after the $ is a digit:

diff --git a/elasticapm/instrumentation/packages/dbapi2.py b/elasticapm/instrumentation/packages/dbapi2.py
index fb49723c..51b41e68 100644
--- a/elasticapm/instrumentation/packages/dbapi2.py
+++ b/elasticapm/instrumentation/packages/dbapi2.py
@@ -33,6 +33,7 @@
 https://www.python.org/dev/peps/pep-0249/
 """

+import string
 import re

 import wrapt
@@ -85,6 +86,7 @@ def scan(tokens):
     literal_started = None
     prev_was_escape = False
     lexeme = []
+    digits = set(string.digits)

     i = 0
     while i < len(tokens):
@@ -114,6 +116,11 @@ def scan(tokens):
                 literal_start_idx = i
                 literal_started = token
             elif token == "$":
+                # exclude query parameters that have a digit as next char
+                if len(tokens) > i + 1 and tokens[i + 1] in digits:
+                    yield i, token
+                    i += 1
+                    continue
                 # Postgres can use arbitrary characters between two $'s as a
                 # literal separation token, e.g.: $fish$ literal $fish$
                 # This part will detect that and skip over the literal.
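
The effect of the shortcut can be sketched as a standalone toy outside the diff; classify_dollars is illustrative only, not part of the agent's API:

```python
import string

# Toy version of the shortcut: walk a query and classify each "$" as a
# parameter marker (a digit follows) or a potential dollar-quote opener
# (anything else). Only the first kind is safe to skip without searching
# for a closing delimiter.
def classify_dollars(sql):
    digits = set(string.digits)
    kinds = []
    for i, ch in enumerate(sql):
        if ch == "$":
            if i + 1 < len(sql) and sql[i + 1] in digits:
                kinds.append((i, "parameter"))
            else:
                kinds.append((i, "quote?"))
    return kinds


print(classify_dollars("SELECT $1, $fish$"))
# → [(7, 'parameter'), (11, 'quote?'), (16, 'quote?')]
```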

And the following test case replicating the reported issue:

diff --git a/tests/instrumentation/dbapi2_tests.py b/tests/instrumentation/dbapi2_tests.py
index 3d72b663..4b9b1503 100644
--- a/tests/instrumentation/dbapi2_tests.py
+++ b/tests/instrumentation/dbapi2_tests.py
@@ -122,6 +122,24 @@ def test_extract_signature_bytes():
     assert actual == expected

+def test_extract_signature_pathological():
+    multiplier = 2000
+    values = []   
+    for chunk in range(multiplier):
+        i = chunk * 3
+        values.append(f" (${1+i}::varchar, ${2+i}::varchar, ${3+i}::varchar), ")
+    sql = (
+      "SELECT * FROM (VALUES\n"
+      f"{''.join(values)}"
+      ")"
+    )
+
+    actual = extract_signature(sql)
+    expected = "SELECT A"
+    assert actual == expected
+
+
 @pytest.mark.parametrize(
     ["sql", "expected"],
     [

The test takes the same time with or without the shortcut. Speaking of correctness, I've verified that the code behaves differently if I repeat the same values in the SQL query: the current code is faster there because it finds big chunks of (invalid) $ quotes instead of going through one character at a time.

xrmx commented 9 months ago

I've improved the correctness a bit in #1976. If we end up needing a fully correct SQL parser, though, I'd investigate using a third-party library instead. Closing.