Not a great experience! Thanks for the report, this is something we definitely need to fix.
This has been improved greatly by @alexmojaki in #1905. One remaining change to finish fixing this issue:
The code assumes that anything between 2 dollar signs is potentially a dollar quote. But according to https://www.postgresql.org/docs/15/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING:
The tag, if any, of a dollar-quoted string follows the same rules as an unquoted identifier, except that it cannot contain a dollar sign
From further up:
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).
Checking that the dollar quote is valid should avoid a lot of pointless searching for a closing quote, which should resolve the use case in the issue. It also seems important for correctness.
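As a rough sketch of that check (not the agent's actual code; the character classes are ASCII-only, whereas PostgreSQL also allows non-Latin letters and diacritics), a valid opening tag could be recognized up front with a regular expression before any search for the closing tag is attempted:

import re

# Dollar-quote tag: empty ($$) or an identifier-like tag without dollar signs,
# i.e. a letter/underscore followed by letters, digits or underscores (ASCII-only here).
DOLLAR_QUOTE_START = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)?\$")

def opening_dollar_quote(sql, pos):
    """Return the opening tag (e.g. '$fish$' or '$$') if sql[pos:] starts
    with a valid dollar quote, otherwise None."""
    match = DOLLAR_QUOTE_START.match(sql, pos)
    return match.group(0) if match else None

assert opening_dollar_quote("$1::varchar", 0) is None          # query parameter, no search needed
assert opening_dollar_quote("$fish$ literal $fish$", 0) == "$fish$"
assert opening_dollar_quote("$$ body $$", 0) == "$$"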
Adding a shortcut to skip the search for the closing $ token when the character after the $ is a digit:
diff --git a/elasticapm/instrumentation/packages/dbapi2.py b/elasticapm/instrumentation/packages/dbapi2.py
index fb49723c..51b41e68 100644
--- a/elasticapm/instrumentation/packages/dbapi2.py
+++ b/elasticapm/instrumentation/packages/dbapi2.py
@@ -33,6 +33,7 @@
https://www.python.org/dev/peps/pep-0249/
"""
+import string
import re
import wrapt
@@ -85,6 +86,7 @@ def scan(tokens):
literal_started = None
prev_was_escape = False
lexeme = []
+ digits = set(string.digits)
i = 0
while i < len(tokens):
@@ -114,6 +116,11 @@ def scan(tokens):
literal_start_idx = i
literal_started = token
elif token == "$":
+ # exclude query parameters that have a digit as next char
+ if len(tokens) > i + 1 and tokens[i + 1] in digits:
+ yield i, token
+ i += 1
+ continue
# Postgres can use arbitrary characters between two $'s as a
# literal separation token, e.g.: $fish$ literal $fish$
# This part will detect that and skip over the literal.
And the following test case replicating the reported issue:
diff --git a/tests/instrumentation/dbapi2_tests.py b/tests/instrumentation/dbapi2_tests.py
index 3d72b663..4b9b1503 100644
--- a/tests/instrumentation/dbapi2_tests.py
+++ b/tests/instrumentation/dbapi2_tests.py
@@ -122,6 +122,24 @@ def test_extract_signature_bytes():
assert actual == expected
+def test_extract_signature_pathological():
+ multiplier = 2000
+ values = []
+ for chunk in range(multiplier):
+ i = chunk * 3
+ values.append(f" (${1+i}::varchar, ${2+i}::varchar, ${3+i}::varchar), ")
+ sql = (
+ "SELECT * FROM (VALUES\n"
+ f"{''.join(values)}"
+ ")"
+ )
+
+ actual = extract_signature(sql)
+ expected = "SELECT A"
+ assert actual == expected
+
+
@pytest.mark.parametrize(
["sql", "expected"],
[
The test takes the same time with or without the shortcut. Speaking of correctness, I've verified that it behaves differently if I repeat the same values in the SQL query: the current code gets a speedup because it finds big chunks of (invalid) $ quotes instead of going through one character at a time.
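For reference, a rough way to measure this locally (a sketch only; it assumes extract_signature is importable from elasticapm.instrumentation.packages.dbapi2, as in the test above, and builds the same pathological query):

import timeit
from elasticapm.instrumentation.packages.dbapi2 import extract_signature

# Same shape as the test: thousands of $N::varchar parameters in a VALUES list.
values = []
for chunk in range(2000):
    i = chunk * 3
    values.append(f" (${1 + i}::varchar, ${2 + i}::varchar, ${3 + i}::varchar), ")
sql = "SELECT * FROM (VALUES\n" + "".join(values) + ")"

# Before the fix, a single call took on the order of seconds.
print(timeit.timeit(lambda: extract_signature(sql), number=5))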
I've improved the correctness a bit in #1976. If we end up needing a fully correct SQL parser, I'd investigate using a third-party library instead. Closing.
Describe the bug:
PostgreSQL has something called dollar-quoted string constants. apm-agent-python's instrumentation.packages.dbapi2.scan function has logic to detect these dollar-quoted string constants and skip them. However, this logic/code path is also triggered when it comes across query parameters. After all, query parameters also start with a dollar character ($). For queries with a very large number of query parameters (e.g. 3000-4000), this code path is a severe bottleneck, as determined by running the app under PyCharm with yappi. Such a query can be handled in just tens of milliseconds (30-40 ms) by PostgreSQL, but apm-agent-python subsequently takes more than 10 sec. (!!!). Almost all the time is spent in the skip_to function. Leading to graphs like these: [profiler screenshots omitted]
What kind of query takes up to 3000 query parameters? Queries with large VALUES lists.
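An illustrative sketch of a query with that shape (the exact query from the report is not reproduced here; the alias and column names below are made up), using the $N placeholders of PostgreSQL prepared statements and drivers such as asyncpg:

# 1000 rows x 3 columns = 3000 query parameters ($1 .. $3000).
rows, columns = 1000, 3
placeholders = ", ".join(
    "(" + ", ".join(f"${row * columns + col + 1}" for col in range(columns)) + ")"
    for row in range(rows)
)
sql = f"SELECT * FROM (VALUES {placeholders}) AS t(a, b, c)"
# -> SELECT * FROM (VALUES ($1, $2, $3), ($4, $5, $6), ...) AS t(a, b, c)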
BTW the above query is a simplified example. In our use case that large VALUES list is part of a much more elaborate CTE (WITH statement).
To Reproduce
Issue a query with a large VALUES list consisting of query parameters.
Environment (please complete the following information)
python:3.10.11 Docker image