Closed findepi closed 1 week ago
@alamb @tustvold please take a look
cc @samuelcolvin @Dandandan @ozankabak
Please can you refrain from repeatedly tagging people on your PRs, we will get to your PRs when we have time, but spamming people needlessly is just rude
Apologies if i tagged you (or anyone else) repeatedly on a single PR, especially if it was in a short period of time.
Rebased to resolve conflict with https://github.com/apache/arrow-rs/pull/6662
Benchmark results (basically no change in my opinion):
++ critcmp master findepi_fix-like-with-escapes-792c56
group findepi_fix-like-with-escapes-792c56 master
----- ------------------------------------ ------
ilike_utf8 scalar complex 1.00 2.6±0.07ms ? ?/sec 1.01 2.7±0.06ms ? ?/sec
ilike_utf8 scalar contains 1.03 4.3±0.05ms ? ?/sec 1.00 4.1±0.05ms ? ?/sec
ilike_utf8 scalar ends with 1.00 1202.7±35.59µs ? ?/sec 1.03 1236.9±23.92µs ? ?/sec
ilike_utf8 scalar equals 1.00 747.4±16.29µs ? ?/sec 1.05 784.2±23.50µs ? ?/sec
ilike_utf8 scalar starts with 1.00 1102.1±30.61µs ? ?/sec 1.02 1127.3±28.58µs ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4]) 1.00 77.6±0.08µs ? ?/sec 1.00 77.7±0.10µs ? ?/sec
like_utf8 scalar complex 1.00 1845.0±21.89µs ? ?/sec 1.02 1889.4±18.40µs ? ?/sec
like_utf8 scalar contains 1.00 1549.6±14.84µs ? ?/sec 1.01 1570.9±13.61µs ? ?/sec
like_utf8 scalar ends with 1.00 410.3±14.51µs ? ?/sec 1.02 420.4±10.65µs ? ?/sec
like_utf8 scalar equals 1.00 127.0±0.52µs ? ?/sec 1.00 127.0±0.27µs ? ?/sec
like_utf8 scalar starts with 1.00 336.7±19.19µs ? ?/sec 1.01 339.4±5.83µs ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4]) 1.00 77.5±0.12µs ? ?/sec 1.00 77.4±0.05µs ? ?/sec
like_utf8view scalar complex 1.00 182.4±0.72ms ? ?/sec 1.00 183.1±0.44ms ? ?/sec
like_utf8view scalar contains 1.03 142.5±0.23ms ? ?/sec 1.00 138.8±0.22ms ? ?/sec
like_utf8view scalar ends with 13 bytes 1.00 44.3±0.38ms ? ?/sec 1.08 47.8±0.10ms ? ?/sec
like_utf8view scalar ends with 4 bytes 1.00 45.0±0.18ms ? ?/sec 1.09 48.9±0.18ms ? ?/sec
like_utf8view scalar ends with 6 bytes 1.00 44.8±0.13ms ? ?/sec 1.08 48.6±0.17ms ? ?/sec
like_utf8view scalar equals 1.00 35.8±0.16ms ? ?/sec 1.01 36.0±0.07ms ? ?/sec
like_utf8view scalar starts with 13 bytes 1.01 47.8±0.15ms ? ?/sec 1.00 47.3±0.16ms ? ?/sec
like_utf8view scalar starts with 4 bytes 1.00 34.3±0.09ms ? ?/sec 1.00 34.3±0.10ms ? ?/sec
like_utf8view scalar starts with 6 bytes 1.01 48.2±0.20ms ? ?/sec 1.00 47.8±0.15ms ? ?/sec
nilike_utf8 scalar complex 1.00 2.6±0.06ms ? ?/sec 1.02 2.7±0.07ms ? ?/sec
nilike_utf8 scalar contains 1.03 4.2±0.06ms ? ?/sec 1.00 4.1±0.05ms ? ?/sec
nilike_utf8 scalar ends with 1.00 1189.1±24.87µs ? ?/sec 1.03 1229.4±22.89µs ? ?/sec
nilike_utf8 scalar equals 1.00 755.0±17.17µs ? ?/sec 1.04 787.2±27.12µs ? ?/sec
nilike_utf8 scalar starts with 1.00 1124.4±45.36µs ? ?/sec 1.01 1136.6±33.68µs ? ?/sec
nlike_utf8 scalar complex 1.00 1855.2±39.69µs ? ?/sec 1.02 1887.3±29.24µs ? ?/sec
nlike_utf8 scalar contains 1.00 1548.3±12.27µs ? ?/sec 1.01 1568.2±12.93µs ? ?/sec
nlike_utf8 scalar ends with 1.00 411.8±17.08µs ? ?/sec 1.02 420.5±7.53µs ? ?/sec
nlike_utf8 scalar equals 1.00 126.7±0.14µs ? ?/sec 1.00 127.0±0.39µs ? ?/sec
nlike_utf8 scalar starts with 1.00 328.4±5.42µs ? ?/sec 1.05 345.2±13.53µs ? ?/sec
++ popd
Description
Fix LIKE processing for patterns containing escapes
There are two tests
Test case generation
Show
The below script isn't simples possible, because it was attempted to generate more test cases by adding padding. Hence e.g. is_like_without_dangling_escape. Since this is attached for reference, should be attached as-is. ```python #!/usr/bin/env python import psycopg2 data = r""" \ \\ \\\ \\\\ a \a \\a % \% \\% %% \%% \\%% _ \_ \\_ __ \__ \\__ abc a_c a\bc a\_c %abc \%abc a\\_c% """.split('\n') # Deduplicate, keeping order data = list(dict.fromkeys(data)) conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword') conn.set_session(autocommit=True) cursor = conn.cursor() for r in data: try: # PostgreSQL verifies dandling escape only sometimes cursor.execute(f"SELECT %s LIKE %s", (r, r)) is_like, = cursor.fetchone() has_dandling_escape = False pg_pattern = r except Exception as e: if 'LIKE pattern must not end with escape character' not in str(e): raise e has_dandling_escape = True pg_pattern = r + '\\' for l in data: # print() # print(' '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern))) cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern)) is_like, = cursor.fetchone() assert type(is_like) is bool if not is_like and has_dandling_escape: pattern_without_escaped_dandling_escape = pg_pattern[:-2] cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape)) is_like_without_dangling_escape, = cursor.fetchone() assert type(is_like_without_dangling_escape) is bool else: is_like_without_dangling_escape = False assert '"' not in l assert '"' not in r print('(r"%s", r"%s", %s),' % ( l, r, str(is_like).lower(), # str(has_dandling_escape).lower(), # str(is_like_without_dangling_escape).lower(), )) ```Which issue does this PR close?
Fixes https://github.com/apache/arrow-rs/issues/6702
Rationale for this change
Correctness fix for LIKE.
What changes are included in this PR?
Correctness fix for LIKE.
Are there any user-facing changes?
Yes