apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.62k stars 802 forks source link

Fix LIKE with escapes #6703

Closed findepi closed 1 week ago

findepi commented 2 weeks ago

Description

Fix LIKE processing for patterns containing escapes

There are two tests

Test case generation

Show The below script isn't simples possible, because it was attempted to generate more test cases by adding padding. Hence e.g. is_like_without_dangling_escape. Since this is attached for reference, should be attached as-is. ```python #!/usr/bin/env python import psycopg2 data = r""" \ \\ \\\ \\\\ a \a \\a % \% \\% %% \%% \\%% _ \_ \\_ __ \__ \\__ abc a_c a\bc a\_c %abc \%abc a\\_c% """.split('\n') # Deduplicate, keeping order data = list(dict.fromkeys(data)) conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword') conn.set_session(autocommit=True) cursor = conn.cursor() for r in data: try: # PostgreSQL verifies dandling escape only sometimes cursor.execute(f"SELECT %s LIKE %s", (r, r)) is_like, = cursor.fetchone() has_dandling_escape = False pg_pattern = r except Exception as e: if 'LIKE pattern must not end with escape character' not in str(e): raise e has_dandling_escape = True pg_pattern = r + '\\' for l in data: # print() # print(' '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern))) cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern)) is_like, = cursor.fetchone() assert type(is_like) is bool if not is_like and has_dandling_escape: pattern_without_escaped_dandling_escape = pg_pattern[:-2] cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape)) is_like_without_dangling_escape, = cursor.fetchone() assert type(is_like_without_dangling_escape) is bool else: is_like_without_dangling_escape = False assert '"' not in l assert '"' not in r print('(r"%s", r"%s", %s),' % ( l, r, str(is_like).lower(), # str(has_dandling_escape).lower(), # str(is_like_without_dangling_escape).lower(), )) ```

Which issue does this PR close?

Fixes https://github.com/apache/arrow-rs/issues/6702

Rationale for this change

Correctness fix for LIKE.

What changes are included in this PR?

Correctness fix for LIKE.

Are there any user-facing changes?

Yes

findepi commented 1 week ago

@alamb @tustvold please take a look

cc @samuelcolvin @Dandandan @ozankabak

tustvold commented 1 week ago

Please can you refrain from repeatedly tagging people on your PRs, we will get to your PRs when we have time, but spamming people needlessly is just rude

findepi commented 1 week ago

Apologies if i tagged you (or anyone else) repeatedly on a single PR, especially if it was in a short period of time.

findepi commented 1 week ago

Rebased to resolve conflict with https://github.com/apache/arrow-rs/pull/6662

alamb commented 1 week ago

Benchmark results (basically no change in my opinion):

++ critcmp master findepi_fix-like-with-escapes-792c56
group                                              findepi_fix-like-with-escapes-792c56    master
-----                                              ------------------------------------    ------
ilike_utf8 scalar complex                          1.00      2.6±0.07ms        ? ?/sec     1.01      2.7±0.06ms        ? ?/sec
ilike_utf8 scalar contains                         1.03      4.3±0.05ms        ? ?/sec     1.00      4.1±0.05ms        ? ?/sec
ilike_utf8 scalar ends with                        1.00  1202.7±35.59µs        ? ?/sec     1.03  1236.9±23.92µs        ? ?/sec
ilike_utf8 scalar equals                           1.00   747.4±16.29µs        ? ?/sec     1.05   784.2±23.50µs        ? ?/sec
ilike_utf8 scalar starts with                      1.00  1102.1±30.61µs        ? ?/sec     1.02  1127.3±28.58µs        ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4])    1.00     77.6±0.08µs        ? ?/sec     1.00     77.7±0.10µs        ? ?/sec
like_utf8 scalar complex                           1.00  1845.0±21.89µs        ? ?/sec     1.02  1889.4±18.40µs        ? ?/sec
like_utf8 scalar contains                          1.00  1549.6±14.84µs        ? ?/sec     1.01  1570.9±13.61µs        ? ?/sec
like_utf8 scalar ends with                         1.00   410.3±14.51µs        ? ?/sec     1.02   420.4±10.65µs        ? ?/sec
like_utf8 scalar equals                            1.00    127.0±0.52µs        ? ?/sec     1.00    127.0±0.27µs        ? ?/sec
like_utf8 scalar starts with                       1.00   336.7±19.19µs        ? ?/sec     1.01    339.4±5.83µs        ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4])     1.00     77.5±0.12µs        ? ?/sec     1.00     77.4±0.05µs        ? ?/sec
like_utf8view scalar complex                       1.00    182.4±0.72ms        ? ?/sec     1.00    183.1±0.44ms        ? ?/sec
like_utf8view scalar contains                      1.03    142.5±0.23ms        ? ?/sec     1.00    138.8±0.22ms        ? ?/sec
like_utf8view scalar ends with 13 bytes            1.00     44.3±0.38ms        ? ?/sec     1.08     47.8±0.10ms        ? ?/sec
like_utf8view scalar ends with 4 bytes             1.00     45.0±0.18ms        ? ?/sec     1.09     48.9±0.18ms        ? ?/sec
like_utf8view scalar ends with 6 bytes             1.00     44.8±0.13ms        ? ?/sec     1.08     48.6±0.17ms        ? ?/sec
like_utf8view scalar equals                        1.00     35.8±0.16ms        ? ?/sec     1.01     36.0±0.07ms        ? ?/sec
like_utf8view scalar starts with 13 bytes          1.01     47.8±0.15ms        ? ?/sec     1.00     47.3±0.16ms        ? ?/sec
like_utf8view scalar starts with 4 bytes           1.00     34.3±0.09ms        ? ?/sec     1.00     34.3±0.10ms        ? ?/sec
like_utf8view scalar starts with 6 bytes           1.01     48.2±0.20ms        ? ?/sec     1.00     47.8±0.15ms        ? ?/sec
nilike_utf8 scalar complex                         1.00      2.6±0.06ms        ? ?/sec     1.02      2.7±0.07ms        ? ?/sec
nilike_utf8 scalar contains                        1.03      4.2±0.06ms        ? ?/sec     1.00      4.1±0.05ms        ? ?/sec
nilike_utf8 scalar ends with                       1.00  1189.1±24.87µs        ? ?/sec     1.03  1229.4±22.89µs        ? ?/sec
nilike_utf8 scalar equals                          1.00   755.0±17.17µs        ? ?/sec     1.04   787.2±27.12µs        ? ?/sec
nilike_utf8 scalar starts with                     1.00  1124.4±45.36µs        ? ?/sec     1.01  1136.6±33.68µs        ? ?/sec
nlike_utf8 scalar complex                          1.00  1855.2±39.69µs        ? ?/sec     1.02  1887.3±29.24µs        ? ?/sec
nlike_utf8 scalar contains                         1.00  1548.3±12.27µs        ? ?/sec     1.01  1568.2±12.93µs        ? ?/sec
nlike_utf8 scalar ends with                        1.00   411.8±17.08µs        ? ?/sec     1.02    420.5±7.53µs        ? ?/sec
nlike_utf8 scalar equals                           1.00    126.7±0.14µs        ? ?/sec     1.00    127.0±0.39µs        ? ?/sec
nlike_utf8 scalar starts with                      1.00    328.4±5.42µs        ? ?/sec     1.05   345.2±13.53µs        ? ?/sec
++ popd