apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.08k stars 388 forks source link

[VL] Results are mismatch with vanilla Spark when use regexp_replace('a{bc', '\\{', '\\[') #6224

Open kecookier opened 2 weeks ago

kecookier commented 2 weeks ago

Backend

VL (Velox)

Bug description

spark-sql> set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding;
spark-sql> select regexp_replace('a{bc', '\\{', '\\[');

vanilla get
a[bc

gluten get
abc

Spark version

Spark-3.2.x

Spark configurations

No response

System information

Velox System Info v0.0.2 Commit: 3a459abad5af1a8cc268c349bc3813249511768c CMake Version: 3.22.0 System: Linux-3.10.0-862.mt20190308.130.el7.x86_64 Arch: x86_64 CPU Name: Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz C++ Compiler: /opt/rh/devtoolset-9/root/usr/bin/c++ C++ Compiler Version: 9.3.1 C Compiler: /opt/rh/devtoolset-9/root/usr/bin/cc C Compiler Version: 9.3.1 CMake Prefix Path: /usr/local;/usr;/;/usr/local/cmake;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

gaoyangxiaozhu commented 2 weeks ago

thanks @kecookier do you will help fix the issue ?

kecookier commented 2 weeks ago

thanks @kecookier do you will help fix the issue ?

Yes, I will work on this issue.

wang-zhun commented 1 week ago

@kecookier I also encountered this issue, the reason is differences in handling \\

present situation
--- Query
with tb as (
select
    name
FROM Values
    ('12234') people (name)
)
select regexp_replace(name, '22', '\\|') c1, 'gluten-spark' c2 from (select /*+ repartition(2) */ * from tb)
union
select regexp_replace(name, '22', '\\|') c1, 'vanilla-spark' c2 from tb

-- Result
c1            c2
1|34      vanilla-spark
134       gluten-spark
Gluten Velox RE2
1. code
#include <iostream>
#include <re2/re2.h>

int main() {
    std::string input = "12234";
    RE2 pattern("22");
    std::string replacement = "\\|";
    int replacements = RE2::GlobalReplace(&input, pattern, replacement);
    std::cout << "Modified string: " << input << std::endl;
    return 0;
}

2. output:
re2/re2.cc:1059: invalid rewrite pattern: \|
Modified string: 134

3. compile:
g++ -o example example.cpp -L/usr/local/lib -lre2 -Wl,-rpath,/usr/local/lib
Spark java.util.regex
1. code
    import java.util.regex.Matcher
    import java.util.regex.Pattern

    val input = "12234"
    val pattern: Pattern = Pattern.compile("22")
    val replacement = "\\|"
    val matcher: Matcher = pattern.matcher(input)
    val result = matcher.replaceAll(replacement)
    println(s"Modified string: $result")

2. output:
Modified string: 1|34
kecookier commented 1 week ago

@wang-zhun Sorry for the late reply, and thank you for the information. I have a work-in-progress PR that will fix this issue.

PHILO-HE commented 1 week ago

Posting code where re2 error happens for replacement string with "\\": https://github.com/google/re2/blob/6dcd83d60f7944926bfd308cc13979fc53dd69ca/re2/re2.cc#L1053.

While java.util.regex.Matcher views characters after "\\" as literal, i.e., using what it is after "\\".

Can we just revise the replacement string to let Re2 do the correct replacement?