SparkCompare [PARSE_SYNTAX_ERROR] if column name contains unicode symbols

capitalone / datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!

https://capitalone.github.io/datacompy/

Apache License 2.0

SparkCompare [PARSE_SYNTAX_ERROR] if column name contains unicode symbols #280

Closed: kformanowicz-dotdata closed this issue 3 months ago

kformanowicz-dotdata commented 3 months ago

Script to reproduce:

from pyspark.sql import SparkSession
import datacompy

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    ["id", "例"]
)

df2 = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "baz"),
    ],
    ["id", "例"]
)

comp = datacompy.SparkCompare(spark, df1, df2, join_columns=["例"])
comp.report()

It seems that Unicode characters in column names are not escaped correctly when the SQL query for the comparison is built.
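For illustration, here is a minimal sketch of the kind of escaping that seems to be missing. Spark SQL accepts arbitrary characters in an identifier when it is wrapped in backticks, with any embedded backtick doubled. The helper names `quote_identifier` and `build_join_condition` and the table aliases `base` and `compare` are hypothetical and are not taken from datacompy; this is not the library's actual code or its fix.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks for Spark SQL.

    An embedded backtick is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def build_join_condition(join_columns, left="base", right="compare"):
    """Build an ON clause comparing each join column across two aliased tables.

    The aliases are placeholders chosen for this sketch.
    """
    return " AND ".join(
        f"{left}.{quote_identifier(col)} = {right}.{quote_identifier(col)}"
        for col in join_columns
    )


# Every column name is quoted, so non-ASCII names are no longer bare tokens.
print(build_join_condition(["id", "例"]))
```

Quoting every identifier unconditionally, rather than only the ones that look unusual, avoids having to guess which names the parser will reject.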

fdosani commented 3 months ago

Thanks for reporting. Will take a look at this shortly.

fdosani commented 3 months ago

@kformanowicz-dotdata I have a fix for our new pyspark implementation here.

I'll work on getting a fix into the legacy implementation as well. Just an FYI: the legacy Spark implementation will eventually be deprecated so that we can align on the new one mentioned above.