G-Research / spark-extension

A library that provides useful extensions to Apache Spark and PySpark.
Apache License 2.0

Pyspark - Import Error #231

Closed: VinothKanna007 closed this issue 7 months ago

VinothKanna007 commented 7 months ago

Error:

Invalid syntax error while running the import.

FYI:

Spark Version: 3.1.1-amzn-0

Python Version - 3.6.10 | Anaconda, Inc.

Jar: uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5

[Attached screenshot: IMG_20240128_162821]

VinothKanna007 commented 7 months ago

I have now switched to a different version of the jar.

Now I'm getting an error in the diff_with_options method. The full stack trace is attached below.

Spark Version: 3.1.1-amzn-0

Python Version - 3.6.10 | Anaconda, Inc.

Jar: uk.co.gresearch.spark:spark-extension_2.13:2.7.0-3.4

[Attached screenshot: IMG_20240128_170826]

[Attached screenshot: IMG_20240128_170803]

EnricoMi commented 7 months ago

The invalid syntax error reported in the description is due to using unsupported Python 3.6. Please use Python 3.7 or above.

The NoClassDefFoundError is due to using the Scala 2.13 version with PySpark, which uses Scala 2.12. Please use spark-extension_2.12 instead.
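The two checks above can be sketched in a few lines of Python. This is only an illustration: the helper names `python_supported` and `build_coordinate` are hypothetical and not part of spark-extension. The coordinate layout is inferred from the jar names quoted in this thread, and the sketch does not verify that a given version combination is actually published.

```python
import sys

# Group and artifact names as they appear in the jar coordinates quoted above.
GROUP_ID = "uk.co.gresearch.spark"
ARTIFACT = "spark-extension"


def python_supported(version_info=sys.version_info):
    """Return True for Python 3.7 or above (hypothetical helper)."""
    return tuple(version_info[:2]) >= (3, 7)


def build_coordinate(scala_binary, library_version, spark_compat):
    """Assemble a Maven coordinate such as
    uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5 (hypothetical helper).

    scala_binary must match the Scala version of the Spark build in use;
    per the comment above, that is 2.12 for PySpark.
    """
    return f"{GROUP_ID}:{ARTIFACT}_{scala_binary}:{library_version}-{spark_compat}"
```

The resulting string is what would be passed to Spark's `spark.jars.packages` setting or the `--packages` command-line option.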

VinothKanna007 commented 7 months ago

Thanks! It works.

One more question: is there an option to omit matching values from the output? I'm not interested in matching records.

Basically, I only want the records whose difference exceeds the epsilon value.

Reason: I checked the query plan (it contains a lot of CASE statements), and it takes a long time when I'm dealing with large DataFrames. I want to find only the mismatched records in minimal time.

EnricoMi commented 7 months ago

Sure, use the sparse mode: https://github.com/G-Research/spark-extension/blob/master/DIFF.md#sparse-mode
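For readers unfamiliar with the idea, here is a plain-Python sketch of the kind of result the question asks for: only rows with a difference above epsilon, and within those rows only the columns that differ. It is not how spark-extension implements its diff, and it does not call the library. The function name `mismatches_only` and the dict-based row layout are assumptions made for illustration, and rows present on only one side are ignored.

```python
def mismatches_only(left, right, epsilon=0.0):
    """Return {id: {column: (left_value, right_value)}} for differing values.

    left and right map a row id to a dict of column name -> numeric value.
    Both sides are assumed to share the same columns (illustrative layout).
    """
    changes = {}
    # Only ids present on both sides are compared in this sketch.
    for key in sorted(left.keys() & right.keys()):
        differing = {
            col: (left[key][col], right[key][col])
            for col in left[key]
            if abs(left[key][col] - right[key][col]) > epsilon
        }
        if differing:  # rows without any difference above epsilon are dropped
            changes[key] = differing
    return changes


left = {1: {"a": 1.0, "b": 2.0}, 2: {"a": 5.0, "b": 5.0}}
right = {1: {"a": 1.0, "b": 2.5}, 2: {"a": 5.0, "b": 5.0005}}
result = mismatches_only(left, right, epsilon=0.01)
```

Here row 2 is dropped because its only difference is below epsilon, and row 1 keeps just column `b`.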

VinothKanna007 commented 7 months ago

Cool. Thanks @EnricoMi

Btw this is a great package. Loved it♥️