Closed leewalter78 closed 2 years ago
I know that not being able to reproduce this locally makes it nearly impossible to analyze, but I was wondering if anyone has seen similar behavior?
Since the C, D, and N types work, there is a workaround: do a left anti join on the id columns between the right DataFrame and the recon results where diff != I. Basically, remove the C, D, and N rows from the right DataFrame and you will be left with the inserts.
@leewalter78 Were you able to fix this?
@leewalter78 do you see the same issue? Can you provide steps for me to reproduce this? "On AWS" does not really help.
No, I was not able to resolve it. I ended up using a custom solution with set differences and then joining on the keys.
At the time, I uploaded the jar (spark-extension_2.11-1.3.3-2.4.jar) to an S3 bucket and referenced it from the Glue job under the Python lib path.
Then, I believe this was the sample snippet I had tried. (I forget all the details, but there were cases where this did not work; per my post, it was inserts where all the values would be NULL.)
from gresearch.spark.diff import *
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
left.diff(right, "id").show()
I believe the sample code above worked, but when I used my actual dataframes, the values for the inserts were all null.
I just ran that sample snippet on Glue 3.0 Python with spark-extension_2.12-2.1.0-3.1.jar
and it worked:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
from gresearch.spark.diff import *
left.diff(right, "id").show()
job.commit()
+----+---+----------+-----------+
|diff| id|left_value|right_value|
+----+---+----------+-----------+
| D| 3| three| null|
| N| 1| one| one|
| C| 2| two| Two|
| I| 4| null| four|
+----+---+----------+-----------+
Can you please specify the exact Job details section, both Basic and Advanced Properties?
Here are the job details:
For what it's worth, I was able to get the same result as Enrico.
Looks like spark-extension works on AWS Glue and the reported issue is not reproducible. Closing for now.
I found this project when trying to compare dataframes using PySpark, and it appears to work great. I am seeing an issue when running it as part of an AWS Glue job with this jar: spark-extension_2.11-1.3.3-2.4.jar. Locally, it works fine and I cannot reproduce this issue.
If I create the dataframes using spark.createDataFrame() and do a diff, I get the results I expect. When I read the dataframes from a Glue context, after a diff, the insert columns are all null. I tried with and without unique identifiers, and I also reduced the number of rows and columns. I thought sparse mode might be causing this, so I tried both true and false for sparse mode.
I created a simple case where the left dataframe has 1 row and the right dataframe has 2. One is the same row, and the diff identifies that, and it identifies the insert as well, but it loses all the values.
The Delete and No Change rows are as expected, and currently I only see the issue with inserts. I have not tested Changes yet; I have been stuck on inserts.
Any insights on what might be causing this?
Sample output:
+----+---------+-----+----+----+----+--------------+----+----+----+----+
|diff|     col1| col2|col2|col4|col4|          col5|col6|col7|col8|col9|
+----+---------+-----+----+----+----+--------------+----+----+----+----+
|   N|123456789|ABCDE|Test| 670|   4|Description...|77.0|  FL|null|null|
|   I|     null| null|null|null|null|          null|null|null|null|null|
+----+---------+-----+----+----+----+--------------+----+----+----+----+
Note: I also tried 1.3.0 and the same thing is happening (spark-extension_2.11-1.3.0-2.4)