AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Support for AWS Glue DynamicFrame #781

Open catworlddomination opened 5 months ago

catworlddomination commented 5 months ago

When adding the Spline agent bundle to an AWS Glue Python script (Spark 3.3, Python 3), lineage is produced as expected when using standard patterns like df = spark.read.csv(file_path, header=True, inferSchema=True) and df.write...
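For reference, the working pattern looks roughly like this (a minimal sketch; the S3 paths are placeholders, not from the actual job):

# Plain Spark DataFrame read/write - this is the pattern the Spline agent
# can intercept, because the write goes through a Spark SQL execution plan.
df = spark.read.csv("s3://my-bucket/input/data.csv", header=True, inferSchema=True)
df.write.mode("overwrite").csv("s3://my-bucket/output/")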

However, AWS Glue also has a concept of DynamicFrames, whose usage looks something like this:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Amazon S3 (source: read CSV from S3 into a DynamicFrame)
f0 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": -1,
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={
    ## Replace with the S3 input path (including the file name)
        "paths": ["s3://...."],
        "recurse": True,
    },
    transformation_ctx="f0",
)

# Script generated for node Amazon S3 (sink: write the DynamicFrame back to S3 as CSV)
f1 = glueContext.write_dynamic_frame.from_options(
    frame=f0,
    connection_type="s3",
    format="csv",
    connection_options={
    ## Replace with the S3 output path
        "path": "s3://....",
        "partitionKeys": [],
    },
    transformation_ctx="f1",
)

job.commit()

Can Spline support this DynamicFrame pattern in AWS Glue? I used the spark-3.3-spline-agent-bundle_2.12-2.0.0.jar bundle - the Spline agent initialized successfully, but no lineage was produced.

wajda commented 5 months ago

I didn't try it specifically, but judging from the AWS doc on DynamicFrame there is a chance that it would not work. The crucial thing for the Spline agent is the existence of an internal Spark write event that the agent can intercept and grab the execution plan from. That only exists in the Spark SQL API, i.e. the DataFrame. For instance, RDD lineage isn't supported for that very reason - Spark doesn't provide any logical plan usable for lineage purposes on RDD operations. I don't know how exactly DynamicFrame is implemented (it's closed source), so it's unclear whether DynamicFrame operations eventually translate to DataFrame ones or not. If they don't, Spline has no way to track them.
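If the translation to DataFrame operations doesn't happen inside Glue, one thing that might be worth trying (an unverified assumption, not a confirmed fix) is converting the DynamicFrame to a plain DataFrame yourself and writing through the Spark SQL API, so that a write event with a logical plan exists for the agent to intercept:

# Hypothetical workaround: write via the DataFrame API instead of
# write_dynamic_frame, so an interceptable Spark SQL write event exists.
# Whether the S3 *read* side is still captured depends on how Glue
# implements create_dynamic_frame internally.
df = f0.toDF()                               # DynamicFrame -> Spark DataFrame
df.write.mode("overwrite").csv("s3://....")  # same elided output path as above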

wajda commented 5 months ago

Try looking carefully at the Spark driver's debug logs. If the Spline agent is notified of Glue write events, there should be messages. See:
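One quick way to get those messages into the Glue driver log (just a sketch) is to raise the log level from the script itself:

# Raise the Spark driver log level so the agent's debug output appears
# in the driver (CloudWatch) logs.
sc.setLogLevel("DEBUG")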