AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Can Spline support lineage for AWS Glue DynamicFrames? #786

Closed rushabh1995 closed 5 months ago

rushabh1995 commented 5 months ago

Hello everyone. I'm using the Spline agent bundle (spark-3.3-spline-agent-bundle_2.12, version 2.0.0) to extract lineage from Glue jobs. Lineage is captured for Spark DataFrames but not for Glue DynamicFrames. Is there any functionality in the Spline JAR, or anything else, that would help capture lineage for Glue DynamicFrames?

We are currently using Glue 4.0 for our use case.

Code:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("CSV to DynamicFrame").getOrCreate()

# Read the CSV file into a DataFrame
df = spark.read.format("csv").option("header", "true").load("s3://test/Employee/london_emp.csv")

# Transform the DataFrame: raise the salary column by 10%
df_transformed = df.withColumn("salary", df["salary"] * 1.10)

# Create a GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Convert the Spark DataFrame to a DynamicFrame
dynamic_frame = DynamicFrame.fromDF(df_transformed, glueContext, "dynamic_frame")

# Write the DynamicFrame to S3
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://test/netflix"},
    format="parquet"
)
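
One possible workaround, sketched under the assumption stated above (lineage is captured for Spark DataFrame operations but not for the Glue DynamicFrame sink): convert the DynamicFrame back to a DataFrame and write it with Spark's own writer. The helper name `write_via_spark_writer` and the default `mode="append"` are hypothetical choices for illustration, not part of Spline or Glue, and this thread does not confirm that the agent picks up lineage for such a write.

```python
def write_via_spark_writer(dynamic_frame, path, fmt="parquet", mode="append"):
    """Hypothetical helper: write a Glue DynamicFrame through Spark's
    DataFrameWriter instead of glueContext.write_dynamic_frame."""
    # DynamicFrame.toDF() returns a regular Spark DataFrame.
    df = dynamic_frame.toDF()
    # The write now goes through the DataFrame API, which is the path
    # the report above says produces lineage.
    df.write.format(fmt).mode(mode).save(path)
    return df
```

In the job above it would replace the `glueContext.write_dynamic_frame.from_options(...)` call, for example `write_via_spark_writer(dynamic_frame, "s3://test/netflix")`. Any Glue-specific sink behavior would be bypassed by this approach.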

wajda commented 5 months ago

Closing as a duplicate of #781.