AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Checkpoint lineage support #546

Open tcluzhe opened 2 years ago

tcluzhe commented 2 years ago

With code like the following, Spline does not capture the input data source:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config('spark.sql.queryExecutionListeners', 'za.co.absa.spline.harvester.listener.SplineQueryExecutionListener')
    .config('spark.spline.producer.url', 'http://master-1-1:8080/producer')
    .enableHiveSupport()
    .getOrCreate()
)

def generate_data():
    # Create the source table that test() reads from.
    data = [
        ('a', 1),
        ('b', 2),
    ]

    df = spark.createDataFrame(data, ['name', 'value'])
    df.write.saveAsTable('test.table', mode='overwrite')

def test():
    spark.sparkContext.setCheckpointDir('/tmp/checkpoint')
    df = spark.table('test.table')
    df = df.checkpoint()  # after this checkpoint, Spline no longer reports test.table as an input
    df.write.saveAsTable('test.table2', mode='overwrite')

if __name__ == '__main__':
    # generate_data()
    test()

katerina-glushko commented 8 months ago

+1, we would definitely appreciate this feature too (either way, we're losing a whole chunk of lineage)