AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Is it possible to get lineage from a Spark execution plan? #659

Closed: liujiawen closed this issue 1 year ago

liujiawen commented 1 year ago

Background

I found that in my production environment it's hard to add a listener to the Spark shell, so I thought it might work to get the execution plans of Spark jobs from the history server instead. Is it possible to use an execution plan I downloaded rather than the Spark job listener? Thank you!
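For context, the agent is normally attached either "codelessly", by passing `spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener` (plus `spark.spline.producer.url`) to `spark-submit`, or programmatically. A minimal sketch of the programmatic route, assuming the agent bundle jar is already on the classpath and with a placeholder producer URL:

```scala
// Minimal sketch: programmatic Spline initialization, the alternative to
// the codeless spark.sql.queryExecutionListeners configuration. Assumes
// the Spline agent bundle is on the classpath; the producer URL is set
// separately (e.g. via --conf spark.spline.producer.url=...).
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

object EnableSpline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    // Registers the Spline listener on this session.
    spark.enableLineageTracking()
  }
}
```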

Feature

Add a feature that reads a Spark execution plan file, extracts the lineage, and sends the data to the Spline producer.
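A rough sketch of what the input side of such a reader could look like, assuming the file is a Spark history-server event log (one JSON-encoded listener event per line) and that plans are taken from `SparkListenerSQLExecutionStart` events. The object name is hypothetical and none of this exists in the agent today:

```scala
import scala.io.Source

// Illustrative sketch only: scan a history-server event log and keep the
// SQL execution start events. Note these events carry only a *string
// rendering* of the plan (e.g. "physicalPlanDescription"), not the live
// LogicalPlan objects the agent normally harvests from.
object EventLogPlanReader {
  private val SqlStartEvent =
    "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

  def sqlStartEvents(eventLogPath: String): List[String] = {
    val src = Source.fromFile(eventLogPath)
    try src.getLines().filter(_.contains(SqlStartEvent)).toList
    finally src.close()
  }

  def main(args: Array[String]): Unit =
    // A real implementation would JSON-parse each event, extract the plan
    // fields, and map them to the Spline producer API.
    sqlStartEvents(args(0)).foreach(println)
}
```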

Example [Optional]

An example logical plan (reformatted for readability):

```
InsertIntoHadoopFsRelationCommand s3://example/dw/ods/ods_example_table, [dt=2023-04-18], false, [dt#62], Parquet, [field.delim=, line.delim=, serialization.format=, partitionOverwriteMode=DYNAMIC, parquet.compression=SNAPPY, mergeSchema=false], Overwrite, CatalogTable(
Database: default
Table: ods_example_table
Owner: hdfs
Created Time: Wed Apr 19 16:03:44 CST 2023
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hive
Table Properties: [bucketing_version=2, parquet.compression=SNAPPY, serialization.null.format=, transient_lastDdlTime=1681891424]
Location: s3://example/dw/ods/ods_example_table
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=, line.delim=, field.delim=]
Partition Provider: Catalog
Partition Columns: [dt]
Schema: root
 |-- seller_id: string (nullable = true)
 |-- code: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- org_code: string (nullable = true)
 |-- name_cn: string (nullable = true)
 |-- name_en: string (nullable = true)
 |-- api: string (nullable = true)
 |-- create_time: string (nullable = true)
 |-- creater: string (nullable = true)
 |-- update_time: string (nullable = true)
 |-- operator: string (nullable = true)
 |-- status: string (nullable = true)
 |-- org_code_add: string (nullable = true)
 |-- user_name: string (nullable = true)
 |-- is_online: string (nullable = true)
 |-- dt: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@a03e7be2, [seller_id, code, country_code, org_code, name_cn, name_en, api, create_time, creater, update_time, operator, status, org_code_add, user_name, is_online, dt]
+- Project [seller_id#47, code#48, country_code#49, org_code#50, name_cn#51, name_en#52, api#53, create_time#54, creater#55, update_time#56, operator#57, status#58, org_code_add#59, user_name#60, is_online#61, cast(2023-04-18 as string) AS dt#62]
   +- Project [cast(seller_id#0 as string) AS seller_id#47, cast(code#1 as string) AS code#48, cast(country_code#2 as string) AS country_code#49, cast(org_code#3 as string) AS org_code#50, cast(name_cn#4 as string) AS name_cn#51, cast(name_en#5 as string) AS name_en#52, cast(api#6 as string) AS api#53, cast(create_time#7 as string) AS create_time#54, cast(creater#8 as string) AS creater#55, cast(update_time#9 as string) AS update_time#56, cast(operator#10 as string) AS operator#57, cast(status#11 as string) AS status#58, cast(org_code_add#12 as string) AS org_code_add#59, cast(user_name#13 as string) AS user_name#60, cast(is_online#14 as string) AS is_online#61]
      +- Project [seller_id#0, code#1, country_code#2, org_code#3, name_cn#4, name_en#5, api#6, create_time#7, creater#8, update_time#9, operator#10, status#11, org_code_add#12, user_name#13, is_online#14]
         +- SubqueryAlias spark_catalog.dbinit.ods_example_table
            +- HiveTableRelation [dbinit.ods_example_table, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [seller_id#0, code#1, country_code#2, org_code#3, name_cn#4, name_en#5, api#6, create_time#7, cre..., Partition Cols: []]
```

Proposed Solution [Optional]

wajda commented 1 year ago

No, the Spark Agent is meant to be used as a listener; the execution plan representation available from the Spark history server isn't enough on its own. Such functionality could, however, be implemented as a separate project, something like a Spline Agent for Spark History Server.
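The reason the listener matters: the agent harvests lineage from the live `QueryExecution` object passed to the listener callback, which exposes the actual analyzed `LogicalPlan` tree, whereas the history server only retains string renderings of plans. A minimal illustration of that listener contract (not the agent's actual implementation):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Minimal illustration: onSuccess receives the live QueryExecution, whose
// analyzed LogicalPlan is a real, traversable tree. A history-server event
// log only contains string renderings of such plans.
class PlanInspectingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    val plan = qe.analyzed // org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    println(s"$funcName ran ${plan.nodeName} over ${plan.children.size} child plan(s)")
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

object Register {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    spark.listenerManager.register(new PlanInspectingListener)
  }
}
```

Such a listener is hooked in either via `spark.listenerManager.register(...)` as above or via the `spark.sql.queryExecutionListeners` configuration, which is exactly how the Spline agent attaches itself, and why it must run inside the Spark application rather than against archived event logs.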

liujiawen commented 1 year ago

> No, the Spark Agent is meant to be used as a listener; the execution plan representation available from the Spark history server isn't enough on its own. Such functionality could, however, be implemented as a separate project, something like a Spline Agent for Spark History Server.

Thank you, wajda, for your instructive answer!