AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Should Spline capture Hive Warehouse Connector activity #34

Closed · dwarry closed this 2 years ago

dwarry commented 4 years ago

Background

The old HiveContext and related classes have been deprecated since Spark 2.0 in favour of the new Hive Warehouse Connector (HWC), which seems to be necessary for interacting with LLAP.

Question

Should Spline be capturing lineage operations performed through the HWC? At the moment it doesn't seem to.

I put together a minimal example: a PySpark job that reads a CSV file into a dataframe and then saves it both as a JSON file and into a Hive table. The Spline lineage only shows the data being written to the JSON file.

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
import pyspark.sql.types as T

spark = SparkSession.builder.appName("SAC-Test").enableHiveSupport().getOrCreate()

sc = spark.sparkContext

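# Enable Spline lineage tracking programmatically through the Py4J gateway
# (the agent is a JVM library, hence the _jvm call).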
sc._jvm.za.co.absa.spline.harvester \
    .SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)

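# Build a Hive Warehouse Connector session; reads and writes through it go to Hive via LLAP.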
hive = HiveWarehouseSession.session(spark).build()

hive.setDatabase("sactest")

schema = T.StructType()
schema.add(T.StructField("col1", T.IntegerType(), False))
schema.add(T.StructField("col2", T.StringType(), False))
schema.add(T.StructField("col3", T.DateType(), False))

df = spark.read.csv("/tmp/test_data.csv", schema, header=True, quote="'")

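# Write 1: through the Hive Warehouse Connector (a DataSourceV2 source);
# this write does not appear in the Spline lineage.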
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sac_test").save()

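# Write 2: a plain JSON file write; this is the only write Spline captures.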
df.write.json("/tmp/spline_test.json", "overwrite")

which is launched by

hdfs dfs -put -f test_data.csv /tmp

HWC_JAR=local:/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar 
HWC_PY=local:/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
SPLINE_JAR=/home/mario/spline/spark-agent-bundle-2.3-0.4.1.jar
SPLINE_URL=http://$HOSTNAME:8079/spline-rest-gateway/producer

spark-submit --master yarn \
             --deploy-mode client \
             --jars $HWC_JAR,$SPLINE_JAR \
             --driver-java-options -Dspline.producer.url=$SPLINE_URL \
             --py-files $HWC_PY \
             load_test_data.py
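As an aside, newer versions of the Spline agent also support codeless initialization through standard Spark configuration, which would make the SparkLineageInitializer call in the script above unnecessary. A minimal sketch, assuming an agent version where the listener class below and the spark.spline config prefix apply (on 0.4.x, as used here, the URL is passed as -Dspline.producer.url instead, as in the launch command above):

from pyspark.sql import SparkSession

# Codeless initialization sketch: register the Spline listener via Spark conf
# instead of calling SparkLineageInitializer through the Py4J gateway.
spark = (SparkSession.builder
         .appName("SAC-Test")
         .enableHiveSupport()
         .config("spark.sql.queryExecutionListeners",
                 "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
         # key name assumed for newer agents; check the agent docs for your version
         .config("spark.spline.producer.url", "http://localhost:8079/producer")
         .getOrCreate())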

The lineage it captures is:

[screenshot: lineage graph showing only the write to the JSON file]

Zejnilovic commented 4 years ago

Hello @dwarry, this might be a stupid question from me, but did you try to expand the SAC-Test node? Right-click on it, if I remember correctly.

wajda commented 4 years ago

As I understand it, of the two writes only the JSON one is captured. We'll take a look at this. Thanks.

dwarry commented 4 years ago

@wajda Yep, precisely that. Thanks.

My testing has been on an out-of-the-box install of HDP 3.1.4 (kerberized), so Spark 2.3.2 and Hive 3.1.0.

Please let me know if there's anything else you need.

wajda commented 4 years ago

Ok, so it all boils down to supporting Data Source V2. We'll do it eventually, but it's not a priority for us at the moment.
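To illustrate why: the HWC source is a DataSourceV2 implementation, so its writes go down a plan path the agent doesn't currently walk. A minimal sketch to confirm this from PySpark; it reuses the spark session and the pyspark_llap import from the script above, assumes the HWC jar is on the driver classpath, and the interface name shown is from Spark 2.3:

from pyspark_llap import HiveWarehouseSession

# Look up the connector class through the Py4J gateway and list its interfaces;
# org.apache.spark.sql.sources.v2.DataSourceV2 is expected to be among them.
jclass = spark._jvm.java.lang.Class.forName(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
print([i.getName() for i in jclass.getInterfaces()])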

wajda commented 4 years ago

todo: test it when AbsaOSS/spline#600 is implemented

wajda commented 4 years ago

Update: it doesn't seem to be possible to add support for DataSourceV2 in general, so HWC capturing will have to be solved specifically for this type of usage.

wajda commented 4 years ago

I have yet to find a normal, non-fat Maven artifact providing com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceWriter that could be included in the integration-tests module.

So far I have only been able to find it in hive-warehouse-connector.jar, which is a fat jar and isn't suitable for use as a dependency.

Source: https://github.com/hortonworks-spark/spark-llap/blob/branch-2.3-3.0/src/main/java/com/hortonworks/spark/sql/hive/llap/HiveWarehouseDataSourceWriter.java

wajda commented 2 years ago

@dwarry, sorry for the long delay. Is this issue still valid?

cerveada commented 2 years ago

The Hortonworks Data Platform is already EOL software: https://endoflife.software/applications/big-data/hortonworks-data-platform-hdp

No new versions are coming, since the company was acquired by Cloudera.

wajda commented 2 years ago

Ok, so I'm closing the issue as "won't fix". If you find any other similar issue related to up-to-date software, don't hesitate to let us know.