Should Spline capture Hive Warehouse Connector activity

dwarry commented 4 years ago

Background

The old HiveContext etc. classes have been deprecated since Spark 2.0, in favour of the new Hive Warehouse Connector, which seems to be necessary to interact with LLAP.

Question

Should Spline be capturing lineage operations performed through the HWC? At the moment it doesn't seem to.

I put together a minimal example which has a PySpark job that just reads from a csv file into a dataframe and then saves that as a json file, and into a Hive table. The Spline lineage only shows the data being written to the json file.

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
import pyspark.sql.types as T

spark = SparkSession.builder.appName("SAC-Test").enableHiveSupport().getOrCreate()

sc = spark.sparkContext

sc._jvm.za.co.absa.spline.harvester \
    .SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)

hive = HiveWarehouseSession.session(spark).build()

hive.setDatabase("sactest")

schema = T.StructType()
schema.add(T.StructField("col1", T.IntegerType(), False))
schema.add(T.StructField("col2", T.StringType(), False))
schema.add(T.StructField("col3", T.DateType(), False))

df = spark.read.csv("/tmp/test_data.csv", schema, header=True, quote="'")

df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table", "sac_test").save()

df.write.json("/tmp/spline_test.json", "overwrite")

which is launched by

hdfs dfs -put -f test_data.csv /tmp

HWC_JAR=local:/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar 
HWC_PY=local:/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
SPLINE_JAR=/home/mario/spline/spark-agent-bundle-2.3-0.4.1.jar
SPLINE_URL=http://$HOSTNAME:8079/spline-rest-gateway/producer

spark-submit --master yarn \
             --deploy-mode client \
             --jars $HWC_JAR,$SPLINE_JAR \
             --driver-java-options -Dspline.producer.url=$SPLINE_URL \
             --py-files $HWC_PY \
             load_test_data.py

The lineage it captures is

Zejnilovic commented 4 years ago

Hello @dwarry, this might be a stupid question from me, but did you try to expand SAC-Test node? Right-click and it shows if I remember correctly.