AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Spline agent can't read the configuration from hadoop configuration core-site.xml #725

Closed hugeshi closed 11 months ago

hugeshi commented 11 months ago

What's the issue:

The Spline agent does not pick up its configuration when I set the Spline properties in the Hadoop configuration file core-site.xml. When I set the same properties in spark-defaults.conf, the lineage data is generated.

I checked the Spark UI. The Hadoop folder and the Spline jar are both listed under Classpath Entries in the Environment view:

[Two screenshots: Classpath Entries]

But the Spline configuration (spark.sql.queryExecutionListeners) is not displayed in the Spark Properties section.

Please help advise here, thanks.

How to reproduce:

  1. Add the following properties to core-site.xml:

    <property>
     <name>spark.spline.lineageDispatcher.http.producer.url</name>
      <value>http://10.27.184.4:8080/producer</value>
    </property>
    
    <property>
      <name>spark.sql.queryExecutionListeners</name>
      <value>za.co.absa.spline.harvester.listener.SplineQueryExecutionListener</value>
    </property>
  2. Copy the spark-2.3-spline-agent-bundle_2.11-1.3.0-SNAPSHOT.jar to jars under spark home folder.
  3. Execute the command:
    export HADOOP_HOME=/usr/hdp/2.6.5.0-292/hadoop && export HADOOP_CONF_DIR=/home/pp_risk_grs_datamart_batch/hujshi/conf && /home/pp_risk_grs_datamart_batch/hujshi/spark/spark-2.3.1_2.11/bin/spark-submit \
    --class com.hujshi.TestApplication \
    --master yarn \
    --deploy-mode client \
    --driver-memory 2G \
    --driver-cores 2 \
    --executor-memory 8G \
    --executor-cores 2 \
    --num-executors 10 \
    --conf "spark.driver.extraJavaOptions=-Dhadoop=/home/pp_risk_grs_datamart_batch/hujshi/conf" \
    hdfs:///user/pp_risk_grs_datamart_batch/bsi_jars/spark-demo-1.0-SNAPSHOT.jar

Spline Agent Version

1.3.0-SNAPSHOT

wajda commented 11 months ago

I see two things that are done incorrectly in your example:

  1. Spline properties do not need the spark. prefix in core-site.xml (the prefix is only required when setting them in the Spark config):

    <property>
        <name>spline.lineageDispatcher.http.producer.url</name>
        <value>http://10.27.184.4:8080/producer</value>
    </property>
  2. spark.sql.queryExecutionListeners cannot be set in core-site.xml. core-site.xml is a Hadoop configuration file, used for Hadoop-specific settings. Hadoop reads this file and loads the properties it contains (making them available to Spline and other libraries), but Spark itself does not use this file for its own configuration. You need to set this property in spark-defaults.conf or via the --conf command-line argument.
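
As a rough sketch of that split, the listener registration could go into spark-defaults.conf like this, while the unprefixed Spline dispatcher URL stays in core-site.xml as in the snippet under point 1. The class name is the one from the report; the comment line is only for illustration.

```properties
# spark-defaults.conf: Spark reads this file, so the listener is registered here
spark.sql.queryExecutionListeners  za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
```

The command-line equivalent would be to add --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" to the spark-submit call.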

hugeshi commented 11 months ago

Thanks for your quick response. I will close the issue.