AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Error occurred during lineage processing for ExcelPlugin #652

Closed jinmu0410 closed 1 year ago

jinmu0410 commented 1 year ago
java.lang.NoSuchFieldException: org.apache.hadoop.hdfs.client.HdfsDataInputStream.file
    at za.co.absa.commons.reflect.ValueExtractor.$anonfun$extract$2(ValueExtractor.scala:39)
    at scala.Option.getOrElse(Option.scala:189)
    at za.co.absa.commons.reflect.ValueExtractor.extract(ValueExtractor.scala:39)
    at za.co.absa.commons.reflect.ReflectionUtils$.extractValue(ReflectionUtils.scala:140)
    at za.co.absa.commons.reflect.ReflectionUtils$.extractFieldValue(ReflectionUtils.scala:116)
    at za.co.absa.commons.reflect.ReflectionUtils$.extractValue(ReflectionUtils.scala:146)
    at za.co.absa.spline.harvester.plugin.embedded.ExcelPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(ExcelPlugin.scala:46)
    at za.co.absa.spline.harvester.plugin.embedded.ExcelPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(ExcelPlugin.scala:41)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.plugin.embedded.ElasticSearchPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(ElasticSearchPlugin.scala:39)
    at za.co.absa.spline.harvester.plugin.embedded.ElasticSearchPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(ElasticSearchPlugin.scala:39)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.plugin.embedded.CobrixPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(CobrixPlugin.scala:34)
    at za.co.absa.spline.harvester.plugin.embedded.CobrixPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(CobrixPlugin.scala:34)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.plugin.embedded.CassandraPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(CassandraPlugin.scala:38)
    at za.co.absa.spline.harvester.plugin.embedded.CassandraPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(CassandraPlugin.scala:38)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.plugin.embedded.BigQueryPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(BigQueryPlugin.scala:50)
    at za.co.absa.spline.harvester.plugin.embedded.BigQueryPlugin$$anonfun$baseRelationProcessor$1.applyOrElse(BigQueryPlugin.scala:50)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.plugin.composite.LogicalRelationPlugin$$anonfun$1.applyOrElse(LogicalRelationPlugin.scala:37)
    at za.co.absa.spline.harvester.plugin.composite.LogicalRelationPlugin$$anonfun$1.applyOrElse(LogicalRelationPlugin.scala:34)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at za.co.absa.spline.harvester.plugin.embedded.SQLPlugin$$anonfun$1.applyOrElse(SQLPlugin.scala:48)
    at za.co.absa.spline.harvester.plugin.embedded.SQLPlugin$$anonfun$1.applyOrElse(SQLPlugin.scala:48)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.plugin.embedded.DataSourceV2Plugin$$anonfun$1.applyOrElse(DataSourceV2Plugin.scala:43)
    at za.co.absa.spline.harvester.plugin.embedded.DataSourceV2Plugin$$anonfun$1.applyOrElse(DataSourceV2Plugin.scala:43)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172)
    at za.co.absa.spline.harvester.builder.read.PluggableReadCommandExtractor$$anonfun$1.applyOrElse(PluggableReadCommandExtractor.scala:48)
    at za.co.absa.spline.harvester.builder.read.PluggableReadCommandExtractor$$anonfun$1.applyOrElse(PluggableReadCommandExtractor.scala:46)
    at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228)
    at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224)
    at scala.PartialFunction$.condOpt(PartialFunction.scala:292)
    at za.co.absa.spline.harvester.builder.read.PluggableReadCommandExtractor.asReadCommand(PluggableReadCommandExtractor.scala:46)
    at za.co.absa.spline.harvester.LineageHarvester.createOperationBuilder(LineageHarvester.scala:191)
    at za.co.absa.spline.harvester.LineageHarvester.$anonfun$createOperationBuildersRecursively$1(LineageHarvester.scala:167)
    at scala.Option.getOrElse(Option.scala:189)
    at za.co.absa.spline.harvester.LineageHarvester.traverseAndCollect$1(LineageHarvester.scala:167)
    at za.co.absa.spline.harvester.LineageHarvester.createOperationBuildersRecursively(LineageHarvester.scala:186)
    at za.co.absa.spline.harvester.LineageHarvester.$anonfun$harvest$4(LineageHarvester.scala:63)
    at scala.Option.flatMap(Option.scala:271)
    at za.co.absa.spline.harvester.LineageHarvester.harvest(LineageHarvester.scala:61)
    at za.co.absa.spline.agent.SplineAgent$$anon$1.$anonfun$handle$1(SplineAgent.scala:91)
    at za.co.absa.spline.agent.SplineAgent$$anon$1.withErrorHandling(SplineAgent.scala:100)
    at za.co.absa.spline.agent.SplineAgent$$anon$1.handle(SplineAgent.scala:72)
    at za.co.absa.spline.harvester.listener.QueryExecutionListenerDelegate.onSuccess(QueryExecutionListenerDelegate.scala:28)
    at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.$anonfun$onSuccess$1(SplineQueryExecutionListener.scala:41)
    at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.$anonfun$onSuccess$1$adapted(SplineQueryExecutionListener.scala:41)
    at scala.Option.foreach(Option.scala:407)
    at za.co.absa.spline.harvester.listener.SplineQueryExecutionListener.onSuccess(SplineQueryExecutionListener.scala:41)
    at org.apache.spark.sql.util.ExecutionListenerBus.doPostEvent(QueryExecutionListener.scala:165)
    at org.apache.spark.sql.util.ExecutionListenerBus.doPostEvent(QueryExecutionListener.scala:135)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.sql.util.ExecutionListenerBus.postToAll(QueryExecutionListener.scala:135)
    at org.apache.spark.sql.util.ExecutionListenerBus.onOtherEvent(QueryExecutionListener.scala:147)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
23/04/17 14:30:57 INFO KafkaProducer: [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.

jinmu0410 commented 1 year ago

The Excel path looks like hdfs://lake-node1:8020/jinmu/test/test_simple.xlsx

cerveada commented 1 year ago

What versions of Spark and Spline Agent were used?

jinmu0410 commented 1 year ago

@cerveada Spline Agent 1.0

jinmu0410 commented 1 year ago

Spark 3.3.1

jinmu0410 commented 1 year ago

A local path like file:///Users/jinmu/Downloads/test_simple.xlsx works fine, but an hdfs://..... path fails with the error above.
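
For context, the top frames of the trace show a by-name reflective field lookup (ReflectionUtils.extractFieldValue) failing because HdfsDataInputStream has no field called "file". Below is a minimal sketch of that failure mode. It uses hypothetical stand-in classes (LocalLikeStream, HdfsLikeStream, FieldByName), not the real Hadoop or Spline code, and it assumes that the stream used for local files happens to declare such a field:

```scala
import scala.util.Try

// Hypothetical stand-ins for two stream implementations; NOT the real Hadoop classes.
class LocalLikeStream(path: String) {
  private val file: String = path // this class declares a field named "file"
}

class HdfsLikeStream(path: String) {
  private val src: String = path // same information, but under a different field name
}

object FieldByName {
  // Read a declared field by name, the way a by-name reflective extractor would.
  def extract(target: AnyRef, name: String): Try[AnyRef] = Try {
    val field = target.getClass.getDeclaredField(name) // throws NoSuchFieldException if absent
    field.setAccessible(true)
    field.get(target)
  }
}

object Demo extends App {
  // Succeeds: the stand-in for the local stream has a "file" field.
  println(FieldByName.extract(new LocalLikeStream("file:///tmp/test.xlsx"), "file"))
  // Fails: the stand-in for the HDFS stream does not.
  println(FieldByName.extract(new HdfsLikeStream("hdfs://host/test.xlsx"), "file"))
}
```

Because the lookup depends on a private field name of whichever stream class is in use at runtime, the same read can work for one file system and fail for another.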

cerveada commented 1 year ago

That is what I thought. I will try to reproduce the issue and fix it.

jinmu0410 commented 1 year ago

thanks

cerveada commented 1 year ago

@jinmu0410 I was able to reproduce the issue. Unfortunately, the URL the agent needs is an argument of a lambda expression, and I don't know how to extract it. Working that out would take more time than I have right now.

However, spark-excel also supports Spark's Data Source V2 (DSV2) API, which should work out of the box. I added some tests and even tested it on HDFS, and it worked fine. So I recommend using DSV2, which should fix the lineage issue as well.

See: https://github.com/crealytics/spark-excel#excel-api-based-on-datasourcev2
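
A minimal sketch of what switching to DSV2 might look like, assuming spark-excel and the Spline agent are already on the classpath and the agent is enabled through your usual Spark configuration. According to the spark-excel README linked above, the V2 implementation is selected with the short format name "excel" instead of "com.crealytics.spark.excel". The object name and the "header" option value are illustrative choices, and the path is the one reported earlier in this thread:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative job; ExcelDsv2Read is a hypothetical name.
object ExcelDsv2Read {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("excel-dsv2-read")
      .getOrCreate()

    // "excel" selects spark-excel's DataSource V2 reader;
    // the V1 reader is addressed as "com.crealytics.spark.excel".
    val df = spark.read
      .format("excel")
      .option("header", "true") // assumes the first row holds column names
      .load("hdfs://lake-node1:8020/jinmu/test/test_simple.xlsx")

    df.show()
    spark.stop()
  }
}
```

Only the format name changes compared with a V1 read. Check the README for how other options behave under V2.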

jinmu0410 commented 1 year ago

OK, thank you. I will try it.