databricks / spark-avro

Avro Data Source for Apache Spark
http://databricks.com/
Apache License 2.0

spark 2.1.0: NullPointerException in Java while reading an Avro file as a DataFrame with Kryo enabled as serializer #289

Closed erparas closed 5 years ago

erparas commented 6 years ago

Hi team,

I am trying to read an Avro data file as a Spark DataFrame, but it throws a NullPointerException. I have enabled Kryo as the serializer; details below:

Code snippet:

Dataset<Row> table = sparkSessionObject.read().format("com.databricks.spark.avro").load("/tmp/table");
table.show();

Note: when I use the JavaSerializer, it works fine.

Jar version details:

spark-sql_2.11-2.1.0.cloudera1.jar
spark-avro_2.11-3.2.0.jar
kryo-shaded-3.0.3.jar
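For context, a minimal sketch of how such a session might be configured to hit this path (the app name is a placeholder and the master/path settings are assumptions, not taken from the original report):

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object AvroKryoRepro {
  def main(args: Array[String]): Unit = {
    // Enable Kryo as the serializer -- the setting under which the NPE was observed
    val spark = SparkSession.builder()
      .appName("avro-kryo-repro") // placeholder name
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Read the Avro file through the databricks data source, as in the report
    val table: Dataset[Row] = spark.read
      .format("com.databricks.spark.avro")
      .load("/tmp/table")
    table.show()

    spark.stop()
  }
}
```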

Stacktrace :

Caused by: java.lang.NullPointerException
    at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:170)
    at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:160)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Could you please help me out with this?

mxhdev commented 5 years ago

In case you still have this problem, or someone else encounters it: I've had the same issue while using Spark 2 with spark-avro_2.11:3.0.0. However, when I switched to 4.0.0, the issue was fixed. I started the spark2-shell like this:

spark2-shell --packages com.databricks:spark-avro_2.11:4.0.0

Running the following code then works without problems:

import com.databricks.spark.avro._
spark.read.avro("PATH_TO_AVROFILES").show()
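If you pull the library in as a build dependency rather than via --packages, the equivalent coordinate (shown here as an sbt fragment, purely as an illustration) would be:

```scala
// sbt build definition -- the same artifact the --packages flag resolves
libraryDependencies += "com.databricks" % "spark-avro_2.11" % "4.0.0"
```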
erparas commented 5 years ago

Yes, the newer version of spark-avro jar did resolve the issue.

Thanks.