big-data-europe / docker-hadoop-spark-workbench

[EXPERIMENTAL] This repo includes deployment instructions for running HDFS/Spark inside docker containers. Also includes spark-notebook and HDFS FileBrowser.

Job aborted due to stage failure while reading a simple Text File from HDFS #49

Open radianv opened 6 years ago

radianv commented 6 years ago

I am working with spark-notebook, following the Scalable Spark/HDFS Workbench using Docker guide.

val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")

textFile: org.apache.spark.rdd.RDD[String] = /user/root/vannbehandlingsanlegg.csv MapPartitionsRDD[1] at textFile at <console>:67

It should show the execution time and the number of lines in the CSV file, but instead I got the following error:

cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

I have been searching and it seems this could be related to executor dependencies. Any ideas?
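
(Note: the `textFile` call alone succeeds because RDDs are evaluated lazily; the error above is only thrown once an action forces the data to be read on the executors. A minimal sketch of the failing sequence, reusing the path from the report:)

```scala
// Define the RDD; nothing is read from HDFS yet.
val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")

// The action launches the job. This is where the
// List$SerializationProxy error is raised when the executors
// deserialize the task against an incompatible Spark build.
textFile.count()
```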

radianv commented 6 years ago

As additional information, I did the same test connecting directly to the spark-master container and it worked well:

scala> val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
textFile: org.apache.spark.rdd.RDD[String] = /user/root/vannbehandlingsanlegg.csv MapPartitionsRDD[1] at textFile at <console>:24

scala> textFile.count
res4: Long = 4385

The issue is probably in the spark-notebook configuration.
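
(One plausible explanation for the difference: the spark-shell on the master uses exactly the Spark build installed in that container, while the notebook runs its own driver and only connects to the standalone cluster, shipping serialized closures and RDDs to the executors. A hedged sketch of such a driver-side setup; the app name and the master URL `spark://spark-master:7077` are illustrative assumptions, not values taken from the repo's notebook metadata:)

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical notebook-side driver configuration. The driver lives in the
// notebook container; tasks are serialized with the notebook's Spark jars
// and deserialized by the executors with the cluster's Spark jars. If those
// two builds differ, deserialization fails with errors like the one above.
val conf = new SparkConf()
  .setAppName("workbench-test")               // illustrative name
  .setMaster("spark://spark-master:7077")     // assumed standalone master URL
val sc = new SparkContext(conf)

val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
println(textFile.count())
```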

earthquakesan commented 6 years ago

Hi @radianv,

Sorry for the late reply. I had a lot of issues with spark-notebook and switched to Apache Zeppelin in the end. The issue you had is most likely a Spark version mismatch between spark-notebook and the Spark master.
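
(A quick way to confirm such a mismatch is to compare what each driver reports; `sc.version` is the SparkContext field for this. Sketch below, to be run once in a notebook cell and once in a spark-shell inside the spark-master container:)

```scala
// If the two environments print different values (e.g. 2.0.x vs 2.1.x),
// RDD classes such as MapPartitionsRDD are deserialized on the executors
// against a different class definition, which matches the
// List$SerializationProxy error reported above. A Scala version mismatch
// (2.10 vs 2.11) can produce the same symptom.
println(s"Spark version: ${sc.version}")
println(s"Scala version: ${util.Properties.versionString}")
```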

MahsaSeifikar commented 4 years ago

I have the same issue! Any solution?

SuperElectron commented 4 years ago

This error also occurs inside the spark-master container for `val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")`.

From the abundance of errors in the issues related to HDFS and nodes/workers, it seems like something in the configuration is definitely missing.

It is also worth noting that the steps in the walk-through blog post do not work: https://www.big-data-europe.eu/scalable-sparkhdfs-workbench-using-docker/

Can anyone successfully complete the steps in this ^^^^ blog post?