elbamos / Zeppelin-With-R

Mirror of Apache Zeppelin (Incubating)
Apache License 2.0

Using spark connectors in spark.r interpreter #5

Closed samuel-pt closed 8 years ago

samuel-pt commented 8 years ago

Hi,

I just cloned https://github.com/elbamos/Zeppelin-With-R into an EC2 instance and ran the steps below to build Zeppelin:

git clone https://github.com/elbamos/Zeppelin-With-R.git
cd Zeppelin-With-R
mvn package install -DskipTests

I've configured SPARK_HOME in conf/zeppelin-env.sh and then started Zeppelin with bin/zeppelin-daemon.sh start. Zeppelin started up and I am able to access it.
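For reference, the relevant lines are roughly the following (the SPARK_HOME path here is just a placeholder for wherever Spark is unpacked on the instance):

# conf/zeppelin-env.sh -- point Zeppelin at the local Spark installation
export SPARK_HOME=/home/ubuntu/spark

# then start the Zeppelin daemon
bin/zeppelin-daemon.sh start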

As I need to load a CSV file in R, I loaded the spark-csv package as below:

%dep
z.reset()

// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.3.0")

The library loads successfully and I can see the artifacts in the local-repo folder. But when I try to use it with the R code below, it fails.

%spark.r

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true") 

Below is the error

Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:60)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:60)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:60)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:60)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:60)
    ... 29 more

It seems the library is not properly loaded in the spark.r interpreter, since the Scala code below works fine:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

Please let me know if I am missing any configuration.

Thanks, Sam.

elbamos commented 8 years ago

@samuel-pt thanks for reporting this.

Whatever classes are available in Scala should also be available to R, since SparkR just passes the command on to the Spark backend: https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R#L518

Is it possible for you to make a Jupyter notebook that's a minimal reproducible example replicating the error? I will try to use it to diagnose what's going on.

Can you also try to use the (deprecated) scala SQLContext.load(source, schema,...) function and see if that works?

samuel-pt commented 8 years ago

@elbamos Thanks for looking into this issue.

I am not familiar with Jupyter notebooks. Maybe if you have a sample, you can share it and let me update it with my code. Otherwise, my code in the notebook is very simple.

Just the below in one paragraph:

%dep
z.reset()

// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.3.0")

From terminal,

$ wget https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv

Then add the paragraph below:

%spark.r

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true") 

Please provide the exact location of the cars.csv file. Executing this paragraph will then throw a ClassNotFoundException.

elbamos commented 8 years ago

Sorry - I meant a Zeppelin notebook. Please include the command that shows it works in Scala.

samuel-pt commented 8 years ago

The Scala command below works in Zeppelin:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv") // Provide exact location of cars.csv here

The deprecated Scala SQLContext.load(source, schema, ...) call works as well:

val df1 = sqlContext.load("/home/ubuntu/cars.csv", "com.databricks.spark.csv")

elbamos commented 8 years ago

@samuel-pt This seems to be related to the dependency loader. It may take me a bit to unravel. In the meantime, if you issue the read command from Scala, you should then be able to access the data from R as a temp table.
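Roughly like this, as two notebook paragraphs (a sketch I haven't run here, assuming Spark 1.x where DataFrame.registerTempTable and SparkR's sql() are available; substitute your actual path to cars.csv):

%spark
// Read the CSV in Scala; nothing is materialized yet, the DataFrame stays lazy
val cars = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/home/ubuntu/cars.csv")

// The temp table is just a named reference to that lazy DataFrame
cars.registerTempTable("cars")

%spark.r
# Pick the same data up from SparkR via the temp table
carsDF <- sql(sqlContext, "SELECT * FROM cars")
head(carsDF)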

samuel-pt commented 8 years ago

That's fine, I will use the workaround, i.e. read() from Scala, until this is solved. But this workaround will have a performance impact if the data to be persisted in the temp table is huge :(

elbamos commented 8 years ago

@samuel-pt It shouldn't -- aren't temp tables just pointers with metadata that aren't persisted?

samuel-pt commented 8 years ago

Oh fine, I didn't know that.

Anyway, it would help us to have everything in R instead of reading the data from Scala and doing everything else in R.

elbamos commented 8 years ago

You're correct that this should be working from R.

However, it will never be possible to do everything in Spark from R.

That's the purpose of rZeppelin -- you can integrate Scala-only functionality into the same pipeline as your R code, without breaking lazy evaluation.

ghost commented 8 years ago

@samuel-pt Apparently, loading from .csv is now accepted as a bug in Zeppelin. It should start to work in rZeppelin once it's fixed in Zeppelin.

ghost commented 8 years ago

That's interesting. I'm curious why you suspect Mesos?

Anyway, to debug this, it would help to know what R thinks is being returned by the createDataFrame() and head() functions, and what's happening during startup. Is it possible for you to run each command as a separate notebook cell and upload the log of the whole Zeppelin session?

I'd also appreciate it if you could open this as a new issue - it's definitely not the same thing samuel was dealing with.

Thanks for reporting this!

samuel-pt commented 8 years ago

By adding the Spark packages in the interpreter settings, I am able to use them in the %r and %spark.r interpreters. Closing this issue.
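For anyone else hitting this before the dependency-loader fix: an alternative that should be equivalent (a sketch I haven't verified myself, assuming SPARK_HOME is set so Zeppelin launches Spark through spark-submit) is to pass the package via SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh:

# conf/zeppelin-env.sh -- pull spark-csv in for every interpreter via spark-submit
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.3.0"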

Thanks for your help