Closed: samuel-pt closed this issue 8 years ago
@samuel-pt thanks for reporting this.
Whatever classes are available in Scala should be available to R, since SparkR just passes the command on to the Spark backend.
Is it possible for you to make a Jupyter notebook with a minimal reproducible example of the error? I will try to use that to diagnose what's going on.
Can you also try the (deprecated) Scala SQLContext.load(source, schema,...) function and see if that works?
@elbamos Thanks for looking into this issue.
I am not familiar with Jupyter notebooks. Maybe if you have a sample, you could share it and I will update it with my code. Otherwise, my code in the notebook is very simple,
just the following in one paragraph:
// Add spark-csv package
From terminal,
$ wget
Then add the paragraph below:
df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true")
Please provide the exact location of the cars.csv file. Executing this paragraph will throw a ClassNotFoundException.
Sorry, I meant a Zeppelin notebook. Please include the command that shows that it works in Scala.
The Scala command below works in Zeppelin:
val df =
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv") // Provide exact location of cars.csv here
The deprecated Scala call SQLContext.load(source, schema,...) works as well:
val df1 = sqlContext.load("/home/ubuntu/cars.csv", "com.databricks.spark.csv")
@samuel-pt This seems to be related to the dependency loader. It may take me a bit to unravel. In the meantime, if you send the read() command from Scala, you should then be able to access the data from R as a temp table.
That's fine, I will use the workaround (read() from Scala) until this is solved. But won't this workaround have a performance impact if the data to be persisted in the temp table is huge? :(
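A rough sketch of that workaround, assuming a Spark 1.x Zeppelin Scala paragraph where `sqlContext` is already provided by the interpreter. The table name `cars` is only an illustrative choice, and the file path is the one used elsewhere in this thread:

```scala
// Zeppelin Scala paragraph (Spark 1.x); `sqlContext` is assumed to be
// injected by the interpreter, so it is not created here.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line as the header
  .option("inferSchema", "true") // infer column types automatically
  .load("/home/ubuntu/cars.csv") // adjust to the real location of cars.csv

// Register the DataFrame under a name that other paragraphs can look up.
// "cars" is a hypothetical table name.
df.registerTempTable("cars")
```

If the R interpreter shares the same SQLContext, something like `cars <- sql(sqlContext, "SELECT * FROM cars")` in an R paragraph should then return the same data as a SparkR DataFrame.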
@samuel-pt It shouldn't -- aren't temp tables just pointers with metadata, that aren't persisted?
Oh, fine. I didn't know that.
Anyway, it would help us to have everything in R instead of reading data from Scala and doing everything else in R.
You're correct that this should be working from R.
However, it will never be possible to do everything in Spark from R.
That's the purpose of rZeppelin -- you can integrate Scala-only functionality into the same pipeline with your R code, without breaking lazy evaluation.
@samuel-pt Apparently, loading from .csv is now accepted as a bug in Zeppelin. It should start to work in rZeppelin once it's fixed in Zeppelin.
That's interesting. I'm curious why you suspect Mesos?
Anyway, to debug this, it would help to know what R thinks is being returned by the createDataFrame() and head() functions, and what's happening during startup. Is it possible for you to try running each command as a separate notebook cell and upload the log of the whole Zeppelin session?
I'd also appreciate it if you could open this as a new issue. It's definitely not the same thing Samuel was dealing with.
Thanks for reporting this!
By adding the Spark packages in the interpreter settings, I am able to use them in the %r and %spark.r interpreters. Closing this issue.
Thanks for your help
I just cloned into an EC2 instance and followed the steps below to build Zeppelin.
I've configured SPARK_HOME in conf/ Then I started Zeppelin using bin/ start. Zeppelin started and I am able to access it.
As I need to load a CSV file in R, I've loaded the spark-csv library as below
The library loads successfully and I see the artifacts in the local-repo folder. But when I try to use the library with the R code below, it fails.
Below is the error
It seems the library is not properly loaded in the spark.r interpreter, as the Scala code below works fine.
Please let me know if I am missing any configuration.
Thanks, Sam.