dfdx / Spark.jl

Julia binding for Apache Spark

New version does not work as stated in docs #112

Closed JuanVargas closed 1 year ago

JuanVargas commented 2 years ago

While I would really like to see this project succeed, every time I try to use it, new issues surface:

  1. I previously reported in issue #108 that line 86 in init.jl was causing an error: p = split(read(spark_defaults_conf, String), '\n', keepempty=false)
  2. The fix provided by Andrei worked: p = split(Base.read(spark_defaults_conf, String), '\n', keepempty=false)
  3. Unfortunately the issue resurfaced in the new version. Line 86 is back to p = split(read(spark_defaults_conf, String), '\n', keepempty=false), which is easily 'fixable'.
  4. Unfortunately, the changes in the new version v0.6 cause failures elsewhere.
  5. Before adding/building/using the new code from v0.6, I removed the Spark package and also erased the source files on my local machine where the Spark files were maintained.
  6. Reinstalled the new version of Spark:
using Pkg; Pkg.add("Spark")
ENV["BUILD_SPARK_VERSION"] = "3.2.1"   # version you need
Pkg.build("Spark")
  7. Tried to follow the code example given in the documentation of the new version:
using Spark.SQL #### ERROR: SQL does not exist

using Spark
spark = Spark.SparkSession.builder.appName("Main").master("local").getOrCreate()
df = spark.createDataFrame([["Alice", 19], ["Bob", 23]], "name string, age long")

###### Exception in thread "main" java.lang.NoSuchMethodError: make
###  JavaCall.JavaCallError("Error calling Java: java.lang.NoSuchMethodError: make")

Stacktrace:
  [1] geterror(allow::Bool)
    @ JavaCall ~/.julia/packages/JavaCall/MlduK/src/core.jl:418
  [2] jcall(typ::Type{JavaCall.JavaObject{Symbol("scala.collection.mutable.ArraySeq")}}, method::String, rettype::Type, argtypes::Tuple{DataType}, args::JavaCall.JObject)
    @ JavaCall ~/.julia/packages/JavaCall/MlduK/src/core.jl:226
  [3] convert
    @ ~/.julia/packages/Spark/0luxD/src/convert.jl:81 [inlined]
  [4] Row
    @ ~/.julia/packages/Spark/0luxD/src/row.jl:16 [inlined]
  [5] iterate
    @ ./generator.jl:47 [inlined]
  [6] _collect(c::Vector{Vector{Any}}, itr::Base.Generator{Vector{Vector{Any}}, Type{Row}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:744
  [7] collect_similar
    @ ./array.jl:653 [inlined]
dfdx commented 2 years ago

Can you post the output of this (in Julia console)?

] st Spark
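
In case the ] notation is unfamiliar: that is Pkg REPL mode, and on a reasonably recent Julia the equivalent plain function call is:

using Pkg
Pkg.status("Spark")
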
JuanVargas commented 2 years ago

p2.jl.txt

dfdx commented 2 years ago

Let's go through your code piece by piece:

using Pkg; Pkg.add("Spark")
Pkg.add("CSV"); using CSV

I'm not sure why you refer to CSV here - it is a completely different and independent package. Please find its docs and report any issues in the linked repository.

ENV["BUILD_SPARK_VERSION"] = "3.2.1"   # version you need
Pkg.build("Spark")

You don't need to do it every time. In fact, Apache Spark version 3.2.1 is the default in the current version of Spark.jl, so you can skip these lines completely.

using Spark
import Spark.SparkSession
Spark.init()

using Spark already exports SparkSession and calls init(), so just the first line - using Spark - is enough.
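
For reference, a minimal end-to-end sketch of the setup discussed so far; it only combines lines already quoted in this thread, and "Main"/"local" are just placeholder values:

using Spark   # exports SparkSession and calls init() automatically

# build a local session (appName and master values are arbitrary examples)
spark = Spark.SparkSession.builder.appName("Main").master("local").getOrCreate()

# create a small DataFrame with an explicit schema, as in the docs example
df = spark.createDataFrame([["Alice", 19], ["Bob", 23]], "name string, age long")
df.show()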

Then you have 4 examples of DataFrame creation that work, and you seem to be happy with them. Then:

#  code below DID NOT WORK
# df = CSV.File(csvDir*f1*".csv"; header=true) |> DataFrame

As I said earlier, Spark.jl doesn't have any integration with CSV.jl and thus can't convert CSV.File to a Spark.DataFrame. We may add the converter, but it will load the whole file into memory, which is generally discouraged in Spark. The right way to do it is to use Spark's own methods, e.g. this:

# code below worked
# fCsv = spark.read.option("header", true).csv(csvDir*f1*".csv")
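
Uncommented and with a hypothetical file path, that pattern is simply:

# read a CSV with a header row through Spark itself, not CSV.jl
# ("/path/to/groceries.csv" is a placeholder path)
fCsv = spark.read.option("header", true).csv("/path/to/groceries.csv")
fCsv.show()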

Next:

dfGroceries = spark.createDataFrame(rowGroceries, StructType("type string, color string, price double, amount int")) isa DataFrame

Here you create an object and check that it's a DataFrame, so the result of the expression is of type Bool, not DataFrame. Instead, try this:

dfGroceries = spark.createDataFrame(rowGroceries)   # schema will be inferred automatically
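
If you do want to keep the explicit schema, just don't fold the isa check into the assignment; a small sketch:

# assign the DataFrame itself and, if needed, check its type separately
dfGroceries = spark.createDataFrame(rowGroceries, StructType("type string, color string, price double, amount int"))
@assert dfGroceries isa DataFrame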

Now you can call .first() and .show() on this data frame:

julia> dfGroceries.first()
[banana,yellow,1.0,10]

julia> dfGroceries.show()
+------+------+-----+------+
|  type| color|price|amount|
+------+------+-----+------+
|banana|yellow|  1.0|    10|
| apple|yellow|  2.0|    20|
|carrot|yellow|  0.5|    30|
| grape|  blue| 0.25|    50|
+------+------+-----+------+

Note that there are still no .first(num) and .size() methods. Instead of .first(num) you probably want .head(num), and instead of .size() you can use length(dfGroceries.columns()) to count columns or dfGroceries.count() to count rows.
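
For example, continuing with the dfGroceries frame from above:

# instead of .first(num): take the first two rows
first_two = dfGroceries.head(2)

# instead of .size(): count columns and rows separately
ncols = length(dfGroceries.columns())
nrows = dfGroceries.count()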


As a bottom line:

  1. Read the docs. If you see an inconsistency between Spark.jl and its docs, please report it. Most docs are now self-tested, but mistakes still happen. The easiest way to get them fixed is to report the specific entry in the docs and the actual behavior.
  2. Don't assume integrations. Spark.jl currently doesn't work with CSV.jl, DataFrames.jl or any other similar package. These integrations may be added later, but it's not on the table right now.
  3. Don't assume methods. If something is not directly mentioned in the Spark.jl's docs, you can check out PySpark's docs or even read the source, which is intentionally split into files with self-describing names (e.g. all methods that you can call on a DataFrame are grouped in dataframe.jl).
JuanVargas commented 2 years ago

Thanks, I'll check it out.

JuanVargas commented 2 years ago

Hi Andrei,

Yes, I revisited the code again as per your suggestions and I can see better what is happening. In retrospect I assumed that the DataFrames from Spark.jl were the same as those from DataFrames.jl, but when I traced the code, it looks like the data frames from Spark.jl actually wrap org.apache.spark.sql.Dataset, so the methods etc. are NOT the same. This also means that I need to take a deeper dive into PySpark to get more clarity. This exercise is proving very illuminating for me and I hope it is not too frustrating for you. Thank you.
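
For instance, just to make the difference concrete for myself (the Spark.jl side uses only methods mentioned in this thread; the DataFrames.jl side uses its usual nrow/names functions):

# Spark.jl DataFrame (wraps org.apache.spark.sql.Dataset): methods mirror PySpark
nrows_spark = dfGroceries.count()
cols_spark  = dfGroceries.columns()

# DataFrames.jl DataFrame: plain Julia functions, a different API entirely
using DataFrames
jdf = DataFrames.DataFrame(type=["banana"], color=["yellow"], price=[1.0], amount=[10])
nrows_julia = nrow(jdf)
cols_julia  = names(jdf)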

dfdx commented 1 year ago

Closing because of no activity.