Closed: JuanVargas closed this issue 1 year ago.
Can you post the output of this (in the Julia console)?
] st Spark
Let's go through your code piece by piece:
using Pkg; Pkg.add("Spark")
Pkg.add("CSV"); using CSV
I'm not sure why you refer to CSV here - it is a completely different and independent package (https://github.com/JuliaData/CSV.jl). Please find its docs and report any issues in that repository.
ENV["BUILD_SPARK_VERSION"] = "3.2.1" # version you need
Pkg.build("Spark")
You don't need to do it every time. In fact, Apache Spark version 3.2.1 is the default in the current version of Spark.jl, so you can skip these lines completely.
using Spark
import Spark.SparkSession
Spark.init()
using Spark already exports SparkSession and calls init(), so just the first line - using Spark - is enough.
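For completeness, a minimal setup then looks roughly like this (a sketch of the PySpark-style builder API that Spark.jl mimics; the appName and master values are placeholders):

using Spark
# Create (or reuse) a session; "local" runs Spark in-process
spark = SparkSession.builder.appName("Main").master("local").getOrCreate()

The snippets below assume this spark object.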
Then you have 4 examples of DataFrame creation that work and you seem to be happy with them. Then:
# code below DID NOT WORK
# df = CSV.File(csvDir*f1*".csv"; header=true) |> DataFrame
As I said earlier, Spark.jl doesn't have any integration with CSV.jl and thus can't convert a CSV.File to a Spark.DataFrame. We may add the converter, but it would load the whole file into memory, which is generally discouraged in Spark. The right way to do it is to use Spark's own methods, e.g. this:
# code below worked
# fCsv = spark.read.option("header", true).csv(csvDir*f1*".csv")
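For example, the same read with an explicit path instead of the csvDir*f1 concatenation (the path is a placeholder):

# Read a CSV with a header row straight into a Spark DataFrame
fCsv = spark.read.option("header", true).csv("/path/to/groceries.csv")
fCsv.show()   # quick sanity check of what was loaded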
Next:
dfGroceries = spark.createDataFrame(rowGroceries, StructType("type string, color string, price double, amount int")) isa DataFrame
Here you create an object and check that it's a DataFrame, so the result of the expression is of type Bool, not DataFrame. Instead, try this:
dfGroceries = spark.createDataFrame(rowGroceries) # schema will be inferred automatically
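For reference, rowGroceries here could be a vector of rows whose fields match the output below — a sketch, assuming a Row constructor with keyword arguments as in PySpark:

# Hypothetical construction of the input data
rowGroceries = [
    Row(type="banana", color="yellow", price=1.0,  amount=10),
    Row(type="apple",  color="yellow", price=2.0,  amount=20),
    Row(type="carrot", color="yellow", price=0.5,  amount=30),
    Row(type="grape",  color="blue",   price=0.25, amount=50),
]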
Now you can call .first() and .show() on this data frame:
julia> dfGroceries.first()
[banana,yellow,1.0,10]
julia> dfGroceries.show()
+------+------+-----+------+
| type| color|price|amount|
+------+------+-----+------+
|banana|yellow| 1.0| 10|
| apple|yellow| 2.0| 20|
|carrot|yellow| 0.5| 30|
| grape| blue| 0.25| 50|
+------+------+-----+------+
Note that there are still no methods .first(num) and .size(). Probably, instead of .first(num) you want to use .head(num), and instead of .size() you can use length(dfGroceries.columns()) to count columns or dfGroceries.count() to count rows.
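Applied to the data frame above, that looks something like this (a sketch; the expected values follow from the 4-row, 4-column table shown earlier):

dfGroceries.head(2)              # first two rows: banana and apple
length(dfGroceries.columns())    # number of columns -> 4
dfGroceries.count()              # number of rows -> 4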
As a bottom line:
- Read the docs. If you see an inconsistency between Spark.jl and its docs, please report it. Most docs are now self-tested, but mistakes still happen. The easiest way to fix them is to report the specific entry in the docs and the actual behavior.
- Don't assume integrations. Spark.jl currently doesn't work with CSV.jl, DataFrames.jl or any other similar package. These integrations may be added later, but they are not on the table right now.
- Don't assume methods. If something is not directly mentioned in the Spark.jl docs (http://dfdx.github.io/Spark.jl/dev), you can check out PySpark's docs (https://spark.apache.org/docs/3.1.1/api/python/reference/pyspark.sql.html) or even read the source, which is intentionally split into files with self-describing names (e.g. all methods that you can call on a DataFrame are grouped in dataframe.jl: https://github.com/dfdx/Spark.jl/blob/main/src/dataframe.jl).

Thanks, I'll check it out.
Hi Andrei,
Yes, I revisited the code as per your suggestions and I can see better what is happening. In retrospect, I assumed that the DataFrames from Spark.jl were the same as those from DataFrames.jl, but when I traced the code it looks like the data frames from Spark.jl in fact wrap org.apache.spark.sql.Dataset, so the methods etc. are NOT the same. This also means I need to take a deeper dive into PySpark to get more clarity. This exercise is proving very illuminating for me, and I hope it is not too frustrating for you. Thank you.
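A quick REPL check makes the distinction visible (a sketch reusing the isa test from the earlier snippet):

julia> dfGroceries isa Spark.DataFrame   # Spark.jl's wrapper around the JVM-side Dataset
true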
Closing because of no activity.
While I would really like to see this project succeed, every time I try to use it, new issues surface.