dfdx / Spark.jl

Julia binding for Apache Spark

Spark.init() throws an error #54

Closed niczky12 closed 6 years ago

niczky12 commented 6 years ago

I'm trying to use the package on a mac with Julia 0.6.2, but when I try Spark.init() I get the following error:

ERROR: SystemError: opening file /usr/local/Cellar/apache-spark/2.2.1/libexec/conf/spark-defaults.conf: No such file or directory
Stacktrace:
 [1] #systemerror#44 at ./error.jl:64 [inlined]
 [2] systemerror(::String, ::Bool) at ./error.jl:64
 [3] open(::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./iostream.jl:104
 [4] open(::Base.#readstring, ::String) at ./iostream.jl:150
 [5] load_spark_defaults(::Dict{Any,Any}) at /Users/bkomar/.julia/v0.6/Spark/src/init.jl:51
 [6] init() at /Users/bkomar/.julia/v0.6/Spark/src/init.jl:5

I have Spark installed and it runs okay from R via sparklyr. Are there some additional setup steps that I missed?

Thanks

dfdx commented 6 years ago

I'd start by setting SPARK_HOME="" (empty string) and maybe rebuilding the package with Pkg.build("Spark"). In that case I'd expect Spark.jl to use the built-in version of Spark (shipped as part of the uberjar in jvm/sparkjl) and run smoothly.

If this works, you can update this line to 2.2.1 (to match your installed version of Spark), rebuild again, and set SPARK_HOME back to its previous value.

Please let me know if this works.
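For reference, a rough sketch of that workaround in the Julia 0.6 REPL (untested sketch; setting the variable inside Julia via ENV may not be enough, in which case clear it in the shell before starting Julia):

# Sketch of the suggested workaround (Julia 0.6-era Pkg API).
# Clearing SPARK_HOME makes Spark.jl fall back to the Spark build bundled
# in the jvm/sparkjl uberjar instead of a system installation.
ENV["SPARK_HOME"] = ""   # may need to be unset in the shell instead (see below)
Pkg.build("Spark")

using Spark
Spark.init()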

niczky12 commented 6 years ago

I tried setting SPARK_HOME="" in Julia, but it didn't do anything. Then I set it as an environment variable on my mac. That sort of worked: it rebuilt Spark successfully, but when I try to run Spark.init() it throws the following:

julia> Spark.init()

signal (11): Segmentation fault: 11
while loading no file, in expression starting on line 0
unknown function (ip: 0x11e6842b3)
Allocations: 4098280 (Pool: 4096766; Big: 1514); GC: 6

I'm not sure where or how I would update the referenced line to fix this. Can you give a bit more detail on this? Sorry if these are silly questions. Thanks!

dfdx commented 6 years ago

I tried setting SPARK_HOME="" in Julia, but didn't do anything. Then I set this as environment variable on my mac.

Ah, sorry, I indeed meant the env var. Glad that it resolved the issue.

signal (11): Segmentation fault: 11

It looks like a known issue with JavaCall.jl / the JVM on macOS. Fortunately, it's just a spurious message that doesn't actually prevent you from running the code: even though the REPL looks like it has hung, you can press Enter and it should continue normally.

niczky12 commented 6 years ago

This works! Thanks a lot. So what was that bit you were saying about using my already installed Spark version?

dfdx commented 6 years ago

When you run Pkg.build("Spark"), it builds a Maven project whose pom.xml defines all of its dependencies, including Spark and its version, 2.1.0 by default. On the other hand, your SPARK_HOME (before editing) points to a separate installation, /usr/local/Cellar/apache-spark/2.2.1/ according to the error message. So when you initialize Spark.jl, it looks at your SPARK_HOME and tries to read a config file from it, but that 2.2.1 installation doesn't have the config, so it fails.

I started writing a detailed description of how to bypass this by manually editing pom.xml, but it made me think about what the proper fix would be, which resulted in the configurable-spark-version branch. Please check it out and run from the Julia REPL:

ENV["BUILD_SPARK_VERSION"] = "2.2.1"
Pkg.build("Spark")

Then set the SPARK_HOME env var back to its default value on your system (e.g. by opening a fresh terminal window) and try running the examples. If this works for you, I'll merge the branch and make the feature available for everybody.
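Roughly, the whole sequence would then look like this (the Pkg.checkout call is my guess at how to grab the branch; untested):

# Julia 0.6-era Pkg API; branch name and env var as discussed in this thread
Pkg.checkout("Spark", "configurable-spark-version")
ENV["BUILD_SPARK_VERSION"] = "2.2.1"   # picked up by the Maven build
Pkg.build("Spark")
# then restart Julia with SPARK_HOME at its usual value and run:
using Spark
Spark.init()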

niczky12 commented 6 years ago

I tried the above. First I checked out configurable-spark-version by running Pkg.checkout("Spark", "configurable-spark-version"). I made sure SPARK_HOME was not set as an environment variable and set ENV["BUILD_SPARK_VERSION"] as you said above. The build ran okay, but Spark.init() gave me the same error as before:

julia> Spark.init()
ERROR: SystemError: opening file /usr/local/Cellar/apache-spark/2.2.1/libexec/conf/spark-defaults.conf: No such file or directory
Stacktrace:
 [1] #systemerror#44 at ./error.jl:64 [inlined]
 [2] systemerror(::String, ::Bool) at ./error.jl:64
 [3] open(::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./iostream.jl:104
 [4] open(::Base.#readstring, ::String) at ./iostream.jl:150
 [5] load_spark_defaults(::Dict{Any,Any}) at /Users/bkomar/.julia/v0.6/Spark/src/init.jl:51
 [6] init() at /Users/bkomar/.julia/v0.6/Spark/src/init.jl:5

Am I doing something wrong? I'd be happy to do some testing if needed :)

dfdx commented 6 years ago

In the logs of Pkg.build(), what version of spark-core is used (it should be the text right next to "spark-core")?

If it's still "2.1.0", something went wrong. In that case, maybe check from the command line that the branch is indeed "configurable-spark-version" (I've never used the Pkg.checkout(pkg, branch) form, so I'm not sure it's reliable).

If the version is "2.2.1", then I believe you have some unusual version of Spark with a different directory layout. How did you install it?

niczky12 commented 6 years ago

I had a look at the build logs; it seems like it's still using 2.1.1:

[INFO] Including org.apache.spark:spark-core_2.11:jar:2.1.1 in the shaded jar.

I think I'm on the right branch according to Pkg:

julia> Pkg.status("Spark")
 - Spark                         0.2.0+             configurable-spark-version

Also confirmed by git on the command line:

lon-mac-bkomar:Spark bkomar$ git branch
* configurable-spark-version
  master

So I'm definitely on the correct branch. I'll reinstall Spark if I have time over the weekend and see if that works. Thanks for your help.

Any gotchas I should look out for while installing Spark?

dfdx commented 6 years ago

I don't think reinstalling the same version of Spark will help. Previously I ran into issues with Spark installed from different builds, e.g. one from Cloudera's CDH and another downloaded from the official website (CDH puts the configs in a separate directory, together with the configs of other Hadoop tools). If you use a build that is significantly different from the one on the official site, you may run into the same issue.

If this is the case and you can find where spark-defaults.conf lives in your installation, we can update the way we discover this file, and that may be enough to get things working.

But first of all, I'd make sure that Spark.jl and your installed Spark have the same version. Can you please change this line to 2.2.1 and rebuild Spark.jl once again?

niczky12 commented 6 years ago

I tried changing the pom.xml file but got the same error. I also realised that this Spark install came from sparklyr, so it might be a different build than the official one. But I finally found the issue. Thank you so much for all your help. Basically, this build of Spark has a spark-defaults.conf.template file instead of spark-defaults.conf in the folder mentioned above. I changed init.jl where you pointed me, and I was able to build and run Spark.init() in Julia.

I'm not sure whether this is worth the trouble of fixing, as it's not a bug but more of a problem with different Spark builds...

Maybe there could be an option to change the location where Spark.jl expects to see this config file via an ENV variable? I don't know, I'm really out of my depth here.

If you want, I can run further tests on my machine regarding the different spark versions. Let me know and I'd be happy to help. Otherwise, we can just close this issue.

dfdx commented 6 years ago

It's worth discovering this sort of thing automatically, so I created the spark-conf-location branch, which looks for spark-defaults.conf.template as well. Could you please check out this (totally untested) branch and tell me whether it finds the config correctly now?
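Conceptually, the lookup on that branch does something along these lines (just a sketch, not the exact source; the spark_defaults_locs / conf_idx names are the ones that show up in the diff quoted below):

# Illustrative sketch of a config-discovery fallback (Julia 0.6 semantics:
# findfirst returns 0 when no element matches the predicate).
function find_spark_defaults(spark_home::AbstractString)
    spark_defaults_locs = [
        joinpath(spark_home, "conf", "spark-defaults.conf"),
        joinpath(spark_home, "conf", "spark-defaults.conf.template")
    ]
    conf_idx = findfirst(isfile, spark_defaults_locs)
    return conf_idx == 0 ? nothing : spark_defaults_locs[conf_idx]
end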

Also, did you manage to build Spark.jl for version 2.2.1 through the environment variable using the configurable-spark-version branch?

niczky12 commented 6 years ago

Hi,

So I could build Spark.jl for version 2.2.1 with configurable-spark-version, but Spark.init() failed due to the different file name.

I checked out your spark-conf-location branch. There was one typo in it on line 57 in src/init.jl:

-        spark_defaults_conf = spark_default_locs[conf_idx]
+        spark_defaults_conf = spark_defaults_locs[conf_idx]

I fixed this manually, and both Pkg.build and Spark.init ran flawlessly on my machine. 😄

dfdx commented 6 years ago

Perfect, so I'll fix the typo and merge both branches. Thanks for testing and debugging!

dfdx commented 6 years ago

Done, merged both branches (configurable Spark version and improved config discovery) to master.

Is there anything else I can help with in this issue?

niczky12 commented 6 years ago

Nope. I think this can be closed. Thanks for your help.