Open Nuglar opened 2 years ago
Hi @Nuglar :-) Which versions of the pyspark package did you try, and with which versions did you experience the trouble? You only tested on the Windows platform, right?
I extracted four major points from your text:
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()
22/07/25 13:24:28 WARN Utils: Your hostname ...
22/07/25 13:24:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/07/25 13:24:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> spark.createDataFrame([(1,)], schema=['id']).show()
+---+
| id|
+---+
| 1|
+---+
Furthermore, I do not understand why you would need the findspark package, as $CONDA_PREFIX/lib/python3.8/site-packages/pyspark/find_spark_home.py is already packaged with pyspark.
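For what it's worth, here is a minimal sketch showing that the bundled helper is enough to resolve SPARK_HOME without findspark (note that _find_spark_home is a private function inside pyspark/find_spark_home.py, so its name is an implementation detail that may change between releases):

# Sketch: confirm that the conda-forge pyspark package resolves SPARK_HOME on
# its own, using the private helper from pyspark/find_spark_home.py.
from pyspark.find_spark_home import _find_spark_home
print(_find_spark_home())  # typically .../site-packages/pyspark

# The same lookup happens implicitly when a SparkSession is built:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)
spark.stop()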
- Java/OpenJDK missing as dependency -> I am not a fan of depending on openjdk, as people might want to use any other Java flavor; depending on the packaged openjdk would then be quite annoying. As the error message is quite clear, I would leave it as is and trust that people know what to do.
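For reference, a minimal sketch of how a user could check up front whether any JVM is reachable at all (standard library only; the exact launcher error text varies between pyspark versions):

# Sketch: check whether a java executable is reachable before starting pyspark.
# Looks at JAVA_HOME first, then falls back to whatever "java" is on PATH.
import os
import shutil

java_home = os.environ.get("JAVA_HOME")
if java_home:
    java = shutil.which("java", path=os.path.join(java_home, "bin"))
else:
    java = shutil.which("java")

if java:
    print(f"java executable: {java}")
else:
    print("no java found; install openjdk or point JAVA_HOME at a JDK")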
I have to say I disagree quite strongly - the package is not functional without openjdk, and we should ship it.
Regarding different java flavours, that's definitely not the default, and should IMO be opt-out, e.g. there could be an output pyspark-base or pyspark-no-openjdk that provides the package without openjdk.
Regarding wanting to use different Java versions, it's possible to package against several Java versions (cf. pyjnius). In the case I just encountered, the error message is not clear, particularly if the machine already has a JAVA_HOME that's somehow populated.
I have to say I disagree quite strongly - the package is not functional without openjdk, and we should ship it.
This is exacerbated by the fact that the new Java 17 LTS is only compatible with the very recent pyspark >=3.3, and anyone installing an older pyspark and doing "just add openjdk" will run into another failure mode.
Will this ever be considered for implementation? It's a PITA to get pyspark running on Windows.
A system environment variable called HADOOP_HOME (system-wide, not per-user, despite my Miniconda being installed for the local user only) had to be created and set to the same value as SPARK_HOME (obviously this won't work when switching virtual environments, but you get the idea).
Especially painful is needing to change these environment variables in the system settings every time you want to use a new environment, and copying winutils and hadoop into the newly installed pyspark instance.
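For what it's worth, until this is handled by the package itself, the variables can also be set per process rather than system-wide; a sketch (this assumes winutils.exe has already been copied into the pyspark bin directory, which the package does not do for you):

# Sketch of a per-process workaround: export the Hadoop-related variables
# before the JVM is launched, instead of editing the Windows system settings.
# winutils.exe must already be present in %HADOOP_HOME%\bin for this to help.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
os.environ.setdefault("SPARK_HOME", spark_home)
os.environ.setdefault("HADOOP_HOME", spark_home)

spark = pyspark.sql.SparkSession.builder.getOrCreate()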
Will this ever be considered for implementation?
If it doesn't break the average use-case, then of course it will be considered! PRs are always welcome. :)
Especially painful is needing to change these environment variables in the system settings every time you want to use a new environment, and copying winutils and hadoop into the newly installed pyspark instance.
That's possible to fix with activation scripts.
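A sketch of how that could look for this case, assuming conda's standard hook locations (every *.bat in etc\conda\activate.d runs on "conda activate"; the file name pyspark-vars.bat is just an example):

# Sketch: write activation/deactivation hooks into the active environment so
# SPARK_HOME and HADOOP_HOME follow the environment instead of living in the
# Windows system settings. Run once inside the activated environment.
import os
import pyspark

prefix = os.environ["CONDA_PREFIX"]
spark_home = os.path.dirname(pyspark.__file__)

activate_d = os.path.join(prefix, "etc", "conda", "activate.d")
deactivate_d = os.path.join(prefix, "etc", "conda", "deactivate.d")
os.makedirs(activate_d, exist_ok=True)
os.makedirs(deactivate_d, exist_ok=True)

with open(os.path.join(activate_d, "pyspark-vars.bat"), "w") as f:
    f.write(f'@set "SPARK_HOME={spark_home}"\n')
    f.write(f'@set "HADOOP_HOME={spark_home}"\n')

with open(os.path.join(deactivate_d, "pyspark-vars.bat"), "w") as f:
    f.write("@set SPARK_HOME=\n")
    f.write("@set HADOOP_HOME=\n")

A feedstock could ship equivalent scripts directly, so users would not have to create them by hand.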
Thanks for the openjdk tip, I was puzzled initially. Could this be solved with "variants", like MKL/OpenBLAS for libblas? https://conda-forge.org/docs/maintainer/knowledge_base/#switching-blas-implementation, https://github.com/conda-forge/blas-feedstock/blob/main/recipe/meta.yaml
Solution to issue cannot be found in the documentation.
Issue
I have been unable to get conda-forge pyspark working out of the box, and have spent a couple of days figuring out what's going wrong. I am not versed enough to make a PR myself, nor confident enough that this problem affects everyone rather than just my setup to merit one. Regardless, I hope the info I put here is useful to the devs, or at least to people like me who are having trouble getting it working.
My process for installing pyspark locally:
There are four main issues:
1. java is not listed as a dependency for pyspark, which results in a "java not found" error when launching pyspark.
2. winutils.exe is missing from SPARK_HOME (C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env\Lib\site-packages\pyspark\bin). This results in a WARNING when pyspark is run in a shell ("Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries").
3. findspark is required to link python to spark. Without first running import findspark; findspark.init(), some pyspark commands throw the error "Python worker failed to connect back" (see the sketch after this list).
4. spark version 2.4 (installed with pyspark) has a bug that breaks on Windows, resulting in a ModuleNotFoundError for "resource" when some pyspark commands are used.
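To illustrate point 3, a minimal sketch of the findspark workaround (findspark essentially exports SPARK_HOME and puts pyspark on sys.path/PYTHONPATH before anything Spark-related is imported):

# Sketch of the findspark workaround referenced in point 3.
import findspark

findspark.init()  # or findspark.init("C:/path/to/spark") for a custom install

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1,)], schema=["id"]).show()
spark.stop()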
I'm happy to elaborate or provide clearer errors/steps as needed.
Installed packages
Environment info