Open Nuglar opened 2 years ago
Hi @Nuglar :-) Which versions of the pyspark package did you try, and with which versions did you experience the trouble? You only tested on the Windows platform, right?
I extracted four major points from your text:
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()
22/07/25 13:24:28 WARN Utils: Your hostname ...
22/07/25 13:24:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/07/25 13:24:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> spark.createDataFrame([(1,)], schema=['id']).show()
+---+
| id|
+---+
| 1|
+---+
Furthermore, I do not understand why you would need the findspark package, as $CONDA_PREFIX/lib/python3.8/site-packages/pyspark/find_spark_home.py is already packaged with pyspark.
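For what it's worth, here is a minimal sketch showing that the bundled helper is enough to resolve SPARK_HOME without findspark (note that _find_spark_home is a private function inside pyspark/find_spark_home.py, so its name is an implementation detail that may change between releases):

# Sketch: confirm that the conda-forge pyspark package resolves SPARK_HOME on
# its own, using the private helper from pyspark/find_spark_home.py.
from pyspark.find_spark_home import _find_spark_home
print(_find_spark_home())  # typically .../site-packages/pyspark

# The same lookup happens implicitly when a SparkSession is built:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)
spark.stop()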
- Java/OpenJDK missing as dependency -> I am not a fan of depending on openjdk, as people might want to use any other Java flavor; depending on the packaged openjdk would then be quite annoying. As the error message is quite clear, I would leave it as is and trust that people know what to do.
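For reference, a minimal sketch of how a user could check up front whether any JVM is reachable at all (standard library only; the exact launcher error text varies between pyspark versions):

# Sketch: check whether a java executable is reachable before starting pyspark.
# Looks at JAVA_HOME first, then falls back to whatever "java" is on PATH.
import os
import shutil

java_home = os.environ.get("JAVA_HOME")
if java_home:
    java = shutil.which("java", path=os.path.join(java_home, "bin"))
else:
    java = shutil.which("java")

if java:
    print(f"java executable: {java}")
else:
    print("no java found; install openjdk or point JAVA_HOME at a JDK")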
I have to say I disagree quite strongly - the package is not functional without openjdk, and we should ship it.
Regarding different java flavours, that's definitely not the default, and should IMO be opt-out, e.g. there could be an output pyspark-base or pyspark-no-openjdk that provides the package without openjdk.
Regarding wanting to use different Java versions, it's possible to package against several Java versions (cf. pyjnius). In the case I just encountered, the error message is not clear, particularly if the machine already has a JAVA_HOME that's somehow populated.
I have to say I disagree quite strongly - the package is not functional without openjdk, and we should ship it.
This is exacerbated by the fact that the new Java 17 LTS is only compatible with the very recent pyspark >=3.3, and anyone installing an older pyspark and doing "just add openjdk" will run into another failure mode.
Will this ever be considered for implementation? It's a PITA to get pyspark running on Windows.
A system environment variable called HADOOP_HOME (system-wide, not per-user, despite my Miniconda being installed for the local user only) had to be created and set to the same value as SPARK_HOME (obviously this won't work when switching virtual environments, but you get the idea).
Especially painful is needing to change these environment variables in the system settings every time you want to use a new environment, and copying winutils and hadoop into the newly installed pyspark instance.
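For what it's worth, until this is handled by the package itself, the variables can also be set per process rather than system-wide; a sketch (this assumes winutils.exe has already been copied into the pyspark bin directory, which the package does not do for you):

# Sketch of a per-process workaround: export the Hadoop-related variables
# before the JVM is launched, instead of editing the Windows system settings.
# winutils.exe must already be present in %HADOOP_HOME%\bin for this to help.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
os.environ.setdefault("SPARK_HOME", spark_home)
os.environ.setdefault("HADOOP_HOME", spark_home)

spark = pyspark.sql.SparkSession.builder.getOrCreate()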
Will this ever be considered for implementation?
If it doesn't break the average use-case, then of course it will be considered! PRs are always welcome. :)
Especially painful is needing to change these environment variables in the system settings every time you want to use a new environment, and copying winutils and hadoop into the newly installed pyspark instance.
That's possible to fix with activation scripts.
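A sketch of how that could look for this case, assuming conda's standard hook locations (every *.bat in etc\conda\activate.d runs on "conda activate"; the file name pyspark-vars.bat is just an example):

# Sketch: write activation/deactivation hooks into the active environment so
# SPARK_HOME and HADOOP_HOME follow the environment instead of living in the
# Windows system settings. Run once inside the activated environment.
import os
import pyspark

prefix = os.environ["CONDA_PREFIX"]
spark_home = os.path.dirname(pyspark.__file__)

activate_d = os.path.join(prefix, "etc", "conda", "activate.d")
deactivate_d = os.path.join(prefix, "etc", "conda", "deactivate.d")
os.makedirs(activate_d, exist_ok=True)
os.makedirs(deactivate_d, exist_ok=True)

with open(os.path.join(activate_d, "pyspark-vars.bat"), "w") as f:
    f.write(f'@set "SPARK_HOME={spark_home}"\n')
    f.write(f'@set "HADOOP_HOME={spark_home}"\n')

with open(os.path.join(deactivate_d, "pyspark-vars.bat"), "w") as f:
    f.write("@set SPARK_HOME=\n")
    f.write("@set HADOOP_HOME=\n")

A feedstock could ship equivalent scripts directly, so users would not have to create them by hand.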
Thanks for the openjdk tip, I was puzzled initially. Could this be solved with "variants", like MKL/OpenBLAS for libblas? https://conda-forge.org/docs/maintainer/knowledge_base/#switching-blas-implementation, https://github.com/conda-forge/blas-feedstock/blob/main/recipe/meta.yaml
Solution to issue cannot be found in the documentation.
Issue
I have been unable to get conda-forge pyspark working out of the box, and have spent a couple of days figuring out what's going wrong. I am not versed enough to make a PR myself, nor confident enough that this problem affects everyone rather than just my setup to merit one. Regardless, I hope the info I put here is useful to the devs, or at least to people like me who are having trouble getting it working.
My process for installing pyspark locally:
There are four main issues:
1. java is not listed as a dependency for pyspark, which results in a "java not found" error when launching pyspark.
2. winutils.exe is missing from SPARK_HOME (C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env\Lib\site-packages\pyspark\bin). This results in a WARNING when pyspark is run in a shell ("Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries").
3. findspark is required to link python to spark. Without first running import findspark; findspark.init(), some pyspark commands throw the error "Python worker failed to connect back" (see the sketch after this list).
4. spark version 2.4 (installed with pyspark) has a bug that breaks on Windows, resulting in a ModuleNotFoundError for "resource" when some pyspark commands are used.
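To illustrate point 3, a minimal sketch of the findspark workaround (findspark essentially exports SPARK_HOME and puts pyspark on sys.path/PYTHONPATH before anything Spark-related is imported):

# Sketch of the findspark workaround referenced in point 3.
import findspark

findspark.init()  # or findspark.init("C:/path/to/spark") for a custom install

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1,)], schema=["id"]).show()
spark.stop()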
I'm happy to elaborate or provide clearer errors/steps as needed.
Installed packages
Environment info