AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

import sklearn fails #13

Closed divayjindal95 closed 4 years ago

divayjindal95 commented 5 years ago

Hi, I am using your project as a reference for writing ETL apps with PySpark. My app simply imports sklearn, and I am running spark-submit locally.

jobs/etl_job.py fails with the following error:


Traceback (most recent call last):
  File "/Users/divay/Documents/pyspark-example-project/jobs/etl_job.py", line 41, in <module>
    import sklearn
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__init__.py", line 63, in <module>
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__check_build/__init__.py", line 46, in <module>
  File "/private/var/folders/3q/zy4z346d5dv7g6r9q_f0z2jw0000gp/T/pip-install-fe4mp1/scikit-learn/sklearn/__check_build/__init__.py", line 26, in raise_build_error
OSError: [Errno 20] Not a directory: '/Users/divay/Documents/pyspark-example-project/packages.zip/sklearn/__check_build'
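
For context, this project's convention (assuming the standard `build_dependencies.sh` workflow) is to zip pip-installed dependencies into `packages.zip` and ship that archive with `--py-files`, which is how sklearn ended up inside the zip. A hypothetical reproduction of the submit step, with illustrative paths, might look like:

```shell
# Sketch of the packages.zip workflow (paths are illustrative):
# 1. install dependencies into a local folder
pip install -r requirements.txt -t ./packages

# 2. zip them up for distribution to the cluster
cd packages && zip -9 -r ../packages.zip . && cd ..

# 3. submit the job with the archive on the PYTHONPATH of every executor
spark-submit \
  --master local[*] \
  --py-files packages.zip \
  jobs/etl_job.py
```

Python can import pure-Python modules directly from a zip archive on `sys.path`, but compiled extension modules (the `.so` files under `sklearn/__check_build`) cannot be loaded from inside a zip, which is what surfaces as the `OSError` in the traceback.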
AlexIoannides commented 4 years ago

Sorry for the late reply - I've only just seen this for some reason.

If you haven't already figured it out, the problem is that you can't package scikit-learn this way - you'll have to install it manually on every node of the cluster, because it depends on NumPy, which requires C code to be compiled locally on each node...
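
Two common workarounds can be sketched as follows (environment names and paths are illustrative, not from the original thread): either install the compiled packages on every node, or ship a self-contained packed environment alongside the job so the binaries travel with it.

```shell
# Option 1: install compiled dependencies on each cluster node
# (run on every node, or bake into the node image/provisioning):
pip install numpy scikit-learn

# Option 2: pack a conda environment and ship it with the job
# (requires the conda-pack tool; names here are illustrative):
conda create -y -n etl_env python=3.7 numpy scikit-learn
conda pack -n etl_env -o environment.tar.gz

# On YARN, --archives unpacks the tarball on each executor, and the
# job is pointed at the Python interpreter inside it:
spark-submit \
  --master yarn \
  --archives environment.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  jobs/etl_job.py
```

The packed-environment approach sidesteps the zip limitation entirely, because the extracted archive is a real directory on each node's local disk, so compiled `.so` files load normally.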