Cyb3rWard0g / HELK

The Hunting ELK
GNU General Public License v3.0

SparkException: ModuleNotFoundError: No module named 'numpy' #226

jamessantiago closed this issue 5 years ago

jamessantiago commented 5 years ago

I've got a simple notebook set up with HELK that pulls some data from Elasticsearch via PySpark SQL and puts it into an RDD of vectors. When I try to send this data to an ML job, I run into an error. I'm running:

from pyspark.mllib.clustering import KMeans
clusters = KMeans.train(data, 5, maxIterations=10, runs=1, initializationMode="random")

I get the error:

ModuleNotFoundError: No module named 'numpy'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
...

I want to go into the Spark worker and install the numpy module manually, but I don't know the sparkuser or root password. There is some background on this here:

https://stackoverflow.com/questions/35214231/importerror-no-module-named-numpy-on-spark-workers#

So what are the credentials for the helk-spark-worker container?

jamessantiago commented 5 years ago

Well, OK, looks like I just need to learn Docker more.

Consoling into the container using the "-u root" switch got me into the worker container with privileges. I then ran a quick apt-get update followed by apt-get install python-numpy to load the module I needed. However, that still doesn't get me past the numpy-not-found error, so I'm not sure where and how I should be installing the module to get this job working. Installing numpy for python3 via pip or apt-get doesn't seem to do the trick either.

jamessantiago commented 5 years ago

Looks like I just needed to install the module specifically for Python 3.7, like so: python3.7 -m pip install numpy
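
For anyone hitting the same error, here is a minimal sketch of how one might check which interpreter is actually in play and whether it can see numpy. It fits the fix above, where the module had to be installed for python3.7 specifically rather than for whatever python3 or pip pointed to. The helper name probe_module is hypothetical, and the commented-out executor line assumes the notebook already has a SparkContext named sc, which is not shown in this thread.

```python
import sys


def probe_module(name):
    """Report the running interpreter and whether `name` is importable in it."""
    # Imports live inside the function so it stays self-contained when
    # Spark serializes it and ships it to an executor.
    import sys
    from importlib.util import find_spec

    return {
        "executable": sys.executable,
        "python_version": "%d.%d" % sys.version_info[:2],
        "module": name,
        "found": find_spec(name) is not None,
    }


# Driver side: the interpreter behind the notebook kernel.
print(probe_module("numpy"))

# Executor side (assumption: an existing SparkContext named `sc`):
# print(sc.parallelize(range(2), 2).map(lambda _: probe_module("numpy")).collect())
```

If the executor side reports "found": False, the "executable" and "python_version" fields show which interpreter pip needs to be run against inside the worker container.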