databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

How to use "sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh " #60

Open RayTsui opened 7 years ago

RayTsui commented 7 years ago

I followed the instructions: I downloaded the project, ran build/sbt assembly, and then executed python/run-tests.sh, but it gives me the following output:

List of assembly jars found, the last one will be used:
ls: /Users/lei.cui/Documents/Workspace/DeepLearninginApacheSpark/spark-deep-learning-master/python/../target/scala-2.12/spark-deep-learning-assembly*.jar: No such file or directory

============= Searching for tests in: /Users/lei.cui/Documents/Workspace/DeepLearninginApacheSpark/spark-deep-learning-master/python/tests =============
============= Running the tests in: /Users/lei.cui/Documents/Workspace/DeepLearninginApacheSpark/spark-deep-learning-master/python/tests/graph/test_builder.py =============
/usr/local/opt/python/bin/python2.7: No module named nose

Actually, after the sbt build, it produces scala-2.11/spark-deep-learning-assembly*.jar instead of scala-2.12/spark-deep-learning-assembly*.jar. In addition, I installed python2 at /usr/local/bin/python2, so why does it report /usr/local/opt/python/bin/python2.7: No module named nose?
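
For reference, these are the checks I ran on my machine afterwards (plain shell commands, nothing taken from the project scripts):

    ls target/scala-*/spark-deep-learning-assembly*.jar   # shows which Scala version the jar was actually built for
    which python2                                         # the interpreter I expected the tests to use
    python2 -c "import nose"                              # checks whether nose is visible to that interpreter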

RayTsui commented 7 years ago

Actually, I am not sure how to use "sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh". Can it be executed at the command line? When I try, it gives "sparkdl$: command not found".

allwefantasy commented 7 years ago

sparkdl$ means your current directory is the spark-deep-learning project. SPARK_HOME is needed by pyspark; SCALA_VERSION and SPARK_VERSION are used to locate the spark-deep-learning-assembly*.jar.

./python/run-tests.sh will set up the environment, find all the .py files in python/tests, and run them one by one.

You should run build/sbt assembly first to make sure the assembly jar is ready, then run SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh
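
Roughly, the jar lookup from those variables works along these lines (a sketch based on the error message above, not the exact code of run-tests.sh):

    # Derive the Scala binary version and glob for the assembly jar under target/
    scala_major="${SCALA_VERSION%.*}"                  # e.g. 2.11.8 -> 2.11
    jar_dir="./target/scala-${scala_major}"
    assembly_jar="$(ls "${jar_dir}"/spark-deep-learning-assembly*.jar 2>/dev/null | tail -n 1)"
    echo "Using assembly jar: ${assembly_jar}"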

phi-dbq commented 7 years ago

@RayTsui thank you for reporting the issue! @allwefantasy thank you for helping out! In addition, we also have some scripts/sbt-plugins we use to facilitate the development process, which we put in https://github.com/databricks/spark-deep-learning/pull/59. You can try running SPARK_HOME="path/to/your/spark/home/directory" ./bin/totgen.sh, which will generate pyspark (.py2.spark.shell, .py3.spark.shell) and spark-shell (.spark.shell) REPLs.
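
For example (assuming the generated wrappers end up in the project root as executable scripts; adjust SPARK_HOME to your own installation):

    SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 ./bin/totgen.sh
    ./.py2.spark.shell    # pyspark REPL on Python 2
    ./.spark.shell        # Scala spark-shell REPL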

RayTsui commented 7 years ago

@allwefantasy Thanks a lot for your answer. As for the command "SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh", I have a few doubts:

1) Is the value of each variable fixed and common to all environments, or do I need to set it based on my current environment? I installed Spark via "brew install apache-spark" instead of downloading the prebuilt Spark-with-Hadoop distribution (e.g., spark-2.1.1-bin-hadoop2.7). Are the Scala and Spark version numbers also based on my environment?

2) Do I need to set the variables "SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1" in ~/.bash_profile, or can I directly run "SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh" at the prompt?

3) After some tentative attempts, I still came across the errors above.

If you have any suggestions, they would help me a lot.

RayTsui commented 7 years ago

@phi-dbq Thanks a lot for your response. I will try what you suggested and give feedback.

allwefantasy commented 7 years ago
1. Make sure the dependencies in the following list are installed:
# This file should list any python package dependencies.
coverage>=4.4.1
h5py>=2.7.0
keras==2.0.4 # NOTE: this package has only been tested with keras 2.0.4 and may not work with other releases
nose>=1.3.7  # for testing
numpy>=1.11.2
pillow>=4.1.1,<4.2
pygments>=2.2.0
tensorflow==1.3.0
pandas>=0.19.1
six>=1.10.0
kafka-python>=1.3.5
tensorflowonspark>=1.0.5
tensorflow-tensorboard>=0.1.6

Or you can just run this command to install them all:

 pip2 install -r python/requirements.txt

2. Keep PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 unchanged. As I mentioned, these variables are just for locating the assembly jar. The only one you should set for your own environment is SPARK_HOME. I suggest you do not configure them in .bashrc, which may have side effects on your other programs. See the example after this point.
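
For example, you can set them inline for a single run, so nothing needs to be added to ~/.bash_profile or ~/.bashrc:

    # Inline assignments apply only to this one command
    SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 \
    SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh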

3. Run the commands in the following steps:

step 1:

      build/sbt assembly

Then you should find spark-deep-learning-assembly-0.1.0-spark2.1.jar in your-project/target/scala-2.11.

step 2:

 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh 

Also, you can specify a target file to run instead of all the files, which take almost 30 minutes in total. Like this:

 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh  /Users/allwefantasy/CSDNWorkSpace/spark-deep-learning/python/tests/transformers/tf_image_test.py

RayTsui commented 7 years ago

@allwefantasy
Hi, I really appreciate your explanation. I understood it, tried again, and made a lot of progress; at least the unit tests now produce a coverage report:

Name                                   Stmts   Miss  Cover
----------------------------------------------------------
sparkdl/graph/__init__.py                  0      0   100%
sparkdl/graph/utils.py                    81     64    21%
sparkdl/image/__init__.py                  0      0   100%
sparkdl/image/imageIO.py                  94     66    30%
sparkdl/transformers/__init__.py           0      0   100%
sparkdl/transformers/keras_utils.py       13      7    46%
sparkdl/transformers/param.py              46     26    43%
----------------------------------------------------------
TOTAL                                     234    163    30%

But there still exists an error, as follows:

ModuleNotFoundError: No module named 'tensorframes'

I guess that tensorframes officially supports Linux 64-bit, but right now I am using macOS; is that the issue?

thunterdb commented 7 years ago

Hello @RayTsui, I have no problem using OSX for development purposes. Can you first run:

build/sbt clean

followed by:

build/sbt assembly

You should see a line that reads: [info] Including: tensorframes-0.2.9-s_2.11.jar. This indicates that tensorframes is properly included in the assembly jar, and that your problem is rather that the proper assembly jar cannot be found.
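
If you want to double-check the contents of the assembly directly, something like this should work (the jar path below is an assumption based on the build output in this thread; adjust it to whatever sbt actually produced for you):

    # List the entries of the assembly jar and look for tensorframes classes
    jar tf target/scala-2.11/spark-deep-learning-assembly-0.1.0-spark2.1.jar | grep -i tensorframes | head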

RayTsui commented 6 years ago

@thunterdb Thanks a lot for your suggestions. I ran the commands, and yes, I can see [info] Including: tensorframes-0.2.8-s_2.11.jar. And as you said, my issue is "List of assembly jars found, the last one will be used: ls: $DIR/spark-deep-learning-master/python/../target/scala-2.11/spark-deep-learning-assembly*.jar: No such file or directory"

I suppose that all related jars are packaged into spark-deep-learning-assembly.jar, but my jar is generated at "$DIR/spark-deep-learning-master/target/scala-2.11/spark-deep-learning-master-assembly-0.1.0-spark2.1.jar" instead of "$DIR/spark-deep-learning-master/python/../target/scala-2.11/spark-deep-learning-assembly.jar". I tried to modify that part of run-tests.sh, but it does not work.

Do you know how to locate the spark-deep-learning-master-assembly-0.1.0-spark2.1.jar?
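
For now I am considering a workaround like the following (just my guess, assuming run-tests.sh globs for spark-deep-learning-assembly*.jar under target/scala-2.11):

    # Symlink the produced jar to the name the test script expects
    cd spark-deep-learning-master
    ln -s "$PWD/target/scala-2.11/spark-deep-learning-master-assembly-0.1.0-spark2.1.jar" \
          "$PWD/target/scala-2.11/spark-deep-learning-assembly-0.1.0-spark2.1.jar"
    # or rename the checkout directory to spark-deep-learning and run build/sbt assembly again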