Hi Everyone,
The goal of this workshop is to show you how to use PySpark for customer analytics. Our target audience is Spark and analytics beginners who want to get familiar with PySpark and data science. If you are already an avid Spark enthusiast and have data for breakfast every day, you might find this workshop too basic!
During this workshop, we will solve two tasks.
During the workshop (and after some mandatory pizza and drinks; nobody can code on an empty stomach!), we will first describe each exercise and then let you solve it. If you have any doubts or get stuck, our colleagues will help you keep going and give you some invaluable advice (don't hesitate to approach them!). After some time, we will discuss the results together and explore different solutions.
Since we have limited time to finish the workshop, please make sure that you have everything from the installation section below set up and working before we start.
One of the most important tasks before doing any data analysis is getting to know the data we are working with. For this workshop we have chosen a small open marketing dataset from a Portuguese bank. The customers in the dataset were approached over the phone about a new product offered by the bank (a term deposit). There is a description of the dataset on this UCI repository page. We have added a cell in the tasks that downloads the data in Parquet format from the server we've put together for this workshop, so there is no need for you to download the data from the UCI repository.
The dataset contains information about the clients' demographics and whether they subscribed to the new product or not.
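To give you a taste of what's ahead, here is a minimal sketch of loading and inspecting the data with PySpark once everything is installed. The file name bank.parquet and the label column y are assumptions based on the UCI description; the actual path comes from the download cell in the task notebooks:

# Hypothetical file name -- the download cell in the tasks provides the real path.
df = spark.read.parquet("bank.parquet")

# First look: schema, size, and a few rows.
df.printSchema()
print(df.count())
df.show(5)

# How many clients subscribed to the term deposit? The label column
# name ("y") follows the UCI description and is an assumption here.
df.groupBy("y").count().show()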
This section describes how to install PySpark and the Zeppelin notebook on Windows, macOS and Linux. If you are already using them, you can skip this section.
Please make sure you have Homebrew installed on your machine. You will also need Xcode (you can download it from the App Store), along with the command line tools, which you can install with:
xcode-select --install
Before installing Spark and Zeppelin, you must have Python installed on your computer. Python 3.6 can cause problems depending on the PySpark version you have on your machine, so we will use Python 3.5 in this workshop.
If you already have brew on your system, just type:
brew install pyenv
pyenv install 3.5.5
This installs Python in $HOME/.pyenv/versions/3.5.5. Now set the Python version you want PySpark to use (you may also want to add this line to your shell profile, e.g. ~/.bash_profile, so it persists across sessions):
export PYSPARK_PYTHON=$HOME/.pyenv/versions/3.5.5/bin/python3.5
Task 2.json uses the requests library to upload the model scores to our ranking server. Please install it using pip:
pip3 install --upgrade requests
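For reference, the upload that Task 2 performs boils down to a single HTTP POST. The following is only a sketch with a hypothetical URL and payload shape; the real endpoint and format are defined in the task notebook:

import requests

# Hypothetical endpoint and payload shape -- the real ones are in Task 2.json.
scores = {"team": "my-team", "scores": [0.12, 0.87, 0.45]}
response = requests.post("http://workshop.example.com/rank", json=scores)

# Anything other than a 2xx status code means the upload failed.
print(response.status_code, response.text)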
Spark itself is available on brew as apache-spark. Update brew and install it:
brew update
brew upgrade
brew install apache-spark
Once the installation is finished, you should be able to run the PySpark shell:
pyspark
Please check which version of Python the shell is using. Take a look at the first line that appears after typing pyspark:
Python 3.5.5 (...)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2018-03-04 19:24:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-03-04 19:24:07 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-03-04 19:24:07 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2018-03-04 19:24:07 WARN Utils:66 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Python version 3.5.5 (default, Mar 4 2018 19:16:41)
SparkSession available as 'spark'.
>>>
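Before moving on, it is worth checking that the shell can actually run a job. Here is a minimal smoke test you can paste at the >>> prompt (both spark and sc are predefined in the PySpark shell):

spark.range(100).count()            # should return 100
sc.parallelize([1, 2, 3, 4]).sum()  # should return 10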
If you need further explanation, you can follow the instructions here.
Zeppelin can also be found on brew. Install it by typing:
brew install apache-zeppelin
Alternatively, you can download the latest version of Zeppelin here.
When Zeppelin is installed with brew, you can launch it from anywhere with the command:
zeppelin-daemon.sh start
To stop Zeppelin, type:
zeppelin-daemon.sh stop
If you downloaded the Zeppelin archive, uncompress it and jump into its folder. Then type:
./bin/zeppelin-daemon.sh start
Once the daemon is running, you'll be able to access the notebook by opening http://localhost:8080 in your browser. It usually takes a few minutes to start up, so don't rush it!
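Once the notebook UI is up, you can create a new note and verify that the Spark interpreter is wired up correctly. In Zeppelin, paragraphs that should run under the PySpark interpreter start with the %pyspark directive:

%pyspark
# Print the Spark and Python versions the interpreter is using;
# the Python version should be the 3.5.x you configured above.
import sys
print(spark.version)
print(sys.version)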
You need to have Java 8 and Scala installed (Spark 2.3 no longer supports Java 7). You can check your Java version with:
$ java -version
In case your Java version does not match the requirements, you can install version 8 with the following:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Scala is another requirement for PySpark. You can download the latest version of Scala from the official Scala archive. Alternatively, you can run the following commands:
$ wget http://www.scala-lang.org/files/archive/scala-2.12.5.deb
$ sudo dpkg -i scala-2.12.5.deb
$ rm scala-2.12.5.deb
To check that the installation was successful, just run:
$ scala -version
A virtual environment is recommended for installing all the Python dependencies. You can use the environment manager of your choice, but make sure to use Python 3.5, as the latest 3.6 will probably not work.
Once the virtual environment is ready, install all the packages with pip:
(venv) $ pip install py4j
(venv) $ pip install pyspark
When it's done, check the installation:
(venv) $ pyspark
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-03-24 10:43:03 WARN Utils:66 - Your hostname, brainBox resolves to a loopback address: 127.0.1.1; using 192.168.0.158 instead (on interface wlp58s0)
2018-03-24 10:43:03 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-03-24 10:43:03 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-03-24 10:43:04 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Python version 3.5.2 (default, Nov 23 2017 16:37:01)
SparkSession available as 'spark'.
>>>
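As on macOS, a quick smoke test at the >>> prompt confirms the installation works end-to-end:

spark.range(100).count()  # should return 100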
Task 2.json uses the requests library to upload the model scores to our ranking server. Please install it using pip:
pip3 install --upgrade requests
You can download the Zeppelin Notebook from the Apache Zeppelin download page (the binary package with the Spark interpreter will do). From the command line, just download the Zeppelin archive and extract it:
$ wget http://ftp.heanet.ie/mirrors/www.apache.org/dist/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-netinst.tgz
$ tar xvf zeppelin-0.7.3-bin-netinst.tgz
Once the archive is available, run the Zeppelin Notebook like this (remember to run Zeppelin with the virtual environment activated!):
(venv) $ cd zeppelin-0.7.3-bin-netinst/bin
(venv) $ ./zeppelin-daemon.sh start
It will take a while for Zeppelin to actually start. You can check whether it's online by opening http://localhost:8080 in a browser, or you can use the same script to query its status:
(venv) $ ./zeppelin-daemon.sh status
When you're done with the notebook, just run:
(venv) $ ./zeppelin-daemon.sh stop