In this code pattern, historical shopping data is analyzed with Spark and PixieDust. The data is loaded, cleaned, and then analyzed by creating various charts and maps.
When you have completed this code pattern, you will understand how to:
The intended audience is anyone interested in quickly analyzing data in a Jupyter notebook.
Log in to IBM's Watson Studio. Once in, you'll land on the dashboard.
Create a new project by clicking + New project and choosing Data Science.
Enter a name for the project and click Create.
NOTE: By creating a project in Watson Studio, a free tier Object Storage service and a Watson Machine Learning service will be created in your IBM Cloud account. Select the Free storage type to avoid fees.
Upon successful project creation, you are taken to a dashboard view of your project. Take note of the Assets and Settings tabs; we'll be using them to associate the project with external assets (datasets and notebooks) and IBM Cloud services.
From the new project Overview panel, click + Add to project on the top right and choose the Notebook asset type.
Fill in the following information:
Select the From URL tab. [1]
Enter a Name for the notebook and optionally a description. [2]
Under Notebook URL, provide the following URL: https://raw.githubusercontent.com/IBM/analyze-customer-data-spark-pixiedust/master/notebooks/analyze-customer-data.ipynb [3]
For Runtime, select the Spark Python 3.6 option. [4]
Click the Create button.
TIP: Once successfully imported, the notebook should appear in the Notebooks section of the Assets tab.
Run the cells one at a time. Select the first cell and press the (►) Run button to start stepping through the notebook.
Load the data set customers_orders1_opt.csv into the notebook.
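As a rough illustration of the loading step, the sketch below reads the CSV into a Spark DataFrame. It assumes that a `spark` session is already available in the notebook (as is typical on a Spark runtime), that the file has a header row, and that it is reachable at the given path. The function name `load_customers` is hypothetical, and the notebook itself may load the data differently.

```python
def load_customers(spark, path="customers_orders1_opt.csv"):
    """Read the customer orders CSV into a Spark DataFrame.

    Assumptions (not confirmed by the notebook): the file has a header
    row and is reachable at `path` from the Spark runtime.
    """
    # header=True uses the first row as column names;
    # inferSchema=True lets Spark guess column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)
```

In a notebook cell this would be called as `df = load_customers(spark)`.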
Before the data can be analyzed, it needs to be cleaned and formatted. This can be done with a few PySpark commands:
Select only the columns you are interested in with df.select().
Convert the AGE column to a numeric data type with a user-defined function (udf), so you can run calculations on customer age.
Derive the gender of each customer from the salutation with a second udf, and rename the GenderCode column to GENDER.
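The three cleaning steps above could look roughly like the following sketch. The column names AGE, GenderCode, and GENDER come from the text; everything else is an assumption for illustration, including the helper names, the salutation values in `MALE` and `FEMALE`, and the "Male"/"Female"/"Unknown" labels. Check the distinct salutations in your own data before relying on the mapping.

```python
def to_int(value):
    """Parse an age such as "35" or "35.0" into an int; None if it cannot be parsed."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


# Assumed salutation values (hypothetical); adjust to match the data set.
MALE = {"mr", "master"}
FEMALE = {"mrs", "ms", "miss"}


def derive_gender(salutation):
    """Map a salutation such as "Mr." to a gender label."""
    if salutation is None:
        return "Unknown"
    key = salutation.strip().rstrip(".").lower()
    if key in MALE:
        return "Male"
    if key in FEMALE:
        return "Female"
    return "Unknown"


def clean_customers(df, columns=("AGE", "GenderCode")):
    """Select columns, convert AGE to an integer, and derive GENDER."""
    # Imported here so the plain-Python helpers above work without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType

    to_int_udf = udf(to_int, IntegerType())
    gender_udf = udf(derive_gender, StringType())

    return (
        df.select(*columns)
        .withColumn("AGE", to_int_udf("AGE"))
        .withColumn("GenderCode", gender_udf("GenderCode"))
        .withColumnRenamed("GenderCode", "GENDER")
    )
```

Pass any additional columns you want to keep through the `columns` argument. Keeping the parsing and mapping logic in plain Python functions makes it easy to check them on a few values before wrapping them as udfs.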
The data can now be explored with PixieDust:
Use display() to explore the data in a table.
Then click the chart button to create one of the charts in the list.
Drag and drop the variables you want to display into the Keys and Values fields. Select the aggregation from the drop-down menu and click OK.
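To get a feel for what the Keys, Values, and aggregation choices mean, the sketch below computes a comparable summary directly in PySpark: group by the key column and aggregate the value column. This is only an approximation of what a chart shows, not PixieDust's own implementation. The function name `aggregate` and the list of supported aggregations are assumptions.

```python
# Aggregation names accepted by this sketch (an assumption, not PixieDust's list).
SUPPORTED = ("sum", "avg", "min", "max", "count")


def aggregate(df, key, value, how="avg"):
    """Group `df` by `key` and aggregate `value`, similar to a chart's Keys/Values."""
    how = how.lower()
    if how not in SUPPORTED:
        raise ValueError("unsupported aggregation: " + how)
    # The dict form of agg() maps a column name to an aggregate function name.
    return df.groupBy(key).agg({value: how})
```

For example, `aggregate(df, "GENDER", "AGE", "avg").show()` would print the average age per gender, assuming the cleaned DataFrame has those two columns.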
From the menu on the right of the chart, you can select which renderer to use; each renderer visualizes the data in a different way. Other options include clustering by a variable, the size and orientation of the chart, and whether to display a legend.
Below are two examples created in the notebook: a bar chart and a map.
Histogram
Map
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.