In this Code Pattern we will use R4ML, a scalable R package, running on IBM Watson Studio to perform various Machine Learning exercises. For those users who are unfamiliar with Watson Studio, it is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data.
When the reader has completed this Code Pattern, they will understand how to:
The intended audience for this Code Pattern is data scientists who wish to perform scalable feature engineering and data exploration.
R4ML provides various out-of-the-box tools, including a preprocessing utility for feature engineering, as well as utilities to sample data and perform exploratory analysis. This Code Pattern provides an end-to-end example to demonstrate the ease and power of R4ML in implementing data preprocessing and data exploration. For more information about additional R4ML functionality, support, documentation, and roadmap, please visit the R4ML project.
This Code Pattern will walk the user through the following conceptual steps:
This Code Pattern consists of the following activities:
Log in or sign up for IBM's Watson Studio.
Note: if you would prefer to skip the remaining Watson Studio set-up steps and just follow along by viewing the completed Notebook, simply:
- View the completed notebooks and their outputs, as is. In this Code Pattern, there are two notebooks. The first notebook is for exploring the data, and the second notebook performs data preprocessing and dimension reduction analysis.
- While viewing the notebook, you can optionally download it to store for future use.
- When complete, continue this Code Pattern by jumping ahead to the Explore and Analyze the Data section.
- Select the New Project option from the Watson Studio landing page and choose the Data Science option.
- Create a new Cloud Object Storage service or select an existing one from your IBM Cloud account.
- Take note of the Assets and Settings tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
- From the Assets tab, click the + New notebook button.
- Note: For this Code Pattern, set the language to R and the Spark version to 2.1.
- Use the From URL tab to specify the URL to the notebook in this repository: https://github.com/IBM/r4ml-on-watson-studio/tree/master/notebooks/R4ML_Introduction_Exploratory_DataAnalysis.ipynb
- Click the Create button.
- Repeat these steps to create the second notebook, which has the URL: https://github.com/IBM/r4ml-on-watson-studio/tree/master/notebooks/R4ML_Data_Preprocessing_and_Dimension_Reduction.ipynb
Run the exploratory notebook first. Once it completes, run the data preprocessing notebook.
When a notebook is executed, each code cell in the notebook is executed in order, from top to bottom.
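Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:
- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which the cell was executed.
- A *, which indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook: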
- Select a cell and press the Play button in the toolbar.
- From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, which starts executing from the first cell under the currently selected cell and then continues executing all cells that follow.
- Press the Schedule button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.

Under the File menu, there are several ways to save your notebook:
- Save will simply save the current state of your notebook, without any version information.
- Save Version will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the Revert To Version menu item.

You can share your notebook by selecting the Share button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a “read-only” version of your notebook. You have several options to specify exactly what you want shared from your notebook:
- Only text and output: will remove all code cells from the notebook view.
- All content excluding sensitive code cells: will remove any code cells that contain a sensitive tag. For example, # @hidden_cell is used to protect your credentials from being shared.
- All content, including code: displays the notebook as is.

Download as options are also available in the menu.

R4ML is an open-source R package from IBM, downloadable via git, that:
- Is built on top of SparkR and Apache SystemML (so it supports features from both)
- Acts as an R bridge between SparkR and Apache SystemML
- Provides a collection of canned algorithms
- Provides the ability to create custom ML algorithms
- Provides both SparkR and Apache SystemML functionality
- Offers APIs that are friendlier to the R user
We will first load the package and data, then perform the initial transformations and feature engineering.
In the exploratory analysis notebook, we will sample the dataset and use R's powerful ggplot2 library to perform various exploratory analyses.
Finally, in the dimension reduction notebook, we will run PCA to reduce the dimensionality of the dataset and select the k components that cover 90% of the variance.
More details are in the notebooks.
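The notebooks do the sampling with R4ML and SparkR. As a rough illustration of the idea only, here is a minimal sketch in plain Python (not the R4ML API) of sampling a dataset down to a fraction of its rows so it can be plotted locally. The function name `sample_rows`, the fixed seed, and the toy data are all assumptions made for this example.

```python
import random


def sample_rows(rows, fraction, seed=42):
    """Keep each row independently with probability `fraction`.

    This is simple Bernoulli sampling, so the sample size is only
    approximately fraction * len(rows). The seed is fixed (an arbitrary,
    hypothetical choice) so that repeated runs return the same sample.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Toy stand-in for a large dataset: 10,000 row identifiers.
rows = list(range(10_000))
sample = sample_rows(rows, fraction=0.1)
print(len(sample))  # roughly 1,000 rows; the exact count depends on the seed
```

Sampling before plotting keeps the exploratory step fast, since only the reduced sample has to be brought into local memory for a plotting library.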
The following screenshot shows the histogram from the exploratory analysis.
The following screenshot shows the correlation between various features from the exploratory analysis.
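For readers who want to see what a feature correlation measures, here is a minimal sketch in plain Python of the Pearson correlation coefficient and a pairwise correlation matrix. It is not the code used in the notebook, which relies on R4ML and ggplot2; the feature names and values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping feature name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical features; the values are made up for this sketch.
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with feature_a
    "feature_c": [5.0, 3.0, 4.0, 1.0, 2.0],   # negatively correlated with feature_a
}
matrix = correlation_matrix(features)
print(round(matrix["feature_a"]["feature_c"], 3))
```

Values near 1 or -1 indicate strongly related features, which are natural candidates for removal or for combining through a technique such as PCA.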
The following screenshot shows the output of the dimensionality reduction using PCA and how only 6 principal components carry 90% of the variance.
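To make the "k components covering 90% of the variance" criterion concrete, here is a minimal sketch in plain Python of how such a k can be chosen once PCA has produced the variance (eigenvalue) of each component. This is not the R4ML implementation; the function name `components_for_variance` and the eigenvalues are hypothetical and do not come from the notebook.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Return the smallest k whose top-k components reach `threshold`
    of the total variance.

    `eigenvalues` holds the variance explained by each principal
    component, in any order.
    """
    if not eigenvalues or any(v < 0 for v in eigenvalues):
        raise ValueError("eigenvalues must be a non-empty list of non-negative numbers")
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    if total == 0:
        raise ValueError("total variance is zero")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Made-up eigenvalues: cumulative shares are 50%, 80%, 95%, 98%, 100%.
k = components_for_variance([5.0, 3.0, 1.5, 0.3, 0.2])
print(k)  # 3
```

The same rule applies to any threshold: raising it keeps more components and preserves more variance, while lowering it yields a more compact representation.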
Awesome job following along! Now go try and take this further or apply it to a different use case!