aloknsingh / ibm_r4ml_biganalytics

Apache License 2.0

Big Data Preparation and Exploration using R4ML

In this Code Pattern we will use R4ML, a scalable R package, running on IBM Watson Studio to perform various machine learning exercises. For those users who are unfamiliar with Watson Studio, it is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark) to collaborate, share, and gather insight from their data.

When the reader has completed this Code Pattern, they will understand how to:

The intended audience for this Code Pattern is data scientists who wish to perform scalable feature engineering and data exploration.

R4ML provides various out-of-the-box tools and a preprocessing utility for feature engineering. It also provides utilities to sample data and do exploratory analysis. This Code Pattern provides an end-to-end example to demonstrate the ease and power of R4ML in implementing data preprocessing and data exploration. For more information about additional R4ML functionality, support, documentation, and roadmap, please visit R4ML.

This Code Pattern will walk the user through the following conceptual steps:

Source of data

Flow

  1. Load the provided notebook into IBM Watson Studio.
  2. The notebook interacts with an Apache Spark instance.
  3. A sample big data dataset is loaded into a Jupyter Notebook.
  4. R4ML, running atop Apache Spark, is used to perform data preprocessing and exploratory analysis.
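To make the preprocessing step in the flow above concrete, here is a minimal sketch of two common preprocessing operations: mean imputation of missing values and standardization. It uses only base R on the built-in `airquality` data set as a stand-in. It does not use R4ML's own API or the notebook's data, so treat it as an illustration of the idea rather than the notebook's actual code.

```r
# Stand-in data (assumption): the built-in airquality data set, which has
# missing values in Ozone and Solar.R. The notebook uses its own data.
df <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# 1. Impute each missing value with the mean of its column.
imputed <- as.data.frame(lapply(df, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

# 2. Standardize every column to zero mean and unit variance.
scaled <- as.data.frame(scale(imputed))

# Inspect the result.
print(colSums(is.na(df)))   # missing values per column before imputation
print(summary(scaled))
```

On data that does not fit in memory, the same steps would run on Spark through R4ML instead of on a local data frame.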

Included Components

Featured Technologies

Steps

This Code Pattern consists of the following activities:

Run Jupyter notebooks in IBM Watson Studio

  1. Sign up for Watson Studio
  2. Create a new Watson Studio project
  3. Create the notebooks
  4. Run the notebooks
  5. Save and Share

1. Sign up for Watson Studio

Log in or sign up for IBM's Watson Studio.

Note: if you would prefer to skip the remaining Watson Studio set-up steps and just follow along by viewing the completed notebooks, simply:

  • View the completed notebooks and their outputs, as is. In this Code Pattern, there are two notebooks. The first notebook is for data exploration, and the second notebook performs data preprocessing and dimension reduction analysis.
  • While viewing the notebook, you can optionally download it to store for future use.
  • When complete, continue this code pattern by jumping ahead to the Explore and Analyze the Data section.

2. Create a new Watson Studio project

3. Create the notebooks

Note: For this Code Pattern, set the language to R and the Spark version to 2.1.

https://github.com/IBM/r4ml-on-watson-studio/tree/master/notebooks/R4ML_Introduction_Exploratory_DataAnalysis.ipynb
https://github.com/IBM/r4ml-on-watson-studio/tree/master/notebooks/R4ML_Data_Preprocessing_and_Dimension_Reduction.ipynb

4. Run the notebooks

Run the exploratory notebook first. Once it completes, run the data preprocessing notebook.

When a notebook is executed, each code cell in the notebook is executed, in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

There are several ways to execute the code cells in your notebook:

5. Save and Share

How to save your work:

Under the File menu, there are several ways to save your notebook:

How to share your work:

You can share your notebook by selecting the Share button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a “read-only” version of your notebook. You have several options to specify exactly what you want shared from your notebook:

Explore and Analyze the Data

Analysis Section:

Scalable R4ML Key Features:

Content

Sample output

The following screenshot shows a histogram from the exploratory analysis.

Exploratory Analysis Histogram
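For readers who want to see what goes into such a histogram, the sketch below bins a numeric column with base R's `hist()`. The `mtcars` data set and the `mpg` column are stand-ins chosen for illustration. The notebook itself works on its own data through R4ML, so this is not the notebook's code.

```r
# Stand-in data (assumption): fuel economy values from the built-in mtcars set.
values <- mtcars$mpg

# plot = FALSE returns the bin structure without opening a graphics device.
h <- hist(values, breaks = 10, plot = FALSE)

# One row per bin: its lower edge, upper edge, and number of observations.
bins <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bins)
```

Calling `plot(h)` afterwards draws the histogram. On a large data set, a common approach is to draw a sample first and bin the sample rather than the full data.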

The following screenshot shows the correlation between various features in the exploratory analysis.

Exploratory Analysis Correlation between various features
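A correlation view like this one is built from a pairwise correlation matrix. The following sketch computes one with base R's `cor()` and lists the most strongly related feature pairs. The `mtcars` columns are stand-ins for illustration, and the code does not use R4ML's API.

```r
# Stand-in data (assumption): a few numeric columns from the built-in mtcars set.
features <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pearson correlation between every pair of columns.
corr <- cor(features)
print(round(corr, 2))

# Collect each unordered pair once (upper triangle) and rank by |r|.
idx <- which(upper.tri(corr), arr.ind = TRUE)
ranked <- data.frame(
  feature_a = rownames(corr)[idx[, "row"]],
  feature_b = colnames(corr)[idx[, "col"]],
  r = corr[idx]
)
ranked <- ranked[order(-abs(ranked$r)), ]
print(head(ranked, 3))
```

Strongly correlated pairs carry overlapping information, which is one reason dimension reduction can be useful.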

The following screenshot shows the output of dimensionality reduction using PCA: only 6 principal components carry 90% of the information.

Dimension Reduction using PCA
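The "number of components needed for 90%" figure comes from the cumulative share of variance explained by the principal components. Here is a minimal sketch of that calculation using base R's `prcomp()` on the built-in `mtcars` data. Both are stand-ins, so the resulting component count will differ from the notebook's, and this is not R4ML's PCA interface.

```r
# Stand-in data (assumption): all 11 numeric columns of the built-in mtcars set.
features <- mtcars

# Center and scale so that no column dominates because of its units.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Smallest number of components whose cumulative share reaches 90%.
k <- which(cum_var >= 0.90)[1]
cat("Components needed for 90% of the variance:", k, "\n")

# The reduced data set keeps only the first k component scores.
reduced <- pca$x[, seq_len(k), drop = FALSE]
print(dim(reduced))
```

The 90% threshold is a convention rather than a rule. A lower threshold gives a more compact representation at the cost of discarding more variance.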

Awesome job following along! Now go try and take this further or apply it to a different use case!

Links

Learn more

License

Apache 2.0