UI Design
Transform View Changes
In this view, you will see two more features supporting machine learning jobs under [Add]: Model over Stream and Model Training/Persist.
Model over Stream (MOS)
This deploys a trained model on a stream using Spark/Flink/Kafka streaming. In this case, we need to provide the following parameters:
topic for data input
model information (name/id)
topic for model output
Once deployed, this job appears in the [Transform] main view under its specific category.
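The three parameters above could be collected into a small job spec before submission. A minimal Python sketch, where the class name, field names, and validation rule are all illustrative assumptions, not the final API:

```python
# Hypothetical spec for a Model over Stream (MOS) job; every name here is
# an assumption for illustration, not the final API.
from dataclasses import dataclass


@dataclass
class MosJobSpec:
    input_topic: str    # topic for data input
    model_name: str     # model information (name)
    model_id: str       # model information (id)
    output_topic: str   # topic for model output

    def validate(self) -> None:
        # All parameters are required before the streaming job is submitted.
        for field_name in ("input_topic", "model_name", "model_id", "output_topic"):
            if not getattr(self, field_name):
                raise ValueError(f"missing required MOS parameter: {field_name}")


spec = MosJobSpec(input_topic="raw_events", model_name="dt_001",
                  model_id="5f1a", output_topic="scored_events")
spec.validate()  # raises ValueError if any parameter is empty
```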
Model Training/Persist (MTP)
This launches a job to train a target model and persist it. A detailed guided flow (or a customized flow without the guide) will be provided for the following factors:
Training data selection (from a query, Hive table, or raw files) and sampling
Feature selection
Model selection
Parameters necessary for model training
Model validation options
Model persistence options (to HDFS and/or as a UDF)
Option to disable the guide and use the raw Python/Scala API or ML SQL (see Reference)
Once the job is submitted, you'll see it in the [Transform] main view with a specific job type/category. The training job can be resubmitted with different inputs until you are satisfied with the model and would like to persist it. If the persist option is chosen, the trained model is available in the [Model] view immediately.
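The guided factors above could be gathered into a single training-job spec before code generation and submission. A minimal Python sketch, in which every key, default value, and behavior is an illustrative assumption:

```python
# Hypothetical Model Training/Persist (MTP) job spec covering the guided
# factors listed above; all names and defaults are assumptions.
MTP_DEFAULTS = {
    "data_source":  {"type": "hive", "table": None, "query": None, "files": None},
    "sampling":     {"fraction": 1.0, "seed": 42},
    "features":     [],                    # feature selection
    "model":        None,                  # e.g. "DecisionTreeClassifier"
    "model_params": {},                    # parameters for model training
    "validation":   {"method": "train_test_split", "test_fraction": 0.2},
    "persist":      {"hdfs_path": None, "as_udf": False},
    "use_guide":    True,                  # False -> raw Python/Scala API or ML SQL
}


def build_mtp_spec(**overrides):
    """Merge user overrides over the defaults, rejecting unknown keys.

    Note: the merge is shallow, so an override replaces a nested dict wholesale.
    """
    unknown = set(overrides) - set(MTP_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown MTP option(s): {sorted(unknown)}")
    spec = {**MTP_DEFAULTS, **overrides}
    if spec["model"] is None:
        raise ValueError("a model must be selected")
    return spec
```

Resubmitting the training job with different inputs then amounts to calling `build_mtp_spec` again with changed overrides.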
New Model View
This adds a new Model view to the UI, in which we are able to do the following:
List all models
List the jobs that use a specific model
Update/delete existing models
Upload new models from HDFS or a local file
Register a new/existing model as a UDF (after which it can be used from Spark/Hive SQL)
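The Model view operations above can be sketched as a small registry interface. This in-memory Python sketch is illustrative only (the real store would be the MongoDB collection described under Back-end Design), and every method and field name is an assumption:

```python
# Minimal in-memory sketch of the Model view operations; in the real system
# these would be backed by the Model collection in MongoDB.
class ModelRegistry:
    def __init__(self):
        self._models = {}   # model_id -> metadata dict
        self._udfs = set()  # model ids registered as SQL UDFs

    def upload(self, model_id, name, source):
        """Upload a new model from HDFS or a local file."""
        self._models[model_id] = {"name": name, "source": source, "used_by": []}

    def list_models(self):
        """List all models."""
        return list(self._models)

    def jobs_using(self, model_id):
        """List the jobs that use a specific model."""
        return self._models[model_id]["used_by"]

    def update(self, model_id, **fields):
        """Update an existing model's metadata."""
        self._models[model_id].update(fields)

    def delete(self, model_id):
        """Delete an existing model (and its UDF registration, if any)."""
        self._models.pop(model_id)
        self._udfs.discard(model_id)

    def register_udf(self, model_id):
        """After registration, the model is callable from Spark/Hive SQL."""
        if model_id not in self._models:
            raise KeyError(model_id)
        self._udfs.add(model_id)
```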
Back-end Design
We'll mainly leverage the Apache Spark MLlib DataFrame-based API (not the RDD-based one) for this feature. The UI guide will generate proper PySpark code and submit it to Livy/Spark. In terms of new development, we may need the following:
New POPJ created for Model, since we need it in code and in MongoDB
New collection for Model in MongoDB
The [Transform] view UI may need to be updated to display training jobs sensibly
Additional transform type/category will be needed for the new model jobs.
New views/queries are needed for the deployed/persisted models and the UDF views.
Create a code generator for MLlib code (Python or Scala? Use Scala for now).
Spark streaming will be considered for deploying the trained model to live stream.
Identify ways to persist models in PySpark, and how to register a persisted model as a permanent UDF for Spark SQL.
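The code-generator item above amounts to template expansion over a training spec. A minimal Python sketch that emits PySpark source text (the UI guide is described as generating PySpark, even though Scala is the current choice for the generator itself); the function name, spec fields, and template are all assumptions:

```python
# Hypothetical MLlib code generator: turn a training spec into PySpark
# source text ready for submission to Livy/Spark. Template is illustrative.
def generate_training_code(table, algorithm, params, feature_cols, model_path):
    param_src = ", ".join(f"{k}={v!r}" for k, v in params.items())
    cols = ", ".join(repr(c) for c in feature_cols)
    return "\n".join([
        "from pyspark.ml.feature import VectorAssembler",
        f"from pyspark.ml.classification import {algorithm}",
        "",
        f"df = spark.table({table!r})",
        f"assembler = VectorAssembler(inputCols=[{cols}], outputCol='features')",
        "train_df = assembler.transform(df)",
        f"model = {algorithm}({param_src}).fit(train_df)",
        f"model.write().overwrite().save({model_path!r})",
    ])


code = generate_training_code(
    table="hive_book_transact",
    algorithm="DecisionTreeClassifier",
    params={"labelCol": "label", "featuresCol": "features"},
    feature_cols=["fx_rate_indexed"],
    model_path="/tmp/model/dt_001/",
)
```

The final `model.write().overwrite().save(...)` line covers the persist step; making the saved model available as a permanent Spark SQL UDF is the open question noted above.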
For future consideration, we'll cover some enhancements as follows.
There is a proposed ML SQL syntax to ease machine-learning pipeline creation:
train [hive table/view name] as [ML algorithm] at [model path]
where [algorithm parameter]
using [feature set]
Example
train hive_book_transact as DecisionTreeClassifier at '/tmp/model/dt_001/'
where label_col='label' and feature_col='features'
using
StringIndexer(fx_rate, fx_rate_indexed) and
StringIndexer(profit, label) and
VectorAssembler(feature_array, features)
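A front end for the proposed syntax would need to parse the train statement into a spec before generating MLlib code. A minimal Python sketch using a regex over the grammar above; the regex and the returned field names are illustrative assumptions, not a full grammar:

```python
# Hypothetical parser for the proposed
#   train <table> as <algorithm> at '<model path>' where <params> using <stages>
# statement. Regex-based for illustration only.
import re

ML_SQL = re.compile(
    r"train\s+(?P<table>\w+)\s+as\s+(?P<algorithm>\w+)\s+at\s+'(?P<path>[^']+)'"
    r"\s+where\s+(?P<params>.+?)\s+using\s+(?P<features>.+)",
    re.DOTALL | re.IGNORECASE,
)


def parse_train(stmt: str) -> dict:
    m = ML_SQL.match(stmt.strip())
    if not m:
        raise ValueError("not a valid train statement")
    params = {}
    for pair in re.split(r"\s+and\s+", m.group("params")):
        key, value = pair.split("=", 1)
        params[key.strip()] = value.strip().strip("'")
    stages = [s.strip() for s in re.split(r"\s+and\s+", m.group("features"))]
    return {"table": m.group("table"), "algorithm": m.group("algorithm"),
            "path": m.group("path"), "params": params, "stages": stages}


parsed = parse_train(
    "train hive_book_transact as DecisionTreeClassifier at '/tmp/model/dt_001/' "
    "where label_col='label' using StringIndexer(fx_rate, fx_rate_indexed)"
)
```

The parsed spec could then feed the same MLlib code generator used by the guided flow.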
Reference
The following ML engines are considered for the next move in DataFrame-based ML.
Other References
ML SQL Reference