UI Design
Transform View Changes
In this view, you will see two more features supporting machine learning jobs under [Add]: Model over Stream and Model Training/Persist.
Model over Stream (MOS)
This deploys a trained model on a stream using Spark/Flink/Kafka streaming. In this case, we need to provide the following parameters:
topic for data input
model information (name/id)
topic for model output
Once deployed, this job appears in the [Transform] main view under its specific category.
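The three parameters above could be collected into a small job spec before submission. A minimal Python sketch, where the class name, field names, and validation rule are all illustrative assumptions, not the final API:

```python
# Hypothetical spec for a Model over Stream (MOS) job; every name here is
# an assumption for illustration, not the final API.
from dataclasses import dataclass


@dataclass
class MosJobSpec:
    input_topic: str    # topic for data input
    model_name: str     # model information (name)
    model_id: str       # model information (id)
    output_topic: str   # topic for model output

    def validate(self) -> None:
        # All parameters are required before the streaming job is submitted.
        for field_name in ("input_topic", "model_name", "model_id", "output_topic"):
            if not getattr(self, field_name):
                raise ValueError(f"missing required MOS parameter: {field_name}")


spec = MosJobSpec(input_topic="raw_events", model_name="dt_001",
                  model_id="5f1a", output_topic="scored_events")
spec.validate()  # raises ValueError if any parameter is empty
```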
Model Training/Persist (MTP)
This launches a job to train a target model and persist it. A detailed guided flow (or a customized flow without the guide) will be provided for the following factors:
Training data selection (from a query, Hive table, or raw files) and sampling
Feature selection
Model selection
Parameters necessary for model training
Model validation options
Model persistence options (to HDFS and/or as a UDF)
Option to disable the guide and use the raw Python/Scala API or ML SQL (see Reference)
Once the job is submitted, you'll see it in the [Transform] main view with a specific job type/category. The training job can be resubmitted with different inputs until you are satisfied with the model and would like to persist it. If the persist option is chosen, the trained model is available in the [Model] view immediately.
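The guided factors above could be gathered into a single training-job spec before code generation and submission. A minimal Python sketch, in which every key, default value, and behavior is an illustrative assumption:

```python
# Hypothetical Model Training/Persist (MTP) job spec covering the guided
# factors listed above; all names and defaults are assumptions.
MTP_DEFAULTS = {
    "data_source":  {"type": "hive", "table": None, "query": None, "files": None},
    "sampling":     {"fraction": 1.0, "seed": 42},
    "features":     [],                    # feature selection
    "model":        None,                  # e.g. "DecisionTreeClassifier"
    "model_params": {},                    # parameters for model training
    "validation":   {"method": "train_test_split", "test_fraction": 0.2},
    "persist":      {"hdfs_path": None, "as_udf": False},
    "use_guide":    True,                  # False -> raw Python/Scala API or ML SQL
}


def build_mtp_spec(**overrides):
    """Merge user overrides over the defaults, rejecting unknown keys.

    Note: the merge is shallow, so an override replaces a nested dict wholesale.
    """
    unknown = set(overrides) - set(MTP_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown MTP option(s): {sorted(unknown)}")
    spec = {**MTP_DEFAULTS, **overrides}
    if spec["model"] is None:
        raise ValueError("a model must be selected")
    return spec
```

Resubmitting the training job with different inputs then amounts to calling `build_mtp_spec` again with changed overrides.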
New Model View
This adds a new Model view to the UI, in which we are able to do the following:
List all models
List the jobs that use a specific model
Update/delete existing models
Upload new models from HDFS or a local file
Register a new/existing model as a UDF (after which it can be used from Spark/Hive SQL)
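The Model view operations above can be sketched as a small registry interface. This in-memory Python sketch is illustrative only (the real store would be the MongoDB collection described under Back-end Design), and every method and field name is an assumption:

```python
# Minimal in-memory sketch of the Model view operations; in the real system
# these would be backed by the Model collection in MongoDB.
class ModelRegistry:
    def __init__(self):
        self._models = {}   # model_id -> metadata dict
        self._udfs = set()  # model ids registered as SQL UDFs

    def upload(self, model_id, name, source):
        """Upload a new model from HDFS or a local file."""
        self._models[model_id] = {"name": name, "source": source, "used_by": []}

    def list_models(self):
        """List all models."""
        return list(self._models)

    def jobs_using(self, model_id):
        """List the jobs that use a specific model."""
        return self._models[model_id]["used_by"]

    def update(self, model_id, **fields):
        """Update an existing model's metadata."""
        self._models[model_id].update(fields)

    def delete(self, model_id):
        """Delete an existing model (and its UDF registration, if any)."""
        self._models.pop(model_id)
        self._udfs.discard(model_id)

    def register_udf(self, model_id):
        """After registration, the model is callable from Spark/Hive SQL."""
        if model_id not in self._models:
            raise KeyError(model_id)
        self._udfs.add(model_id)
```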
Back-end Design
We'll mainly leverage the Apache Spark MLlib DataFrame-based API (not the RDD-based one) for this feature. The UI guide will generate proper PySpark code and submit it to Livy/Spark. In terms of new development, we may need the following:
New POPJ created for Model, since we need it in code and in MongoDB
New collection for Model in MongoDB
The [Transform] view UI may need to be updated to display training jobs sensibly
Additional transform type/category will be needed for the new model jobs.
New views/queries are needed for the deployed/persisted models and the UDF views.
Create a code generator for MLlib code (Python or Scala? Use Scala for now).
Spark streaming will be considered for deploying the trained model to live stream.
Identify ways to persist models in PySpark, and how to register a persisted model as a permanent UDF for Spark SQL.
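The code-generator item above amounts to template expansion over a training spec. A minimal Python sketch that emits PySpark source text (the UI guide is described as generating PySpark, even though Scala is the current choice for the generator itself); the function name, spec fields, and template are all assumptions:

```python
# Hypothetical MLlib code generator: turn a training spec into PySpark
# source text ready for submission to Livy/Spark. Template is illustrative.
def generate_training_code(table, algorithm, params, feature_cols, model_path):
    param_src = ", ".join(f"{k}={v!r}" for k, v in params.items())
    cols = ", ".join(repr(c) for c in feature_cols)
    return "\n".join([
        "from pyspark.ml.feature import VectorAssembler",
        f"from pyspark.ml.classification import {algorithm}",
        "",
        f"df = spark.table({table!r})",
        f"assembler = VectorAssembler(inputCols=[{cols}], outputCol='features')",
        "train_df = assembler.transform(df)",
        f"model = {algorithm}({param_src}).fit(train_df)",
        f"model.write().overwrite().save({model_path!r})",
    ])


code = generate_training_code(
    table="hive_book_transact",
    algorithm="DecisionTreeClassifier",
    params={"labelCol": "label", "featuresCol": "features"},
    feature_cols=["fx_rate_indexed"],
    model_path="/tmp/model/dt_001/",
)
```

The final `model.write().overwrite().save(...)` line covers the persist step; making the saved model available as a permanent Spark SQL UDF is the open question noted above.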
For future consideration, we'll cover some enhancements as follows.
There is a proposed ML SQL syntax to ease machine-learning pipeline creation:
train [hive table/view name] as [ML algorithm] at [model path]
where [algorithm parameter]
using [feature set]
Example
train hive_book_transact as DecisionTreeClassifier at '/tmp/model/dt_001/'
where label_col='label' and feature_col='features'
using
StringIndexer(fx_rate, fx_rate_indexed) and
StringIndexer(profit, label) and
VectorAssembler(feature_array, features)
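A front end for the proposed syntax would need to parse the train statement into a spec before generating MLlib code. A minimal Python sketch using a regex over the grammar above; the regex and the returned field names are illustrative assumptions, not a full grammar:

```python
# Hypothetical parser for the proposed
#   train <table> as <algorithm> at '<model path>' where <params> using <stages>
# statement. Regex-based for illustration only.
import re

ML_SQL = re.compile(
    r"train\s+(?P<table>\w+)\s+as\s+(?P<algorithm>\w+)\s+at\s+'(?P<path>[^']+)'"
    r"\s+where\s+(?P<params>.+?)\s+using\s+(?P<features>.+)",
    re.DOTALL | re.IGNORECASE,
)


def parse_train(stmt: str) -> dict:
    m = ML_SQL.match(stmt.strip())
    if not m:
        raise ValueError("not a valid train statement")
    params = {}
    for pair in re.split(r"\s+and\s+", m.group("params")):
        key, value = pair.split("=", 1)
        params[key.strip()] = value.strip().strip("'")
    stages = [s.strip() for s in re.split(r"\s+and\s+", m.group("features"))]
    return {"table": m.group("table"), "algorithm": m.group("algorithm"),
            "path": m.group("path"), "params": params, "stages": stages}


parsed = parse_train(
    "train hive_book_transact as DecisionTreeClassifier at '/tmp/model/dt_001/' "
    "where label_col='label' using StringIndexer(fx_rate, fx_rate_indexed)"
)
```

The parsed spec could then feed the same MLlib code generator used by the guided flow.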
Reference
The following ML engines are considered for the next move in DataFrame-based ML.
Other References
ML SQL Reference