datafibers-community / df_data_service

DataFibers Data Service
http://www.datafibers.com
Apache License 2.0

DF Machine Learning Features Design Specification #169

Open datafibers opened 6 years ago

datafibers commented 6 years ago

UI Design

Transform View Changes

In this view, you will see two more features supporting machine learning jobs under [Add]: model over stream and model training/persist.

Model over Stream (MOS)

This deploys the trained model on a stream using Spark/Flink/Kafka streaming. In this case, we need to provide the following parameters, such as

Once deployed, you can see this job in the [Transform] main view with the specific category.

Model Training/Persist (MTP)

This launches a job to train our target models and persist them. A detailed guideline (or a customized flow without the guide) will be provided for the following factors, such as

Once the job is submitted, you'll see it in the [Transform] main view with a specific job type/category. The training job can be resubmitted with different input until you are satisfied with the model and would like to persist it. If the persist option is chosen, the trained model is available in the [Model] view immediately.

New Model View

This creates a new Model view in the UI. In this view, we are able to do the following things.

Back-end Design

We'll mainly leverage the Apache Spark MLlib DataFrame-based API (not the RDD-based one) for this feature. The UI guide will generate the proper PySpark code and submit it to Livy/Spark. In terms of new development, we may need the following things.
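To illustrate the generate-and-submit flow described above, here is a minimal sketch of the request bodies a back-end could send to Livy's REST API (POST /sessions, then POST /sessions/{id}/statements). The host, the helper names, and the sample generated code are assumptions for illustration and are not specified in this issue; the sketch only builds the payloads and does not contact a server.

```python
import json

# Hypothetical Livy endpoint; the issue does not specify host or port.
LIVY_URL = "http://localhost:8998"


def build_session_request():
    """Body for POST /sessions: ask Livy for an interactive PySpark session."""
    return {"kind": "pyspark"}


def build_statement_request(pyspark_code):
    """Body for POST /sessions/{id}/statements: the generated code to run."""
    return {"code": pyspark_code}


def statement_url(session_id):
    """URL used to submit a statement to an existing session."""
    return f"{LIVY_URL}/sessions/{session_id}/statements"


# Example of code the UI guide might generate (illustrative only).
generated_code = "\n".join([
    "from pyspark.ml.classification import DecisionTreeClassifier",
    "df = spark.table('hive_book_transact')",
    "model = DecisionTreeClassifier(labelCol='label', featuresCol='features').fit(df)",
])

session_body = json.dumps(build_session_request())
statement_body = json.dumps(build_statement_request(generated_code))
print(statement_url(0))
print(statement_body)
```

Keeping payload construction separate from the HTTP call makes the generated code easy to inspect or log before it is submitted.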

For future consideration, we'll cover some enhancements as follows.

Reference

The following ML engines are considered for the next move in df

Other References

  1. Apply Flink ML to streaming

ML SQL Reference

This is a proposed ML SQL reference to ease machine learning pipeline creation.

train [hive table/view name] as [ML algorithm] at [model path]
where [algorithm parameter]
using [feature set]
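
One way to read the syntax above is as a single statement with five parts: table, algorithm, path, parameters, and feature stages. The following is a rough sketch of a regex-based parser under that reading. The function name, the returned dictionary layout, and the assumption that parameters are always written as name='value' are mine, not part of the proposal.

```python
import re

# Overall statement shape:
#   train <table> as <algorithm> at '<path>' where <params> using <stages>
STATEMENT_RE = re.compile(
    r"train\s+(?P<table>\w+)\s+as\s+(?P<algorithm>\w+)\s+at\s+'(?P<path>[^']*)'"
    r"\s+where\s+(?P<params>.*?)\s+using\s+(?P<stages>.*)",
    re.IGNORECASE | re.DOTALL,
)
# Assumed parameter form: name='value', joined by "and".
PARAM_RE = re.compile(r"(\w+)\s*=\s*'([^']*)'")
# Assumed stage form: Name(arg1, arg2), joined by "and".
STAGE_RE = re.compile(r"(\w+)\s*\(([^)]*)\)")


def parse_train_statement(text):
    """Split a train statement into its parts; raise ValueError if malformed."""
    match = STATEMENT_RE.match(text.strip())
    if match is None:
        raise ValueError("not a valid train statement")
    params = dict(PARAM_RE.findall(match.group("params")))
    stages = [
        (name, [arg.strip() for arg in args.split(",")])
        for name, args in STAGE_RE.findall(match.group("stages"))
    ]
    return {
        "table": match.group("table"),
        "algorithm": match.group("algorithm"),
        "path": match.group("path"),
        "params": params,
        "stages": stages,
    }


statement = """
train hive_book_transact as DecisionTreeClassifier at '/tmp/mode/dt_001/'
where label_col='label' and feature_col='features'
using
StringIndexer(fx_rate, fx_rate_indexed) and
StringIndexer(profit, label) and
VectorAssembler(feature_array, features)
"""
parsed = parse_train_statement(statement)
print(parsed)
```

A regex is enough for a sketch; a real implementation would likely need a proper grammar to handle quoting and nested arguments.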

Example

train hive_book_transact as DecisionTreeClassifier at '/tmp/mode/dt_001/'
where label_col='label' and feature_col='features'
using 
StringIndexer(fx_rate, fx_rate_indexed) and 
StringIndexer(profit, label) and 
VectorAssembler(feature_array, features)
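
As a hedged illustration of what the example statement above might translate to, the sketch below renders a PySpark pipeline script as text from a hand-written description of the statement. The mapping of label_col/feature_col to labelCol/featuresCol, the keyword arguments chosen for each stage, and the contents of feature_array (which the issue does not define) are all assumptions. The generated script also assumes a `spark` session already exists, as it does in a Livy PySpark session.

```python
# Hand-written description of the example statement. The mapping from ML SQL
# names to PySpark keyword arguments is an assumption for illustration.
spec = {
    "table": "hive_book_transact",
    "algorithm": "DecisionTreeClassifier",
    "path": "/tmp/mode/dt_001/",
    "params": {"labelCol": "label", "featuresCol": "features"},
    "stages": [
        ("StringIndexer", {"inputCol": "fx_rate", "outputCol": "fx_rate_indexed"}),
        ("StringIndexer", {"inputCol": "profit", "outputCol": "label"}),
        # feature_array is undefined in the issue; assumed to be a column list.
        ("VectorAssembler", {"inputCols": ["fx_rate_indexed"], "outputCol": "features"}),
    ],
}


def render_call(name, kwargs):
    """Render a constructor call such as StringIndexer(inputCol='a', ...)."""
    args = ", ".join(f"{key}={value!r}" for key, value in kwargs.items())
    return f"{name}({args})"


def generate_pyspark(spec):
    """Return PySpark source text that trains and persists the pipeline."""
    stage_calls = [render_call(name, kwargs) for name, kwargs in spec["stages"]]
    # The estimator is appended as the final pipeline stage.
    stage_calls.append(render_call(spec["algorithm"], spec["params"]))
    lines = [
        "from pyspark.ml import Pipeline",
        f"from pyspark.ml.classification import {spec['algorithm']}",
        "from pyspark.ml.feature import StringIndexer, VectorAssembler",
        "",
        f"df = spark.table({spec['table']!r})",
        "pipeline = Pipeline(stages=[",
        *[f"    {call}," for call in stage_calls],
        "])",
        "model = pipeline.fit(df)",
        f"model.write().overwrite().save({spec['path']!r})",
    ]
    return "\n".join(lines)


code = generate_pyspark(spec)
print(code)
```

Generating text rather than running Spark directly matches the design above, where the resulting code is submitted to Livy/Spark for execution.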

datafibers commented 6 years ago

UI Preview: [screenshot of the DataFibers UI transform create page at localhost:3000]