apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Discussion][Machine Learning] Support AI task and the open source project about MLops #9725

Closed jieguangzhou closed 2 years ago

jieguangzhou commented 2 years ago

Search before asking

Description

I saw a Machine Learning Platform post on Medium. The post talks about the Lizhi Machine Learning Platform & Apache DolphinScheduler. https://medium.com/@DolphinScheduler/a-formidable-combination-of-lizhi-machine-learning-platform-dolphinscheduler-creates-new-paradigm-e445938f1af

Inspired by that, I tried to build something similar using MLflow, sklearn, LightGBM, XGBoost, and DolphinScheduler. Figure 1 shows the training workflow startup screen.

image

In this workflow, I implemented four algorithms (SVM, LR, LGBM, XGBoost) using the APIs of sklearn, LightGBM, and XGBoost. Each algorithm's parameters can be filled in through the value of the "params" key. In this case, the parameters for LGBM are "n_estimators=200;num_leaves=20".
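
The "params" value is simply a semicolon-separated list of key=value pairs. A minimal shell sketch of how such a string can be split, purely for illustration (the actual parsing happens inside the MLflow project code, not in the DolphinScheduler task):

# Illustration only: split "key=value;key=value" pairs in bash
params="n_estimators=200;num_leaves=20"
IFS=';' read -ra kv_pairs <<< "$params"
for pair in "${kv_pairs[@]}"; do
  echo "param: ${pair%%=*} = ${pair#*=}"
done
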

The experiment tracking module is supported by MLflow. The picture below shows the report of the experiment.

image

I register the model every time I run it.

image

Once the model is trained, we run the deployment workflow, like this:

image

We can deploy version 2 of the model to the k8s cluster.

And then we can see the deployment and pods:

image

At the same time, we can access the service through the interface.

image

BTW, we can also connect the training workflow with the deployment workflow as a sub-workflow, like this.

image

The training workflow contains one task; its code is as follows (the ${...} placeholders are DolphinScheduler task parameters that are substituted at run time):

data_path=${data_path}
export MLFLOW_TRACKING_URI=${MLFLOW_TRACKING_URI}
echo $data_path
repo=https://github.com/jieguangzhou/mlflow_sklearn_gallery.git
mlflow run $repo -P algorithm=${algorithm} -P data_path=$data_path -P params="${params}" -P param_file=${param_file} -P model_name=${model_name} --experiment-name=${experiment_name}

echo "training finish"

The deployment workflow contains two tasks.

image

The code of the "build docker" workflow is as follows

eval $(minikube -p minikube docker-env)
export MLFLOW_TRACKING_URI=${MLFLOW_TRACKING_URI}
image_name=mlflow/${model_name}:${version}
echo $image_name
mlflow models build-docker -m "models:/${model_name}/${version}" -n $image_name --enable-mlserver
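
The resulting image can be smoke-tested locally before it goes anywhere near the cluster. A quick sketch (the model name and version are placeholders; port 8080 matches the containerPort used in the deployment below):

# Run the freshly built image and expose its serving port locally
docker run --rm -p 8080:8080 mlflow/iris_demo:2
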

The code of the "create deployment" workflow which deploys the model to the k8s cluster is as follows

version_lower=$(echo "${version}" | tr '[:upper:]' '[:lower:]')
kubectl apply -f - << END
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-${model_name}-$version_lower
spec:
  selector:
    matchLabels:
      app: mlflow
  replicas: 3 # tells the deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow-iris
        image: mlflow/${model_name}:${version}
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080

---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-${model_name}-$version_lower
spec:
  ports:
  - port: 8080
    targetPort: 8080
  selector:
    app: mlflow
END

sleep 5s

kubectl port-forward deployment/mlflow-${model_name}-$version_lower ${deployment_port}:8080
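
Once the port-forward is up, the model can be queried through the forwarded port. A hedged sketch: the /invocations route and the payload layout follow the MLflow scoring protocol, but the exact format depends on the MLflow/MLServer versions baked into the image, and the feature columns here are placeholders.

# Query the model through the forwarded port (${deployment_port} is the DolphinScheduler parameter above).
# Depending on the serving backend, the V2 route /v2/models/<model>/infer may be needed instead.
curl -X POST "http://localhost:${deployment_port}/invocations" \
    -H "Content-Type: application/json" \
    -d '{"columns": ["sepal_length", "sepal_width", "petal_length", "petal_width"], "data": [[5.1, 3.5, 1.4, 0.2]]}'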

The above workflows are based on the Shell task, but that is too complex for ML engineers. I hope to write new task types that make these steps easier for users.

Future work:

Use case

No response

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

github-actions[bot] commented 2 years ago

Thank you for your feedback, we have received your issue. Please wait patiently for a reply.

jieguangzhou commented 2 years ago

Is anybody interested in AI tasks for DS?

jieguangzhou commented 2 years ago

I added a GridSearch feature to the example, like this, so that we can search for a model's best parameters.

image

After the model is trained, we can also see the parameter search report on the MLflow dashboard, like this:

image

The value of the "search_params" key above is "max_depth=[5, 10];n_estimators=[100, 200]" for XGBoost.

We can also search over more parameters; everything exposed in each algorithm's API can be used. For example: svm: "kernel=['linear', 'poly', 'rbf'];C=[0.5, 1.0]"; lr: "penalty=['l1', 'l2'];C=[0.5, 1.0]"; lightgbm: "max_depth=[5, 10];n_estimators=[100, 200]".
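
Presumably the training task forwards this string to the MLflow project in the same way as "params"; a hypothetical sketch (the -P search_params parameter name is an assumption for illustration, not confirmed against the gallery repo):

# Hypothetical: pass the grid definition through to the project alongside the other parameters
mlflow run https://github.com/jieguangzhou/mlflow_sklearn_gallery.git \
    -P algorithm=xgboost \
    -P search_params="max_depth=[5, 10];n_estimators=[100, 200]" \
    --experiment-name=${experiment_name}
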

jieguangzhou commented 2 years ago

The above work tries to build an MLOps system using DolphinScheduler as the orchestration system. I think it would be cool if we added more and more popular machine learning tools to DolphinScheduler.

Superskyyy commented 2 years ago

Hi, we are planning to initiate our exploration of AIOps at Apache SkyWalking community. Very interesting to see the discussions here.

Now I'm also looking at Dolphinscheduler to handle our workflow orchestration, and for now, we may as well go in the same direction as yours. I feel like integration with MLFlow package functionalities will be a good point to boost ML-developer experience to the next level.

jieguangzhou commented 2 years ago

> Hi, we are planning to initiate our exploration of AIOps at Apache SkyWalking community. Very interesting to see the discussions here.

> Now I'm also looking at Dolphinscheduler to handle our workflow orchestration, and for now, we may as well go in the same direction as yours. I feel like integration with MLFlow package functionalities will be a good point to boost ML-developer experience to the next level.

Hi, good to see you join the discussion. I just read your discussion (https://github.com/apache/skywalking/discussions/8883). It should be a great project.

I think DolphinScheduler will be able to schedule AIOps scenarios in the near future. I am enriching its scheduling features in the field of artificial intelligence, and the MVP is being implemented.

We can keep talking about that. BTW, I might do some experiments with this data set, but I can't access it right now. https://github.com/CloudWise-OpenSource/GAIA-DataSet

zhongjiajie commented 2 years ago

Looking great! I am very optimistic about the prospects of this. And as I said in the mail thread, I think machine learning is just another kind of orchestration, and most machine learning source data or training samples come from data warehouses or data lakes, which we already support in the current version. If DolphinScheduler could support machine learning tasks, users could finish their jobs in one single tool instead of switching between several.

Superskyyy commented 2 years ago

I'll be happy to follow this and provide help. Also happy to integrate and test the outcomes in the new SkyWalking ecosystem AIOps project.