ADaPT-ML

A Data Programming Template for Machine Learning

Often when studying natural phenomena by creating data-driven models, processing the data becomes the largest challenge. Without a framework to build upon and implement one's ideas, researchers are forced to hastily build inflexible programs from the ground up. When hypotheses need to be reworked, or modelling a new aspect of the phenomena becomes necessary, even more time is spent on the program before new ideas can finally be tested. This is inherently problematic, and it invites further problems, such as treating internal and external validation as an afterthought rather than as a checkpoint built into the pipeline.

ADaPT-ML aims to be the flexible framework upon which researchers can implement their understanding of the phenomena under study. This software was created especially for any researcher with:

ADaPT-ML takes as much of the development work as possible out of creating novel models of phenomena for which we have well-developed theories that have yet to be applied to big data.

Introduction

ADaPT-ML is composed of a number of open-source tools and libraries, as shown in this system diagram. To familiarize yourself with these components, please review the tools and libraries linked to below the diagram.

System Diagram

Now that you are familiar with the concepts, terminology, and tools that make up ADaPT-ML, let's look at the example use case included in this repository. Once you have an understanding of how ADaPT-ML works and want to get started with your own use case, please refer to these instructions for testing ADaPT-ML on your machine, and the usage guidelines, including how to contribute to this project.

Example Usage

Our Example Use Case is to develop a model that can predict whether a data point is about a cat, dog, bird, horse, or snake. Intuitively this is purely a multilabel task, since it is reasonable to assume that one or more animals could be mentioned in a single datapoint. To demonstrate how to handle both kinds of task, however, it has been divided into a multiclass setting, where a data point can belong to only one possible class, and a multilabel setting, where a data point can belong to one or many classes. (It is not necessary for you to divide your own classification task into multiclass and multilabel settings.)

All of the directories and files mentioned in the following steps exist in the locations specified in the .env file of this repository. To follow along using the various UIs, complete Step 1 and these tests to get ADaPT-ML running on your host machine, and go to the following addresses in your web browser of choice:

  1. localhost:4200 for CrateDB
  2. localhost:8080 for Label Studio
  3. localhost:5000 for data programming MLflow
  4. localhost:5001 for modelling MLflow
  5. localhost:81/docs for FastAPI

Step 1: obtain and featurize some data

We do not have an existing annotated dataset for this classification task, so the first step will be to create one. When you first get started, you will need to gather the appropriate data for your task, and featurize it in two ways:

  1. Decide on which features you would pull out to assist in annotation if you were going to manually assign classes to the datapoints.
  2. Decide on how you would represent the datapoints as feature vectors for the End Model. To keep this use case simple, our first feature set is simply the lemmatized tokens, and our second feature set is the output of the Universal Sentence Encoder given the raw text as input (see the sketch below).
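
As a rough illustration of this two-way featurization (not the repository's actual code), the sketch below lemmatizes the text with spaCy and embeds it with the Universal Sentence Encoder from TensorFlow Hub; the model names and the featurize function here are illustrative choices.

import spacy
import tensorflow_hub as hub

nlp = spacy.load("en_core_web_sm")
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def featurize(texts):
    # Feature set 1: lemmatized tokens, to assist annotation and Labeling Functions
    lemmas = [[tok.lemma_ for tok in nlp(text) if not tok.is_punct] for text in texts]
    # Feature set 2: 512-dimensional Universal Sentence Encoder embeddings for the End Model
    embeddings = use(texts).numpy()
    return lemmas, embeddings

lemmas, vectors = featurize(["My cat chased a bird around the yard today."])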

In this use case, data points were manually created with only a text component to keep it simple, but consider the tweets 1a-1e in the diagram below.

Step One

Many of them have both text and images that can provide information for more accurate classification; the diagram above runs through each datapoint.

This diagram demonstrates the process of setting up the example use case data in a table in CrateDB so that it is ready for ADaPT-ML. You can refer to this script to see how this was accomplished in detail. As long as each table has these essential columns, you can combine multiple tables to create your training and testing data:
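
For a sense of what that setup involves, here is a hypothetical sketch that writes one featurized datapoint into CrateDB with the crate Python client. The example_data table and its column names are illustrative assumptions, not ADaPT-ML's required schema, which is defined in the script linked above.

from crate import client

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()

# Illustrative table: the raw text plus the two feature sets from Step 1
cursor.execute("""
    CREATE TABLE IF NOT EXISTS example_data (
        id TEXT PRIMARY KEY,
        txt TEXT,
        txt_clean_lemma ARRAY(TEXT),
        txt_use ARRAY(DOUBLE)
    )
""")

cursor.execute(
    "INSERT INTO example_data (id, txt, txt_clean_lemma, txt_use) VALUES (?, ?, ?, ?)",
    ("20", "My cat chased a bird today.",
     ["my", "cat", "chase", "a", "bird", "today"],
     [0.0] * 512),  # placeholder embedding; real values come from the encoder
)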

Step 2: create a gold dataset using Label Studio

We are now ready to annotate a sample of data in Label Studio! Because we only have a total of 15 datapoints for the multiclass setting and 15 for the multilabel setting, they were all annotated manually, but in a real-world application of this classification task, it is likely we would have hundreds of thousands of datapoints. In this case, we would instruct two or more annotators to manually label a few hundred datapoints for a few purposes:

  1. Gather feedback from the annotators to inform how we can update or create new Labeling Functions
  2. Estimate the class balance and make it available to the Label Model during training
  3. Perform an empirical evaluation of the Labeling Functions and Label Model
  4. Validate the End Model

The first step to annotate data using Label Studio is to set up the project using the Label Studio UI. For this example use case, we enter localhost:8080 (if you changed the port in docker-compose.yml, replace 8080 with what you entered) in a web browser. Create an account, and set up the project (we simply called it "example").

The second step is to sample some data from CrateDB. The sampling method currently implemented in ADaPT-ML is a random N, so this command was used to sample all 30 datapoints for the multiclass and multilabel settings:

docker exec label-studio-dev python ./ls/sample_tasks.py example_data txt 30 example --filename example_tasks.json

This module will format the data in the column names provided so that it can be read by Label Studio, and save a file in the $LS_TASKS_PATH directory. The diagram below shows the process of using Label Studio to import the sampled data, annotate it, and export it.
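
The exact fields written by sample_tasks.py depend on the column names you pass it; in general, Label Studio import files are a JSON list of tasks, where everything an annotator should see sits under a "data" key. A rough Python sketch of what one task for this use case might look like (the field names here are illustrative, taken from the sampling command above):

import json

# One Label Studio task: the annotator-visible fields go under "data"
task = {
    "data": {
        "table_name": "example_data",
        "id": "20",
        "txt": "My cat chased a bird today.",
    }
}

with open("example_tasks.json", "w") as f:
    json.dump([task], f, indent=2)  # Label Studio imports a list of tasks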

Step Two

Now that we have labeled all of our sample data and exported the results, we need to process the JSON file back into the Pandas DataFrames that ADaPT-ML can use. Because we had multiple annotators label each datapoint, we need to decide how we want to compile these labels into one gold label set. These two tasks are accomplished through this command:

docker exec label-studio-dev python ./ls/process_annotations.py example_annotations.json example 1

The processed DataFrames are saved in $LS_ANNOTATIONS_PATH/example. We can then check inter-annotator agreement with this command:

docker exec label-studio-dev python ./ls/annotator_agreement.py example

This module uses task_df.pkl to calculate Krippendorff's alpha. For demonstration purposes, worker_2 intentionally disagreed with worker_1 on several datapoints, while worker_1 made all correct choices. Between worker_1 and worker_2, the agreement report looks like this:

TASK: example
NOMINAL ALPHA: 0.43847133757961776
RESULT: 0.43847133757961776 < 0.667. Discard these annotations and start again. 

This would normally prompt an iteration on the labelling process, but for this example we will simply use worker_1's labels as the gold dataset.
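
For reference, an alpha like the one reported above can be computed with the krippendorff package; the sketch below uses made-up integer-coded labels (0=cat, 1=dog, ...) rather than the data behind the actual report.

import numpy as np
import krippendorff

# Rows are annotators, columns are datapoints; np.nan marks items an annotator skipped
reliability_data = np.array([
    [0, 1, 2, 3, 4, 0, 1],       # worker_1
    [0, 1, 2, 0, 4, np.nan, 3],  # worker_2 disagrees on some items
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"NOMINAL ALPHA: {alpha}")
if alpha < 0.667:
    print(f"RESULT: {alpha} < 0.667. Discard these annotations and start again.")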

Step 3: use data programming to create a labeled dataset

Now that we have our gold labels, we are ready to perform data programming to label more data. We have followed these instructions to modify ADaPT-ML for our example classification task. We also sampled some data from CrateDB that we want to use as training data; for this example use case, we have one DataFrame with the multiclass datapoints and one DataFrame with the multilabel datapoints, and both DataFrames only have the columns id and table_name. The DataFrames are called multiclass_df.pkl and multilabel_df.pkl, and both are stored in $DP_DATA_PATH/unlabeled_data.
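
Part of modifying ADaPT-ML for a new task is writing Labeling Functions for it. As a hedged illustration only (the label constants, the lf_cat_keywords function, and the x.txt_clean_lemma field are assumptions for this sketch, not the repository's code), a keyword-based Labeling Function written with Snorkel looks roughly like this:

from snorkel.labeling import labeling_function

ABSTAIN, CAT = -1, 0

@labeling_function()
def lf_cat_keywords(x):
    # Vote CAT when any cat-related lemma appears in the datapoint; otherwise abstain
    cat_lemmas = {"cat", "kitten", "meow"}
    return CAT if cat_lemmas & set(x.txt_clean_lemma) else ABSTAIN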

Once we run the commands in the following code block...

docker exec dp-mlflow sh -c ". ~/.bashrc && wait-for-it dp-mlflow-db:3306 -s -- mlflow run --no-conda -e example --experiment-name eg -P train_data=/unlabeled_data/multiclass_df.pkl -P dev_data=1 -P task=multiclass -P seed=8 ."

docker exec dp-mlflow sh -c ". ~/.bashrc && wait-for-it dp-mlflow-db:3306 -s -- mlflow run --no-conda -e example --experiment-name eg -P train_data=/unlabeled_data/multilabel_df.pkl -P dev_data=1 -P task=multilabel -P seed=8 ."

...we can check out the results using the MLflow UI, as seen in the diagram below.

Step 3

Once we have experimented with the Label Model parameters, Labeling Functions, and datasets to our satisfaction, we can make note of the experiment ID (EXP_ID) and run ID (RUN_ID) to access the training_data.pkl and development_data.pkl that we want to use in End Model training and evaluation. For ease of demonstration, these artifacts have been placed directly in ${DP_DATA_PATH}/mlruns, but normally they would be found in ${DP_DATA_PATH}/mlruns/EXP_ID/RUN_ID/artifacts.
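
Assuming these artifacts are pickled pandas DataFrames like the other .pkl files in this walkthrough, they can be inspected directly before End Model training, for example:

import pandas as pd

# Path follows the layout described above: ${DP_DATA_PATH}/mlruns/EXP_ID/RUN_ID/artifacts
train_df = pd.read_pickle("mlruns/EXP_ID/RUN_ID/artifacts/training_data.pkl")
print(train_df.head())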

Step 4: create an End Model

Now that we have training data labeled by the Label Model and testing data with gold labels, we can create an End Model. Given a DataFrame containing only the id and table_name columns, the End Model will look up the appropriate features for each datapoint in CrateDB and produce a DataFrame with a binary encoding of the predicted class(es) and the probability distribution over all classes. Currently, ADaPT-ML has only one machine learning algorithm: scikit-learn's Multi-layer Perceptron (MLP), a classifier that optimizes the log-loss function using LBFGS or stochastic gradient descent.
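
Under the hood this corresponds to scikit-learn's MLPClassifier. The following is a simplified sketch of that core step with random stand-in features; ADaPT-ML itself handles the CrateDB feature lookup and wraps training in an MLflow run.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in for the txt_use features (512-dimensional USE embeddings) and integer labels
X_train = np.random.rand(30, 512)
y_train = np.random.randint(0, 5, size=30)  # 0=cat, 1=dog, 2=bird, 3=horse, 4=snake

clf = MLPClassifier(solver="lbfgs", random_state=8, max_iter=500)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_train[:1]))  # probability distribution over all classes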

After running the commands in this code block...

docker exec modelling-mlflow sh -c ". ~/.bashrc && wait-for-it modelling-mlflow-db:3306 -s -- mlflow run --no-conda -e mlp --experiment-name eg -P train_data=/dp_mlruns/multiclass_training_data.pkl -P test_data=/dp_mlruns/multiclass_development_data.pkl -P features=txt_use -P solver=lbfgs -P random_state=8 ."

docker exec modelling-mlflow sh -c ". ~/.bashrc && wait-for-it modelling-mlflow-db:3306 -s -- mlflow run --no-conda -e mlp --experiment-name eg -P train_data=/dp_mlruns/multilabel_training_data.pkl -P test_data=/dp_mlruns/multilabel_development_data.pkl -P features=txt_use -P solver=lbfgs -P random_state=8 ."

...we can check out the results in MLflow, as shown in the diagram below.

Step 4

Once we have experimented with the MLP parameters, and iterated further on the data programming step if necessary, we can prepare our models for deployment simply by updating the model environment variables in .env and the environment section of the m_deploy service in docker-compose.yml to point to python_model.pkl. For this example use case, the multiclass and multilabel models were copied and renamed to ${MODELLING_DATA_PATH}/mlruns/multiclass_model.pkl and ${MODELLING_DATA_PATH}/mlruns/multilabel_model.pkl.

Step 5: deploy the End Model

This diagram shows the FastAPI UI for the deployed models.

Step 5

Now we can get multiclass predictions...

curl -X 'POST' \
  'http://localhost/predict_multiclass_example' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "table_name": [
    "example_data"
  ],
  "id": [
    "20"
  ]
}'
{
  "table_name": [
    "example_data"
  ],
  "id": [
    "20"
  ],
  "cat": [
    0
  ],
  "dog": [
    0
  ],
  "bird": [
    0
  ],
  "horse": [
    0
  ],
  "snake": [
    1
  ],
  "prob_cat": [
    5.6850715594352195e-8
  ],
  "prob_dog": [
    0.0001963686969921083
  ],
  "prob_bird": [
    8.922841061481865e-8
  ],
  "prob_horse": [
    8.82467128837139e-9
  ],
  "prob_snake": [
    0.9998034763992105
  ]
}

...and multilabel predictions...

curl -X 'POST' \
  'http://localhost/predict_multilabel_example' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "table_name": [
    "example_data"
  ],
  "id": [
    "03"
  ]
}'
{
  "table_name": [
    "example_data"
  ],
  "id": [
    "03"
  ],
  "cat": [
    1
  ],
  "dog": [
    1
  ],
  "bird": [
    0
  ],
  "horse": [
    0
  ],
  "snake": [
    0
  ],
  "prob_cat": [
    0.999976879069893
  ],
  "prob_dog": [
    0.9999725147168369
  ],
  "prob_bird": [
    2.061596293323691e-8
  ],
  "prob_horse": [
    1.7205732529738035e-7
  ],
  "prob_snake": [
    2.0265644234853424e-8
  ]
}

...for any datapoint that has the txt_use feature set in CrateDB. We have successfully created a model for our new example use case!
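
The same requests can of course be made programmatically; here is a minimal Python sketch using the requests library that mirrors the multiclass curl call above.

import requests

payload = {"table_name": ["example_data"], "id": ["20"]}
response = requests.post("http://localhost/predict_multiclass_example", json=payload)
print(response.json())  # binary class encodings plus the prob_* distribution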