Closed — TobiasRoeddiger closed this 2 years ago
Currently this does not consider what we already have implemented here. I think we can learn from it and get some inspiration for our architecture.
Features we need:
abstract class EdgeModel
+ abstract get_hyperparameters();
+ abstract fit(X, y, hyperparameters);
+ abstract predict(X);
+ abstract compileFirmware(targetPlatform); // generates the binary for the target platform
- window_data(X, y, width, stride); // will be called before fit, returns tuple of data and labels
class RandomForest extends EdgeModel
+ get_hyperparameters(); // call super and add own hyperparameters
+ fit(X, y, hyperparameters);
+ predict(X);
+ compileFirmware(targetPlatform);
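Assuming a Python backend, the interface outlined above might be sketched as follows. The method names follow the outline; the `window_data` labeling rule (take the label of the last sample in each window) is a placeholder choice, not a decided behavior:

```python
from abc import ABC, abstractmethod


class EdgeModel(ABC):
    """Abstract base class for trainable edge models (sketch)."""

    @abstractmethod
    def get_hyperparameters(self):
        """Return the meta-JSON hyperparameter description."""

    @abstractmethod
    def fit(self, X, y, hyperparameters):
        """Train the model on windowed data."""

    @abstractmethod
    def predict(self, X):
        """Predict labels for windowed data."""

    @abstractmethod
    def compile_firmware(self, target_platform):
        """Generate the binary for the target platform."""

    def window_data(self, X, y, width, stride):
        # Called before fit; returns a tuple of (windows, labels).
        # Labeling each window by its last sample is an assumption.
        windows, labels = [], []
        for start in range(0, len(X) - width + 1, stride):
            windows.append(X[start:start + width])
            labels.append(y[start + width - 1])
        return windows, labels
```

A concrete model such as `RandomForest` would subclass this, call `super().get_hyperparameters()` and extend the result with its own parameters.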
{
  param1: { ... },
  param2: {
    type: 'selection',
    options: ['full', 'deep', 'half'],
    required: true, // this parameter has to be selected
    multiSelect: false // determines if multiple items can be selected
  },
  param3: {
    type: 'number', // select value by typing a number
    min: 0,
    max: 5,
    inclusiveMin: true, // can min and max be selected
    inclusiveMax: false,
    precision: 'int', // 'float' also possible
    required: false
  },
  param4: {
    type: 'boolean', // can only be true or false
    required: false
  }
}
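The frontend (or backend, before training) could validate submitted values against this meta format. A minimal sketch covering the three types above — the function name and dict shape are illustrative, not part of the spec:

```python
def validate_param(spec, value):
    """Check a single value against a meta-format parameter spec.

    Supports the three types from the example above:
    'selection', 'number', and 'boolean'.
    """
    if value is None:
        # Missing values are fine only for optional parameters.
        return not spec.get('required', False)
    t = spec['type']
    if t == 'selection':
        if spec.get('multiSelect', False):
            return all(v in spec['options'] for v in value)
        return value in spec['options']
    if t == 'number':
        if spec.get('precision') == 'int' and not isinstance(value, int):
            return False
        lo_ok = (value >= spec['min'] if spec.get('inclusiveMin', True)
                 else value > spec['min'])
        hi_ok = (value <= spec['max'] if spec.get('inclusiveMax', True)
                 else value < spec['max'])
        return lo_ok and hi_ok
    if t == 'boolean':
        return isinstance(value, bool)
    return False
```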
Returns the list of available models that can be trained.
Each model has the following information:
{
id: "sdfasdfasf34f334f34f",
name: "RandomForest",
hyperparameters: META_JSON_FORMATTED_HYPERPARAMETERS, // can be parsed by the frontend to show configuration UI
isPro: false, // determines if the model can only be used by Pro users
}
Trains the model based on the given parameters:
{
project: "sdfoi3hf09whf0923hd", // project id for which the model is created
datasets: ["sadfsdf4f34f", "sdaf34g45g9j45g", ....], // list of datasets ids to use for model training and testing
labels: ["4456432f44v45ff", "34f234f425fmm4"], // label ids to consider for the classification task
hyperparameters: { ... }, // as specified from the meta format
}
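A request with this shape could be assembled in Python as below. Only the body format comes from the example above; the route, auth header, and how the request is sent are not specified here:

```python
import json


def build_training_request(project, datasets, labels, hyperparameters):
    """Serialize a training request body in the format sketched above.

    Field names mirror the example; transport details (endpoint,
    authentication) are intentionally left out.
    """
    return json.dumps({
        'project': project,              # project id
        'datasets': datasets,            # list of dataset ids
        'labels': labels,                # label ids for the task
        'hyperparameters': hyperparameters,  # meta-format values
    })
```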
Response: 200
Returns the list of models that were trained.
Response: list of trained models
Returns a specific model that was trained:
{
modelId: "sdfasdfasf34f334f34f", // id of the model used for training
hyperparameters: { ... }, // as specified before training
confusionMatrix: { ... }, // the confusion matrix of all classes
size: 140.3, // the size of the model in kB
}
I talked to @riedel about this and we are thinking about running a Hackathon to build an initial feature extraction library that we can use on Arduino and also call from python on edge-ml. This way we could retain the logic across platforms.
One more idea for discussion: we could maybe even open this hackathon to externals. I would actually give the hackathon a broader scope, including AutoML, etc.
Yes. That would be really nice once we have the meta architecture ready.
Maybe it makes sense to do this in a two-step approach then:
Yes, good idea. The python library to pull data is almost ready. For the internal hackathon, we would have different people build different models from the data they pull using the python library. This way we could better understand the requirements for the meta architecture and also validate that collecting data and labeling work as expected.
How relevant will porting to the edge device be for the hackathon?
Latest update:
@KtrauM maybe we can add a sensor filter to the notebook, too? This would avoid some bugs (e.g., defining ACC_x etc. as target sensors at the start). Otherwise, what happens if we have a project whose datasets contain different sensors? Then we could just drop datasets that don't have all target sensors and ignore useless "non-target" sensor streams.
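The filtering idea above could look roughly like this in the notebook. The `datasets` shape (dataset id mapping to sensor-name-to-samples) is an assumption for illustration, not the actual edge-ml data format:

```python
def filter_datasets(datasets, target_sensors):
    """Keep only datasets that contain every target sensor, and drop
    non-target sensor streams from the datasets that remain.

    `datasets`: dict of dataset id -> {sensor name -> samples}
    (an assumed shape, for illustration only).
    """
    filtered = {}
    for ds_id, sensors in datasets.items():
        if all(s in sensors for s in target_sensors):
            # Keep only the target streams; drop the rest.
            filtered[ds_id] = {s: sensors[s] for s in target_sensors}
    return filtered
```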
We want to have a first full pipeline as soon as possible. Therefore, we need a strategy that contributes efficiently and iteratively towards that goal. Please feel free to share your thoughts and suggestions below.
Next Sprint (~ 8 weeks)
data access from python: we need a small library to retrieve data from edge-ml into python for model training and validation. Ideally, we could do that based on the API token that we already use to push data to the cloud from the device libraries. I think it makes sense to have separate read and write tokens in the future, but for now we should keep it simple with a single token. I suggest we always generate a token for a project so the user does not have to activate it explicitly (as they have to at the moment). Currently, I think @ilteen would be the ideal person to look into it. Since python is excellent at parsing JSON objects, it should be as simple as returning the project based on the API token and parsing it into a Python object, plus some small frontend and backend changes.
backend abstractions: we will have to carefully plan the architecture for the machine learning backend. The main challenge is to have an API that stays consistent across frameworks. We will have to define requirements first and then iterate on them. @tk-king will address this as part of his master's thesis. I think translating hyperparameter settings to the frontend for each framework will be the main challenge. Yexu is also working on an AutoML project with minimal setup requirements [edge-ml Pro version incoming? Your configured model achieves 74% accuracy, but use edge-ml Pro to get 99.9%].
simple classifier: there is a very good tutorial on how to generate models using TensorFlow Lite available here. Obviously the model is very basic, but we should think about how far we can reuse what's already there for the first version.
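The data-access idea described above (resolve a project from its API token and parse the JSON into plain Python objects) could be as small as this sketch. The route `/api/project` and the plain `Authorization` header are assumptions, not the real edge-ml API:

```python
import json
from urllib import request


def fetch_project(base_url, api_token):
    """Download a project's data as plain Python objects.

    The route and header name here are assumptions; the point is that
    the project token alone should be enough to resolve the project.
    """
    req = request.Request(
        f'{base_url}/api/project',  # hypothetical route
        headers={'Authorization': api_token},
    )
    with request.urlopen(req) as resp:
        # Nested dicts/lists, ready to hand to pandas/numpy.
        return json.load(resp)
```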
Next Next Sprint (~ 16 weeks)
neural architecture search: Yexu already has a tool for it, and we will have to see how we can integrate it into our process. From what I have heard, its main focus is on image data, but let's see.
data pre-processing: in many cases it makes sense to pre-process the data, but doing that flexibly comes with many challenges.
feature extraction: I tried to find a feature extraction library for Arduino (didn't look for it too long) but couldn't find one. I talked to @riedel about this, and we are thinking about running a Hackathon to build an initial feature extraction library that we can use on Arduino and also call from python on edge-ml. This way we could retain the logic across platforms.
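For the hackathon scope, such a shared library could start from a few loop-only time-domain features, so the same logic translates 1:1 to Arduino C. The feature choice below is only an illustration:

```python
import math


def window_features(window):
    """Basic time-domain features for one sensor window.

    Uses only loops and math (no numpy) so the identical logic can be
    ported to Arduino C; the feature set itself is illustrative.
    """
    n = len(window)
    mean = sum(window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    return {
        'mean': mean,
        'std': math.sqrt(var),
        'min': min(window),
        'max': max(window),
        'rms': math.sqrt(sum(v * v for v in window) / n),
    }
```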
Next Next Next Sprint (~ 24 weeks)
Other Stuff