RolnickLab / ami-platform

GNU General Public License v3.0

New async & distributed ML backend #515

Open mihow opened 4 weeks ago

mihow commented 4 weeks ago

@kaviecos and @mihow have designed & written the specifications for a new ML backend that orchestrates multiple types of models from different research teams, across multiple stages of processing, and is horizontally scalable. This expands on the current ML backend API (defined at https://ml.dev.insectai.org/) by adding asynchronous processing, a controller & queue system, auth, and many other production features.

The initial spec and notes are here, but they are being rewritten in the Aarhus GitLab wiki as the backend is developed: https://docs.google.com/document/d/1caKxxfZhWhRi9Jfv9fy5fVeoM9bvhYPJ/

Docs in progress: https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Getting-Started https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Pipeline-Stages https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Architecture-Overview

[Diagram: pipeline_controller_flow (draw.io)]

Known remaining tasks:

mihow commented 4 weeks ago

Controller can be tested here: https://preview.ami.ecoscience.dk/swagger-ui/index.html#/pipeline-controller/createRequest

Example request:

```json
{
  "projectId": "ecos",
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "sourceImages": [
    {
      "id": "NScxODE3NzEyMwo=",
      "url": "https://anon.erda.au.dk/share_redirect/DSWDMAO70L/ias/denmark/DK1/2023_07_05/20230705000135-00-07.jpg",
      "eventId": "1234"
    }
  ],
  "pipelineConfig": {
    "stages": [
      {
        "stage": "OBJECT_DETECTION",
        "stageImplementation": "flatbug"
      },
      {
        "stage": "CLASSIFICATION",
        "stageImplementation": "mcc24"
      }
    ]
  },
  "callback": {
    "callbackUrl": "http://127.0.0.1:8080/example/callback",
    "callbackToken": "1234"
  }
}
```
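For reference, here is a minimal Python sketch of building and posting such a request. The payload field names match the example above; the helper names and the endpoint path are assumptions (use the Swagger UI linked above for the real path).

```python
import json
import urllib.request

def build_pipeline_request(project_id, job_id, images, stages,
                           callback_url, callback_token):
    """Build a pipeline-controller request payload (hypothetical helper).

    `images` is a list of dicts with id/url/eventId; `stages` is a list of
    (stage, implementation) pairs, e.g. ("OBJECT_DETECTION", "flatbug").
    """
    return {
        "projectId": project_id,
        "jobId": job_id,
        "sourceImages": [
            {"id": i["id"], "url": i["url"], "eventId": i["eventId"]}
            for i in images
        ],
        "pipelineConfig": {
            "stages": [
                {"stage": stage, "stageImplementation": impl}
                for stage, impl in stages
            ]
        },
        "callback": {"callbackUrl": callback_url, "callbackToken": callback_token},
    }

def submit(payload, endpoint):
    # NOTE: the exact endpoint URL/path should be taken from the Swagger UI.
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```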

The callback can be inspected using Ngrok locally:

Image that causes a truncated request: https://static.dev.insectai.org/ami-trapdata/Panama/E43B615A/20231113004009-snapshot.jpg

kaviecos commented 4 weeks ago

There are a few refactorings I'm considering in the PipelineController.

  1. The version on detections in the callback is not set. The reason is that it's not obvious which value to set it to: the bounding box comes from flatbug, each classification comes from a different classifier, and CNN features and crops might come from other stages.
  2. Stage in the PipelineConfig is currently unused by the controller. I originally included it because I thought different stages might need to be handled differently, but I think the current solution is cleaner: every stage uses the same interface. One problem with handling stages differently is that a single stage implementation may perform multiple operations on the image.
  3. ImageCrop: originally I thought of the pipeline as ObjectDetection -> ImageCrop -> Classification, but I realized that we don't need to crop in order to classify. In fact, classification performs much better when the entire source image is used together with the bounding boxes. So maybe cropping should be a post-pipeline step? It could also be added as a stage that sets the cropUrl on the detection.
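If cropping does become a post-pipeline step (or an optional stage), the core of it is just coordinate arithmetic. A small sketch, assuming bounding boxes arrive as pixel coordinates `[x1, y1, x2, y2]` on the full source image (the padding parameter is illustrative, not part of the spec):

```python
def crop_box(bbox, image_size, padding=0.1):
    """Expand a detection bbox by `padding` (fraction of its width/height)
    and clamp it to the image bounds, returning integer pixel coordinates
    suitable for cropping the source image."""
    x1, y1, x2, y2 = bbox
    width, height = image_size
    pad_x = (x2 - x1) * padding
    pad_y = (y2 - y1) * padding
    return (
        max(0, int(x1 - pad_x)),
        max(0, int(y1 - pad_y)),
        min(width, int(x2 + pad_x)),
        min(height, int(y2 + pad_y)),
    )
```

The clamping matters at image edges, where a padded box would otherwise run outside the frame.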
mihow commented 3 weeks ago

@kaviecos Have you considered adding a status endpoint on the controller? If a callback is missed, or a job is taking a long time, it would be nice for the client to be able to request the status of a request: PROCESSING, FAILED, WAITING, 2/6 complete, etc.

Also, will you document the types of failures and what they will look like in a callback?

I just added these as subtasks as well.
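To make the suggestion concrete, here is one hypothetical shape such a status response could take; the statuses mirror the ones mentioned above (PROCESSING, FAILED, WAITING, plus a COMPLETED terminal state) and the field names are assumptions, not part of any agreed spec:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class JobStatus(Enum):
    WAITING = "WAITING"
    PROCESSING = "PROCESSING"
    FAILED = "FAILED"
    COMPLETED = "COMPLETED"

@dataclass
class StatusResponse:
    job_id: str
    status: JobStatus
    stages_done: int          # e.g. the "2" in "2/6 complete"
    stages_total: int         # e.g. the "6" in "2/6 complete"
    error: Optional[str] = None  # populated for FAILED, mirroring a failure callback

def parse_status(body: dict) -> StatusResponse:
    """Parse a hypothetical GET /status response body."""
    return StatusResponse(
        job_id=body["jobId"],
        status=JobStatus(body["status"]),
        stages_done=body.get("stagesDone", 0),
        stages_total=body.get("stagesTotal", 0),
        error=body.get("error"),
    )
```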

mihow commented 3 weeks ago

NOTES from call with @mihow and @kaviecos 2024-08-20

Docs for OpenAI's batch API: https://platform.openai.com/docs/guides/batch/getting-started

mihow commented 3 weeks ago

@kaviecos Let's also keep in mind the scenario where we have stage implementations running behind a firewall that cannot have a publicly accessible endpoint. Our major providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like we are using now. For a big job like Biodiversa+'s data, we can request a bunch of GPUs and provide the Docker container to run via a SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results. But they won't have a publicly accessible endpoint (unless we can use a tunnel?)

kaviecos commented 3 weeks ago

> @kaviecos Let's also keep in mind the scenario where we have stage implementations running behind a firewall that cannot have a publicly accessible endpoint. Our major providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like we are using now. For a big job like Biodiversa+'s data, we can request a bunch of GPUs and provide the Docker container to run via a SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results. But they won't have a publicly accessible endpoint (unless we can use a tunnel?)

@mihow The current implementation actually allows for consumers hosted on other servers. The requirement is that they can make an outbound connection to the RabbitMQ server (port 5672); all communication with the controller can then go through RabbitMQ. Of course, they also need to be able to access the source images.
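The pull model is what makes this work behind a firewall: the worker only ever makes outbound connections. A stdlib-only simulation of that loop (in a real deployment `queue.get` would be replaced by consuming from RabbitMQ, e.g. via a client library; names here are illustrative):

```python
import queue

def run_worker(tasks: queue.Queue, results: queue.Queue, process):
    """Drain `tasks`, pushing {"taskId", "result"} dicts onto `results`.

    Stands in for a firewalled consumer: it only *pulls* work and *pushes*
    results over outbound connections, so no inbound endpoint is needed.
    """
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break  # no more work; a real consumer would block or reconnect
        results.put({"taskId": task["taskId"], "result": process(task)})
        tasks.task_done()
```

A SLURM-scheduled container could run exactly this shape of loop for the lifetime of its allocation, then exit.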

When it comes to monitoring, this also means we need to monitor both the consumers and the stage implementations. And if we cannot make inbound HTTP requests, we need to factor that into the monitoring solution, perhaps with a push-based approach (last_seen, as you mentioned). It would be nice to know more about the restrictions, e.g. is it even possible to connect to RabbitMQ?
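A push-based last_seen scheme is simple to sketch: each consumer periodically reports in over an outbound connection, and the controller flags anyone whose last heartbeat is older than a threshold. All names below are illustrative, not part of the controller's API:

```python
import time

class HeartbeatMonitor:
    """Track last_seen timestamps pushed by consumers and flag stale ones."""

    def __init__(self, stale_after_s=60.0):
        self.stale_after_s = stale_after_s
        self.last_seen = {}  # consumer_id -> monotonic timestamp

    def beat(self, consumer_id, now=None):
        """Record a heartbeat (called when a consumer pushes last_seen)."""
        self.last_seen[consumer_id] = time.monotonic() if now is None else now

    def stale_consumers(self, now=None):
        """Return consumers not seen within the staleness window."""
        now = time.monotonic() if now is None else now
        return [cid for cid, t in self.last_seen.items()
                if now - t > self.stale_after_s]
```

The same record could carry per-stage health details, since the consumers sit closest to the stage implementations.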