RolnickLab / ami-platform

GNU General Public License v3.0

New async & distributed ML backend #515

Open mihow opened 4 weeks ago

mihow commented 4 weeks ago

@kaviecos and @mihow have designed & written the specifications for a new ML backend that orchestrates multiple types of models from different research teams, across multiple stages of processing, and is horizontally scalable. This expands on the current ML backend API (defined at https://ml.dev.insectai.org/) by adding asynchronous processing, a controller & queue system, auth, and many other production features.

The initial spec and notes are here, but they are being rewritten in the Aarhus GitLab wiki as the backend is developed: https://docs.google.com/document/d/1caKxxfZhWhRi9Jfv9fy5fVeoM9bvhYPJ/

Docs in progress: https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Getting-Started https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Pipeline-Stages https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Architecture-Overview

[Diagram: pipeline_controller_flow (draw.io)]

Known remaining tasks:

mihow commented 4 weeks ago

Controller can be tested here: https://preview.ami.ecoscience.dk/swagger-ui/index.html#/pipeline-controller/createRequest

Example request:

```json
{
  "projectId": "ecos",
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "sourceImages": [
    {
      "id": "NScxODE3NzEyMwo=",
      "url": "https://anon.erda.au.dk/share_redirect/DSWDMAO70L/ias/denmark/DK1/2023_07_05/20230705000135-00-07.jpg",
      "eventId": "1234"
    }
  ],
  "pipelineConfig": {
    "stages": [
      {
        "stage": "OBJECT_DETECTION",
        "stageImplementation": "flatbug"
      },
      {
        "stage": "CLASSIFICATION",
        "stageImplementation": "mcc24"
      }
    ]
  },
  "callback": {
    "callbackUrl": "http://127.0.0.1:8080/example/callback",
    "callbackToken": "1234"
  }
}
```
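For reference, here is a minimal Python sketch of building and posting such a request. The payload field names match the example above; the helper names and the endpoint path are assumptions (use the Swagger UI linked above for the real path).

```python
import json
import urllib.request

def build_pipeline_request(project_id, job_id, images, stages,
                           callback_url, callback_token):
    """Build a pipeline-controller request payload (hypothetical helper).

    `images` is a list of dicts with id/url/eventId; `stages` is a list of
    (stage, implementation) pairs, e.g. ("OBJECT_DETECTION", "flatbug").
    """
    return {
        "projectId": project_id,
        "jobId": job_id,
        "sourceImages": [
            {"id": i["id"], "url": i["url"], "eventId": i["eventId"]}
            for i in images
        ],
        "pipelineConfig": {
            "stages": [
                {"stage": stage, "stageImplementation": impl}
                for stage, impl in stages
            ]
        },
        "callback": {"callbackUrl": callback_url, "callbackToken": callback_token},
    }

def submit(payload, endpoint):
    # NOTE: the exact endpoint URL/path should be taken from the Swagger UI.
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```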

The callback can be inspected using Ngrok locally:

Image that causes a truncated request: https://static.dev.insectai.org/ami-trapdata/Panama/E43B615A/20231113004009-snapshot.jpg

kaviecos commented 4 weeks ago

There are a few refactorings I'm considering in the PipelineController.

  1. The version on detections in the callback is not set. The reason is that it's not obvious which value to set it to: the bounding box comes from flatbug, each classification comes from a different classifier, and CNN features and crops might come from other stages.
  2. Stage in the PipelineConfig is currently unused by the controller. I originally included it because I thought different stages might need to be handled differently, but I think the current solution is cleaner: every stage uses the same interface. One problem with handling stages differently is that a single stage implementation may perform multiple operations on the image.
  3. ImageCrop: originally I thought of the pipeline as ObjectDetection -> ImageCrop -> Classification, but I realized that we don't need to crop in order to classify. In fact, classification performs much better when the entire source image is used together with the bounding boxes. So maybe cropping should be a post-pipeline step? It could also be added as a stage that sets the cropUrl on the detection.
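If cropping does become a post-pipeline step (or an optional stage), the core of it is just coordinate arithmetic. A small sketch, assuming bounding boxes arrive as pixel coordinates `[x1, y1, x2, y2]` on the full source image (the padding parameter is illustrative, not part of the spec):

```python
def crop_box(bbox, image_size, padding=0.1):
    """Expand a detection bbox by `padding` (fraction of its width/height)
    and clamp it to the image bounds, returning integer pixel coordinates
    suitable for cropping the source image."""
    x1, y1, x2, y2 = bbox
    width, height = image_size
    pad_x = (x2 - x1) * padding
    pad_y = (y2 - y1) * padding
    return (
        max(0, int(x1 - pad_x)),
        max(0, int(y1 - pad_y)),
        min(width, int(x2 + pad_x)),
        min(height, int(y2 + pad_y)),
    )
```

The clamping matters at image edges, where a padded box would otherwise run outside the frame.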
mihow commented 3 weeks ago

@kaviecos Have you considered adding a status endpoint on the controller? If a callback is missed, or a job is taking a long time, it would be nice for the client to be able to request the status of a request: PROCESSING, FAILED, WAITING, 2/6 complete, etc.

Also, will you document the types of failures and what they will look like in a callback?

I just added these as subtasks as well.
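To make the suggestion concrete, here is one hypothetical shape such a status response could take; the statuses mirror the ones mentioned above (PROCESSING, FAILED, WAITING, plus a COMPLETED terminal state) and the field names are assumptions, not part of any agreed spec:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class JobStatus(Enum):
    WAITING = "WAITING"
    PROCESSING = "PROCESSING"
    FAILED = "FAILED"
    COMPLETED = "COMPLETED"

@dataclass
class StatusResponse:
    job_id: str
    status: JobStatus
    stages_done: int          # e.g. the "2" in "2/6 complete"
    stages_total: int         # e.g. the "6" in "2/6 complete"
    error: Optional[str] = None  # populated for FAILED, mirroring a failure callback

def parse_status(body: dict) -> StatusResponse:
    """Parse a hypothetical GET /status response body."""
    return StatusResponse(
        job_id=body["jobId"],
        status=JobStatus(body["status"]),
        stages_done=body.get("stagesDone", 0),
        stages_total=body.get("stagesTotal", 0),
        error=body.get("error"),
    )
```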

mihow commented 3 weeks ago

NOTES from call with @mihow and @kaviecos 2024-08-20

Docs for OpenAI's batch API: https://platform.openai.com/docs/guides/batch/getting-started

mihow commented 3 weeks ago

@kaviecos Let's also keep in mind the scenario where we have stage implementations running behind a firewall that cannot have a publicly accessible endpoint. Our major providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like we are using now. For a big job like Biodiversa+'s data, we can request a bunch of GPUs and provide the Docker container to run via a SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results. But they won't have a publicly accessible endpoint (unless we can use a tunnel?)

kaviecos commented 3 weeks ago

> @kaviecos Let's also keep in mind the scenario where we have stage implementations running behind a firewall that cannot have a publicly accessible endpoint. Our major providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like we are using now. For a big job like Biodiversa+'s data, we can request a bunch of GPUs and provide the Docker container to run via a SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results. But they won't have a publicly accessible endpoint (unless we can use a tunnel?)

@mihow The current implementation actually allows for consumers hosted on other servers. The requirement is that they can make an outbound connection to the RabbitMQ server (port 5672); all communication with the controller can then go through RabbitMQ. Of course, they also need to be able to access the source images.
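The pull model is what makes this work behind a firewall: the worker only ever makes outbound connections. A stdlib-only simulation of that loop (in a real deployment `queue.get` would be replaced by consuming from RabbitMQ, e.g. via a client library; names here are illustrative):

```python
import queue

def run_worker(tasks: queue.Queue, results: queue.Queue, process):
    """Drain `tasks`, pushing {"taskId", "result"} dicts onto `results`.

    Stands in for a firewalled consumer: it only *pulls* work and *pushes*
    results over outbound connections, so no inbound endpoint is needed.
    """
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break  # no more work; a real consumer would block or reconnect
        results.put({"taskId": task["taskId"], "result": process(task)})
        tasks.task_done()
```

A SLURM-scheduled container could run exactly this shape of loop for the lifetime of its allocation, then exit.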

When it comes to monitoring, this also means we need to monitor both the consumers and the stage implementations. And if we cannot make inbound HTTP requests, we need to factor that into the monitoring solution, perhaps with a push-based approach (last_seen, as you mentioned). It would be nice to know more about the restrictions, e.g. is it even possible to connect to RabbitMQ?
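A push-based last_seen scheme is simple to sketch: each consumer periodically reports in over an outbound connection, and the controller flags anyone whose last heartbeat is older than a threshold. All names below are illustrative, not part of the controller's API:

```python
import time

class HeartbeatMonitor:
    """Track last_seen timestamps pushed by consumers and flag stale ones."""

    def __init__(self, stale_after_s=60.0):
        self.stale_after_s = stale_after_s
        self.last_seen = {}  # consumer_id -> monotonic timestamp

    def beat(self, consumer_id, now=None):
        """Record a heartbeat (called when a consumer pushes last_seen)."""
        self.last_seen[consumer_id] = time.monotonic() if now is None else now

    def stale_consumers(self, now=None):
        """Return consumers not seen within the staleness window."""
        now = time.monotonic() if now is None else now
        return [cid for cid, t in self.last_seen.items()
                if now - t > self.stale_after_s]
```

The same record could carry per-stage health details, since the consumers sit closest to the stage implementations.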