mihow opened 3 months ago
Controller can be tested here: https://preview.ami.ecoscience.dk/swagger-ui/index.html#/pipeline-controller/createRequest
Example request:
{
  "projectId": "ecos",
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "sourceImages": [
    {
      "id": "NScxODE3NzEyMwo=",
      "url": "https://anon.erda.au.dk/share_redirect/DSWDMAO70L/ias/denmark/DK1/2023_07_05/20230705000135-00-07.jpg",
      "eventId": "1234"
    }
  ],
  "pipelineConfig": {
    "stages": [
      {
        "stage": "OBJECT_DETECTION",
        "stageImplementation": "flatbug"
      },
      {
        "stage": "CLASSIFICATION",
        "stageImplementation": "mcc24"
      }
    ]
  },
  "callback": {
    "callbackUrl": "http://127.0.0.1:8080/example/callback",
    "callbackToken": "1234"
  }
}
The callback can be inspected locally using Ngrok:

ngrok http 2222

Set the callbackUrl in the sample request to the generated ngrok URL.

Image that causes a truncated request: https://static.dev.insectai.org/ami-trapdata/Panama/E43B615A/20231113004009-snapshot.jpg
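For reference, a minimal sketch of a local receiver for inspecting callbacks, using only the Python standard library. The port matches the ngrok command above; the callback payload schema is not assumed here, the handler just prints whatever arrives:

from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and print the raw callback body; no assumptions are made
        # about the payload schema, this is purely for inspection.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(self.headers)
        print(body.decode("utf-8", errors="replace"))
        self.send_response(200)
        self.end_headers()

# Port 2222 matches the `ngrok http 2222` command above.
HTTPServer(("127.0.0.1", 2222), CallbackHandler).serve_forever()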
There are a few refactorings I'm considering in the PipelineController.

Stage in the PipelineConfig is currently unused by the controller. I originally included it because I thought different stages might be handled differently, but I think the current solution is cleaner: every stage uses the same interface. One problem with handling different stages differently is that one stage implementation may perform multiple operations on the image.

ImageCrop: Originally I thought of the pipeline as ObjectDetection -> ImageCrop -> Classification. But I realized that we don't need to crop in order to do classification. It actually performs a lot better if the entire source image is used with bounding boxes. So maybe cropping is a post-pipeline step? It could also be added as a stage that sets the cropUrl on the detection, as sketched below.
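If cropping were added as a stage, the pipelineConfig might look like the sketch below. Note that the IMAGE_CROP stage name and its implementation name are hypothetical; only OBJECT_DETECTION and CLASSIFICATION appear in the sample request above:

"pipelineConfig": {
  "stages": [
    { "stage": "OBJECT_DETECTION", "stageImplementation": "flatbug" },
    { "stage": "IMAGE_CROP", "stageImplementation": "default-crop" },
    { "stage": "CLASSIFICATION", "stageImplementation": "mcc24" }
  ]
}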
@kaviecos Have you considered adding a status endpoint on the controller? It would be nice, if a callback is missed or a job is taking a long time, for the client to be able to request the status of a request: PROCESSING, FAILED, WAITING, 2/6 complete, etc. (see the sketch below).
Also, will you document the types of failures and what they will look like in a callback?
I just added these as subtasks as well.
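For illustration, a response from such a status endpoint might look something like the sketch below. All field names are hypothetical; the status values are the ones suggested above, and the jobId is reused from the sample request:

{
  "jobId": "ea12ac70-288c-11ef-9ca5-00155d926c42",
  "status": "PROCESSING",
  "stagesCompleted": 2,
  "stagesTotal": 6
}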
NOTES from call with @mihow and @kaviecos 2024-08-20
Docs for OpenAI's Batch API: https://platform.openai.com/docs/guides/batch/getting-started
@kaviecos Let's also keep in mind the scenario where we have stage implementations running behind a firewall that cannot have a publicly accessible endpoint. Our major compute providers (Compute Canada and the UK's JASMIN) both have far more compute available to us, but not on persistent VMs like we are using now. For a big job like Biodiversa+'s data, we can request a bunch of GPUs and provide the Docker container to run via a SLURM job scheduler. Each job can access the internet, so it can pull from the queue and send back results. But the jobs won't have a publicly accessible endpoint (unless we can use a tunnel?)
@mihow The current implementation actually allows for consumers hosted on other servers. The requirements are that they can make an outbound connection to the RabbitMQ server (port 5672). Then all communication with the controller can go through RabbitMQ. Of course they also need to be able to access the source images.
When it comes to monitoring, this also means that we need to monitor both the consumers and the stage implementations. And if we cannot make inbound HTTP requests, we need to factor that into the monitoring solution. Maybe a push-based solution (last_seen, as you mentioned). It would be nice to know more about the restrictions - like, is it even possible to connect to RabbitMQ?
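To make the pull-based model concrete, here is a minimal sketch of a consumer that only makes outbound connections, using the Python pika client. The host, queue names, and message shape are assumptions for illustration, and run_stage stands in for whatever inference the stage implementation actually performs:

import json
import pika

def run_stage(job):
    # Hypothetical stand-in for the stage implementation's actual
    # work (e.g. object detection or classification on the source image).
    return {"jobId": job.get("jobId"), "result": "..."}

# Single outbound connection to RabbitMQ on port 5672;
# no inbound endpoint or public URL is needed.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.org", port=5672)
)
channel = connection.channel()

def handle_job(ch, method, properties, body):
    job = json.loads(body)
    result = run_stage(job)
    # Results go back over the same outbound connection.
    ch.basic_publish(exchange="", routing_key="stage.results",
                     body=json.dumps(result))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="stage.jobs", on_message_callback=handle_job)
channel.start_consuming()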
@kaviecos and @mihow have designed & written the specifications for a new ML backend that orchestrates multiple types of models by different research teams, across multiple stages of processing, and is horizontally scalable. This expands on the current ML backend API defined here https://ml.dev.insectai.org/ by adding asynchronous processing, a controller & queue system, auth and many other production features.
The initial spec and notes are here, but are being rewritten in the Aarhus GitLab wiki as the backend is developed: https://docs.google.com/document/d/1caKxxfZhWhRi9Jfv9fy5fVeoM9bvhYPJ/
Docs in progress:
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Getting-Started
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Pipeline-Stages
https://gitlab.au.dk/ecos/hoyelab/ami/ami-pipeline-controller/-/wikis/Architecture-Overview
Known remaining tasks: