Tech4Tracing / CartridgeOCR

https://tech4Tracing.org

Migrate to k8s and figure out cost-effective API deployment #107

Closed: simra closed this issue 1 year ago

monkeypants commented 1 year ago

Is this about making the ML model deployments more cost-effective, or the rest of it (front-end, Flask microservice, celery, DB, redis)? If it's about the prediction endpoint...

What if we put the celery worker and prediction endpoint in the same worker (docker-composition / host), and had some lightweight supervisor process that shut it down when there was no workload (i.e. when the queue had been empty for a minute), and woke it up again when the queue had been non-empty for longer than some threshold (e.g. 3 minutes, assuming it takes less than 2 minutes to boot up)? So when the supervisor woke the worker up, it would have both the prediction endpoint and the celery worker munching through the queue, getting predictions done until there were no more, and then it would go back to sleep. Yes, I'm sure there is a K8s way to do this, but maybe an Azure-native way too? Something for the cloud guys to look at...
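
A minimal sketch of what that supervisor could look like, assuming a Redis broker, the default `celery` queue, and a docker compose service named `prediction` (all placeholder names, not the project's actual ones):

```python
# Hypothetical supervisor sketch: poll the Celery queue depth in Redis and
# start/stop the worker+prediction container when the queue fills/empties.
# REDIS_URL, QUEUE_NAME and the "prediction" compose service are placeholders.
import subprocess
import time

import redis

REDIS_URL = "redis://localhost:6379/0"   # placeholder broker URL
QUEUE_NAME = "celery"                    # default Celery queue key in Redis
WAKE_AFTER = 3 * 60                      # queue non-empty this long -> start worker
SLEEP_AFTER = 60                         # queue empty this long -> stop worker
POLL_INTERVAL = 10


def set_worker(running: bool) -> None:
    # "prediction" is a placeholder docker compose service name.
    if running:
        cmd = ["docker", "compose", "up", "-d", "prediction"]
    else:
        cmd = ["docker", "compose", "stop", "prediction"]
    subprocess.run(cmd, check=True)


def main() -> None:
    conn = redis.Redis.from_url(REDIS_URL)
    worker_up = False
    state_empty = True
    state_since = time.monotonic()
    while True:
        empty = conn.llen(QUEUE_NAME) == 0
        if empty != state_empty:
            # Queue state flipped; restart the clock.
            state_empty = empty
            state_since = time.monotonic()
        elapsed = time.monotonic() - state_since
        if not state_empty and not worker_up and elapsed >= WAKE_AFTER:
            set_worker(True)
            worker_up = True
        elif state_empty and worker_up and elapsed >= SLEEP_AFTER:
            set_worker(False)
            worker_up = False
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()
```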

This way, if one of us had a cheap but unreliable celery/prediction worker (e.g. on a workstation), the queue would seldom if ever be "non-empty for some time" unless things were getting really busy with users, so the supervisor would seldom need to wake up the worker and start spending real money. Unless there was some outage of the workstation, in which case (if anyone was actually using the app) the supervisor would pick up the slack.

simra commented 1 year ago

This makes sense. The two main cost drivers are redis and the prediction API, which is too big an image for a B1 plan. The celery worker is already running in the same container as the flask service, and all it does is receive messages, send POST requests to the API, and write the result to the SQL back-end. We could run the celery broker locally (I tried rabbitmq instead of redis but had issues) and trigger the prediction service to spin up only when we need it. I also had the thought to just migrate the prediction API to Azure Functions, but I'm not familiar enough with them to know whether that would work the way I think it would.
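
For illustration, an HTTP-triggered Azure Function in the Python v2 programming model would have roughly this shape; this is only a sketch (the `run_prediction` helper is a placeholder), and whether the model's image size and cold-start time fit a Functions plan is exactly the open question:

```python
# Hypothetical sketch only: the prediction endpoint as an HTTP-triggered
# Azure Function (Python v2 programming model). run_prediction() is a
# placeholder, not the project's real inference code.
import json

import azure.functions as func

app = func.FunctionApp()


def run_prediction(image_bytes: bytes) -> dict:
    # Placeholder: the real function would load the OCR model and run inference.
    return {"detections": []}


@app.route(route="predict", methods=["POST"], auth_level=func.AuthLevel.FUNCTION)
def predict(req: func.HttpRequest) -> func.HttpResponse:
    result = run_prediction(req.get_body())
    return func.HttpResponse(json.dumps(result), mimetype="application/json")
```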

monkeypants commented 1 year ago

Oh, the broker shouldn't be expensive; it doesn't have to work very hard. One cheeky idea would be to swap redis for AWS SQS as the broker, keeping SQLAlchemy as the results backend... but surely there's a cost-effective Azure way.
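
As a sketch of that swap in Celery config (the URLs, region, and app name here are placeholders, and SQS support needs the `celery[sqs]` extra):

```python
# Hypothetical Celery configuration using SQS as the broker and SQLAlchemy
# (the existing SQL database) as the result backend. URLs and credentials are
# placeholders; requires `pip install "celery[sqs]" sqlalchemy`.
from celery import Celery

app = Celery(
    "annotations_app",                 # placeholder app name
    broker="sqs://",                   # AWS credentials come from the environment/IAM role
    backend="db+postgresql://user:password@db-host/cartridgeocr",
)

app.conf.broker_transport_options = {
    "region": "us-east-1",             # placeholder region
    "visibility_timeout": 3600,        # seconds a task stays invisible after pickup
    "polling_interval": 10,            # seconds between SQS polls
}
```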

I think we need one broker that is well known and in one place, even if the workers are sliding all around the internet, because it will need to be at a well-known address for all those sliding workers and for the flask API. Unless we make a broker-broker (e.g. a lookup in the WK DB table), which would be rather Rube Goldbergish.

If we keep the worker with the flask app, then we would need a different kind of broker between the worker and the sliding prediction APIs. It could be a DB table, but it's still kinda messy because of the networking. If the worker is adjacent to the prediction API, it would be initiating the connection to the broker in the easy direction.

What branch should I be looking at BTW? I grepped for celery in main and realised it wasn't there.

simra commented 1 year ago

> What branch should I be looking at BTW? I grepped for celery in main and realised it wasn't there.

Celery should be used in CartridgeOCR/main; maybe you didn't pull? The celery task is defined here: https://github.com/Tech4Tracing/CartridgeOCR/blob/ce41c7f2d8cdc4f6a173862615147da7d4f47784/src/annotation/annotations_app/tasks/predict.py#L28

> If we keep the worker with the flask app, then we would need a different kind of broker between the worker and the sliding prediction APIs.

I'm not sure it needs to be kept separate, because the prediction API knows nothing about the broker: the annotations flask app queues up tasks, and the celery worker in the same container dequeues them and calls over the internet to the prediction container. The call blocks until prediction completes, and then the celery worker pushes the result directly into the database.
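
In other words, the task is roughly this shape (a simplified sketch, not the actual annotations_app/tasks/predict.py; the broker URL, DB URL, endpoint URL, and table layout are all placeholders):

```python
# Simplified sketch of the flow described above, not the real
# annotations_app/tasks/predict.py. All URLs and the predictions table
# are placeholders.
import requests
import sqlalchemy as sa
from celery import Celery

celery_app = Celery("annotations_app", broker="redis://localhost:6379/0")
engine = sa.create_engine("postgresql://user:password@db-host/cartridgeocr")
PREDICTION_API_URL = "https://prediction.example.com/predict"


@celery_app.task
def predict(annotation_id: str, image_url: str) -> None:
    # Blocking POST to the prediction container; the worker just waits here.
    resp = requests.post(PREDICTION_API_URL, json={"image_url": image_url}, timeout=600)
    resp.raise_for_status()

    # Push the result straight into the SQL back-end.
    with engine.begin() as conn:
        conn.execute(
            sa.text("INSERT INTO predictions (annotation_id, result) VALUES (:id, :result)"),
            {"id": annotation_id, "result": resp.text},
        )
```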

I originally tried rabbitmq as the broker, which would probably cost less than keeping a running instance of redis, but I had issues getting the image to spin up in the deployed app instance. With some fiddly work we could probably get it running.

simra commented 1 year ago

BTW the Pulumi settings are in t4t-infrastructure, branch simra/azure.

simra commented 1 year ago

Fixed for the time being in PR #129