I see that the proposal suggests using KEDA for autoscaling and I think this is a good idea. That's what we use in the Apache Airflow official Helm chart for autoscaling Celery workers.
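For reference, here is a minimal sketch of the kind of KEDA ScaledObject that could target a worker Deployment, written as the plain Python dict an operator handler might submit rather than YAML. The Deployment name, queue name, and numbers are placeholders, and the trigger metadata fields follow KEDA's RabbitMQ scaler but should be double-checked against the KEDA version actually used:

```python
# Illustrative only: a ScaledObject for a hypothetical "celery-worker" Deployment.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "celery-worker-scaler"},
    "spec": {
        "scaleTargetRef": {"name": "celery-worker"},  # worker Deployment to scale
        "minReplicaCount": 1,
        "maxReplicaCount": 10,
        "triggers": [
            {
                "type": "rabbitmq",
                "metadata": {
                    "queueName": "celery",   # default Celery queue
                    "queueLength": "10",     # target messages per replica
                    # In practice the broker URL would come from a Secret /
                    # TriggerAuthentication rather than being inlined here.
                    "host": "amqp://guest:guest@rabbitmq.default.svc:5672/",
                },
            }
        ],
    },
}
```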
However, there's one thing to remember when scaling down the number of Celery workers: terminationGracePeriodSeconds. I assume that all users expect workers to be terminated gracefully, meaning that no running task will be lost or killed. This is particularly problematic with long-running tasks.
https://www.polidea.com/blog/application-scalability-kubernetes/#tips-for-hpa
Is there a way to notify Kubernetes to extend the pod's termination grace period?
> Is there a way to notify Kubernetes to extend the pod's termination grace period?
@thedrow users can adjust this parameter in the pod definition, but it will differ from use case to use case. One possible hack is to set it to infinity and then terminate the pod using a preStop hook - but it's a hack and may affect cluster behavior (see the sketch below). That was the main problem we had in Apache Airflow: tasks executed on Celery workers can take an arbitrarily long time (from seconds to hours).
Here is a discussion about this problem and workarounds from the Kubernetes community: https://github.com/kubernetes/kubernetes/issues/45509 and two related KEPs: https://github.com/kubernetes/enhancements/pull/1828 https://github.com/kubernetes/enhancements/pull/1888
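To illustrate the workaround described above, a rough sketch of a worker pod spec (again as the plain dict a Python handler might render). The image, the app module name "proj", and the sleep placeholder are all made up; a real preStop hook would poll `celery -A proj inspect active` for this worker and only return once it reports no running tasks:

```python
# Illustrative pod spec fragment for a worker that should not be killed mid-task.
worker_pod_spec = {
    # A very large value stands in for "infinity"; the pod is expected to exit
    # on its own once the preStop hook has let in-flight tasks drain.
    "terminationGracePeriodSeconds": 86400,
    "containers": [
        {
            "name": "celery-worker",
            "image": "example/celery-app:latest",
            "args": ["celery", "-A", "proj", "worker", "--loglevel=INFO"],
            "lifecycle": {
                "preStop": {
                    # Placeholder command; replace with a script that waits until
                    # this worker has no active tasks before returning.
                    "exec": {"command": ["/bin/sh", "-c", "sleep 30"]}
                }
            },
        }
    ],
}
```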
I'll review most of the style later as this CEP needs to be formatted according to the template.
@thedrow I've updated the CEP to strictly reflect the template. Kept the Future Work section as it is since you mentioned it'll be good to have.
Thanks for reviewing in detail. I'll get back on these comments with some questions/inputs as soon as I can.
* Operator should be aware of the number of processes/greenlets the worker is running and must make autoscaling decisions based on that. If we use greenlets, we only utilize one CPU per process and our main resource is I/O. Further work could be done there when Cilium or another network management platform is installed. We should investigate this further. If we use the prefork pool, we should count the number of workers to calculate concurrency.
Yeah. So in the case of the prefork pool, we could calculate the number of worker pods by taking the minimum of the CPU cores and the concurrency value set by the user into account. For the solo pool, we can only increase the number of pods, I think.
For greenlets, we could accept min and max concurrency values from the CRD, similar to what Celery already does, and only raise the concurrency setting until we reach the threshold specified for the workers. Beyond that, we'd increase the number of pods directly (rough sketch below).
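To make that concrete, a non-normative sketch of the decision logic; the function names and the min/max bounds (which would presumably come from CRD fields) are assumptions, not part of the proposal:

```python
import math


def prefork_replicas(queue_length, concurrency, min_pods=1, max_pods=10):
    """One prefork pod works on roughly `concurrency` tasks at a time
    (itself bounded by the pod's CPU request), so replicas track the backlog."""
    wanted = math.ceil(queue_length / max(concurrency, 1))
    return max(min_pods, min(wanted, max_pods))


def greenlet_scale(queue_length, concurrency, max_concurrency, pods, max_pods=10):
    """For an I/O-bound eventlet/gevent pool: grow per-pod concurrency first,
    and only add pods once the configured per-pod ceiling is reached."""
    if concurrency < max_concurrency:
        new_concurrency = min(max_concurrency,
                              max(concurrency, queue_length // max(pods, 1)))
        return new_concurrency, pods
    wanted_pods = math.ceil(queue_length / max(max_concurrency, 1))
    return concurrency, max(pods, min(wanted_pods, max_pods))
```

For example, `prefork_replicas(queue_length=40, concurrency=8)` would ask for 5 pods.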
That said, I'd have to look into how the operator could take care of this custom logic alongside the KEDA/HPA scaling implementation. Also, I've not worked with Cilium yet; I'll explore it as well.
* We may wish to connect to various Celery signals to notify Kubernetes of certain events which we can handle.
Do you mean these signals?
For example, if we didn't execute a task for a certain amount of time and there are messages in the queue, we may wish to terminate the worker as it is malfunctioning. Another example is an internal error in Celery.
And yes, we should do this. I'll read more on all the signals and update the proposal with the ones we could handle. One additional thing I had in mind, apart from the examples you gave, is implementing a simple kind of backpressure protocol: if one of the components is overwhelmed, we emit certain events and propagate them back to the application or the starting point.
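As a starting point, a hedged sketch of what hooking a couple of those signals could look like. `notify_operator` is a made-up helper standing in for whatever mechanism we end up choosing (patching the pod's status, emitting a Kubernetes Event, etc.); only the signal names and their keyword arguments come from Celery itself:

```python
from celery import Celery
from celery.signals import task_failure, worker_shutting_down

app = Celery("proj")  # placeholder app


def notify_operator(event_type, **details):
    """Placeholder: forward the event to the operator, e.g. via the Kubernetes API."""
    print(f"[operator-event] {event_type}: {details}")


@task_failure.connect
def on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    # Repeated failures or internal errors could mark the worker pod as unhealthy.
    notify_operator("task-failure",
                    task=getattr(sender, "name", None),
                    task_id=task_id,
                    error=repr(exception))


@worker_shutting_down.connect
def on_worker_shutting_down(sig=None, how=None, exitcode=None, **kwargs):
    notify_operator("worker-shutdown", signal=sig, how=how, exitcode=exitcode)
```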
This requires a Python library.
Can you please elaborate a bit more on this?
* The CEP should address how the operator intends to install the message broker/results backend. This can possibly be done using the app's configuration. I think that this is a feature we want to implement at this stage for RabbitMQ and Redis. Support for other components can come on demand. We'll be doing this for Celery 5 anyway so it's worth investigating it now.
Yes, that makes sense. I did some exploration on how we can go forward with installing a Redis/RabbitMQ cluster once a Celery CR is created. The zero version of the solution could simply be to package simple YAML files, with very basic customization properties, along with the operator. Support for those properties could be added to the CRD in an optional section (rough sketch below).
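For discussion, that optional section could look something like the fragment below (again as a plain dict rather than YAML). Every field name here is hypothetical and only meant to show the shape, not a proposed schema:

```python
# Hypothetical optional section of the Celery CR spec for a bundled broker/backend.
celery_cr_spec_fragment = {
    "broker": {
        "manage": True,           # operator installs and owns the broker
        "kind": "redis",          # or "rabbitmq"
        "replicas": 1,
        "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
        "persistence": {"enabled": False},
    },
    "resultBackend": {
        "manage": False,          # user brings their own; operator only wires the URL
        "url": "redis://example-redis:6379/1",
    },
}
```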
A more scalable approach would be to use the standard Helm package manager. The general idea is to decouple the installation/management of Redis/RMQ from the handlers taking care of Celery workers.
I found actively maintained Helm charts by Bitnami for Redis and RMQ. However, the catch here is that the Python handler taking care of creation/updates needs to somehow tell Kubernetes to install the chart using Helm. One way I thought of is to also bundle a Helm operator with our operator; our handler then creates a resource that the Helm operator understands, and it takes care of the rest. I'd like to discuss the pros and cons of this approach. If someone has a better idea, I can explore it further.
* I'd like you to invest a little bit more time on coming up with new features. Your feedback is most valuable.
Yes! One additional thing the operator could do is handle dev, test, and production sandboxes for a given CR specification. We could use Kustomize for that.
On a side note, I personally (or my company) have not been a power user of Celery as such. We use really basic things, so my knowledge is limited. But I'm willing to explore and learn more on the Kubernetes & Celery fronts. I feel Celery is awesome and, at the same time, a huge project to understand the ins and outs of. Are there any places I can read more about how people use Celery in production?
For now, I'm going to go ahead and try to read the whole Celery documentation to filter out some use cases the operator could take care of.
Sorry, I couldn't be active this past month. I'm still working on my knowledge of distributed systems in general. @thedrow let me know your inputs for my comment above whenever you can and I'll update the proposal accordingly.
For some of our use cases, ephemeral storage is quite important too, not just CPU and memory, as some of our tasks make heavy use of this AWS feature, for example... So it is important to be flexible about which parameters can affect autoscaling.
On a side note - It would be nice if Celery had a cluster scaling API that we could "just" implement...
> On a side note - It would be nice if Celery had a cluster scaling API that we could "just" implement...
We're working on it. See #27.
@brainbreaker I was extremely busy with Celery 5 myself. Let's pick up the pace.
I think this PR is a fine first draft and we can merge it. Before we do that I'd like you to open an issue which details all our open items.
Created an issue with open points. We can continue the discussion there on those points. https://github.com/celery/ceps/issues/30
@thedrow A couple of things around next steps to start development -
I'm assuming we'll be making a new GitHub repository for the Celery operator under the celery org. How does that process work? I also have this prototype repo on my personal account and was planning to just extend that.
I've been through the contributing guide. However, let me know if there are any particular things you have in mind specific to this project's execution. As a first step, I can come up with a detailed, subtasked plan and approximate timelines.
We can move the prototype repo to the organization.
> We can move the prototype repo to the organization.
Great. I'll have to be part of the celery organization for that. I tried transferring ownership and the message was "You don’t have the permission to create public repositories on celery"
This PR adds a draft CEP for Celery Kubernetes Operator. For now, it is written keeping Celery 4.X in mind.
It is ready for the first round of reviews by maintainers.