Open singhsegv opened 1 month ago
@singhsegv I just want to ask you a couple of questions first. How did you deploy OpenWhisk-2.0.0? Are you able to invoke an action? What is your concurrency limit and userMemory assigned to invokers?
Hey @style95, thanks for taking this up.
> How did you deploy OpenWhisk-2.0.0? Are you able to invoke an action?

I haven't deployed it yet. I've been working with OpenWhisk 1.0.0, which is the version that comes with the https://github.com/apache/openwhisk-deploy-kube repository. I've raised an issue there to understand how to get OpenWhisk 2.0.0 up and running on k8s, where you gave some pointers: https://github.com/apache/openwhisk-deploy-kube/issues/781#issuecomment-2316569110 . I'll be working on getting this up in the meantime.
This question is actually part of my doubts: if I want to deploy OpenWhisk 2.0.0 in a multi-node setting in a scalable manner, how should I go about it? Is Ansible the way to do that, or is there some way to use Kubernetes for this?
> What is your concurrency limit and userMemory assigned to invokers?

My action concurrency limit is 1, and there are 3 invokers, each with ~20000m of memory. After checking the Grafana dashboard this seems fine, since each invoker showed 40 pods of 512mb memory each at peak.
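For reference, the knobs involved look roughly like this in the openwhisk-deploy-kube chart plus the wsk CLI; the key names are assumed from the 1.0.0-era values.yaml and may differ between chart versions, so treat this as a sketch:

```yaml
# Sketch only: invoker user memory as configured through the openwhisk-deploy-kube
# chart (check helm/openwhisk/values.yaml in your chart version for the exact keys).
whisk:
  containerPool:
    userMemory: "20480m"   # memory each invoker can hand out to user action containers

# Per-action limits are set on the action itself, e.g.:
#   wsk action update myAction --memory 512 --concurrency 1
```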
Some more updates from the benchmarking I did in the meantime:
- But when I ran the benchmark for another action B after A, the warm containers were not reused, even though A and B use the same runtime, the same amount of memory, and the same pip requirements.
So I think warm containers not being reused across different actions is what's causing my workflows to not scale. I saw invoker-tag-based scheduling in the docs, which could be a temporary fix for my use case, but that is in 2.0.0 and not 1.0.0.
My bigger concern is my limited understanding of warm-container reuse across different actions. Where can I get more information about this? Is this the intended way warm-container reuse is supposed to work?
You're experiencing the hot-spotting / container-swapping problem of the best-effort 1.0.0 scheduling algorithm. If your container pool is full and no warm containers exist for Action B, you need to evict an Action A container in order to cold start an Action B container. But I also want to clarify that you shouldn't expect containers to get reused for multiple actions. Once a container is bound to an action, it can only run executions of that action, even if it's using the same runtime and memory profile; there are many reasons for this, but the most important is security and data isolation.
You will find that performance should be significantly better on OpenWhisk 2.0 with the new scheduler for the traffic pattern you're trying to test.
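To make the hot-spotting behaviour described above concrete, here is a deliberately simplified Python sketch of a best-effort pool that only warm-starts containers bound to the same action and otherwise evicts an idle container to make room for a cold start. It is illustrative only, not the actual OpenWhisk ContainerPool code; all names in it are invented.

```python
# Illustrative-only sketch of a best-effort container pool: warm starts happen
# only for containers already bound to the same action; otherwise an idle
# container bound to another action is evicted to make room for a cold start.
# This is NOT OpenWhisk's actual ContainerPool implementation.

class SimplePool:
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.containers = []  # list of dicts: {"action": str, "memory_mb": int, "idle": bool}

    def schedule(self, action, memory_mb):
        # 1. Reuse a warm (idle) container only if it is bound to this exact action.
        for c in self.containers:
            if c["action"] == action and c["idle"]:
                c["idle"] = False
                return f"warm start: reused {action} container"

        # 2. Otherwise cold start; if the pool is full, evict an idle container
        #    bound to some other action (this is the A-vs-B swapping above).
        used = sum(c["memory_mb"] for c in self.containers)
        if used + memory_mb > self.capacity_mb:
            victim = next((c for c in self.containers if c["idle"]), None)
            if victim is None:
                return "no capacity: invocation has to wait"
            self.containers.remove(victim)  # the evicted container is destroyed, never rebound
        self.containers.append({"action": action, "memory_mb": memory_mb, "idle": False})
        return f"cold start: new {action} container"


if __name__ == "__main__":
    pool = SimplePool(capacity_mb=512)
    print(pool.schedule("A", 512))      # cold start for A
    pool.containers[0]["idle"] = True   # A's invocation finishes; container is now warm/idle
    print(pool.schedule("A", 512))      # warm start: same action reuses its container
    pool.containers[0]["idle"] = True
    print(pool.schedule("B", 512))      # pool is full, so A's warm container is evicted; B cold starts
```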
I think @bdoyle0182's comment hits the root of the confusion. OpenWhisk will never reuse a container that was running action A to now run action B, even if A and B are actions of the same user that use the same runtime+memory combination.
There is a related concept of a stem cell container (the default configuration is here: https://github.com/apache/openwhisk/blob/master/ansible/files/runtimes.json#L44-L56). If there is unused capacity, the system tries to hide container-creation latency by keeping a few unused containers for popular runtime+memory combinations up and running, into which it can inject the code for a function on its first invocation. But once the code is injected, these containers are bound to a specific function and will never be used for anything else.
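For anyone else reading, a stem cell entry in runtimes.json has roughly the following shape; this is abridged and from memory, so the linked lines are authoritative for the exact kinds and field names:

```json
{
  "kind": "nodejs:20",
  "default": true,
  "image": { "prefix": "openwhisk", "name": "action-nodejs-v20", "tag": "latest" },
  "stemCells": [
    { "initialCount": 2, "memory": "256 MB" }
  ]
}
```

With a configuration like this, the system keeps a couple of idle 256 MB Node.js containers around, and the first invocation of a new Node.js action with a matching memory limit can be injected into one of them instead of paying the full container-creation cost.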
Indeed @dgrove-oss, the explanation by @bdoyle0182 made the underlying problem with my benchmarking technique and warm-container reuse much clearer to me. Thanks a lot @bdoyle0182.
Circling back to my main question: is setting up OpenWhisk 2.0.0 on a Kubernetes cluster a good way forward for benchmarking? Or are there other well-tested, scalable ways to do a multi-node deployment of the whole stack? I have some experience with Ansible but haven't used it for multi-node clustering.
Since I've realized that OpenWhisk 2.0.0 comes with a lot of improvements and is worth spending time on, instead of writing hackish fixes into version 1 for my use cases, I am trying to get the Helm chart to support 2.0.0, as this should help others looking to run the latest version too.
Sorry to be slow; I thought I had responded but actually hadn't.
I'll have to defer to others (@style95 @bdoyle0182) to comment about how they are deploying OpenWhisk 2.0. From a community standpoint, I think it would be great if we could get the helm chart updated. It's unfortunately a couple of years out-of-date, so it may take some effort to update version dependencies, etc.
Yes, that’s my long-overdue task. I’ve just started looking into it but haven’t completed it yet. I see some areas for improvement, but I’m still struggling to set up my own Kubernetes environment first. Since I’m involved in this project purely out of personal interest, it’s taking more time than I expected. @singhsegv, if you could help update the chart, that would be great. I think I can assist with your work as well.
Hey @style95, I've started working on it. I'm planning to update and sanity-test the non-OpenWhisk images first (Redis, Kafka, etc.), then move on to the controller, invoker, and other OpenWhisk-related images. I'm a bit stuck with paper deadlines, but I expect to open a PR soon and then iterate on it.
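For what it's worth, the per-component overrides involved would look roughly like the snippet below. The key names are assumed from the current chart's values.yaml and the image tags are placeholders; on top of bumping images, the 2.0.0 scheduler will also need new templates for etcd and the scheduler component, which the chart doesn't have yet:

```yaml
# Sketch only: key names assumed from the chart's values.yaml; tags are placeholders.
controller:
  imageName: "openwhisk/controller"
  imageTag: "<2.0.0-tag>"
invoker:
  imageName: "openwhisk/invoker"
  imageTag: "<2.0.0-tag>"
# The 2.0.0 scheduler additionally relies on etcd and a separate scheduler
# deployment, so those would be new templates rather than simple value overrides.
```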
@singhsegv Great! Feel free to reach out to me on Slack.
I am very confused about how OpenWhisk 2.0.0 is meant to be deployed for a scalable benchmarking setup. I need some help from the maintainers to understand what I am missing, since I've spent a large amount of time on this and am still missing some key pieces.
Context
We are using OpenWhisk for a research project where workflows (sequential as well as fork/join) are to be deployed and benchmarked at 1/4/8 RPS, etc., for long periods of time. This is to compare private-cloud FaaS with public-cloud FaaS.
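As a side note on methodology: the fixed-RPS load we drive is conceptually just repeated calls to the standard invoke REST API. A minimal sketch of such a load generator is below; the API host, auth value, and action name are placeholders:

```python
# Minimal fixed-RPS load-generator sketch against the standard OpenWhisk invoke API.
# APIHOST, AUTH, and ACTION are placeholders for your own deployment.
import threading
import time

import requests

APIHOST = "https://your-openwhisk-apihost"   # placeholder
AUTH = "user:password"                       # placeholder: value of `wsk property get --auth`
ACTION = "benchmark-action"                  # placeholder action name
RPS = 4                                      # target invocations per second
DURATION_S = 60                              # how long to drive load

def invoke():
    user, pwd = AUTH.split(":", 1)
    url = f"{APIHOST}/api/v1/namespaces/_/actions/{ACTION}?blocking=false"
    # A non-blocking invoke returns 202 with an activationId we can inspect later.
    resp = requests.post(url, auth=(user, pwd), json={}, verify=False)
    print(resp.status_code, resp.json().get("activationId"))

start = time.time()
while time.time() - start < DURATION_S:
    tick = time.time()
    for _ in range(RPS):
        threading.Thread(target=invoke).start()  # fire-and-forget so the tick stays on schedule
    time.sleep(max(0.0, 1.0 - (time.time() - tick)))
```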
Current Infrastructure Setting
We have an in-house cluster with around 10 VMs running on different nodes, totalling 50 vCPUs and around 200 GB of memory. Since I am new to this, I initially followed https://github.com/apache/openwhisk-deploy-kube to deploy it and, along with OpenWhisk Composer, was able to get the workflows running with a lot of small fixes and changes.
Problems with Current Infrastructure
There are failures around the /init call, and I am unable to debug why that is happening.

Main Doubts about scaling
@style95 @dgrove-oss Since you both have been active in the community and have answered some of my previous queries, any help on this would be much appreciated.
We are planning to go all in with OpenWhisk for our research and to contribute some good changes back to the community relating to FaaS at the edge and improving communication times in FaaS. But since none of us has infrastructure as our strong suit, getting over these initial hiccups is becoming a blocker for us. So looking forward to some help, thanks :).