apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

[New Scheduler] Latency vs. Resources Tradeoff Discussion #5256

Open bdoyle0182 opened 2 years ago

bdoyle0182 commented 2 years ago

This discussion originated on slack. I'm moving here for more formal discussion from the community on the topic.

Original post:

@bdoyle0182: Just wanted to open a discussion about the new scheduler. It's heavily optimized towards short-running requests, i.e. a few milliseconds. However, for very long-running functions, i.e. 10+ seconds, it scales out very quickly, since the container throughput calculation concludes it needs a new container for each added level of concurrency. This is obviously a very small fraction of FaaS use cases, but it's supported nonetheless. These use cases are much more async than the normal pattern of sync responses expected within a reasonable HTTP request time of a few milliseconds, so some latency while waiting for available capacity should be much more acceptable. For example, if a function takes 10 seconds to run, its user won't really care if it has to wait 2-3 seconds for available capacity, and both the namespace owner and the operator would likely prefer that latency over uncontrolled fan-out of concurrency. The problem, imo, is that the activation staleness value is constant for all function types (currently 100ms). 100ms definitely makes sense for anything that runs within a second, but could we make this value dynamic based on the average duration of that function? Or am I on the right track here on how we could control fan-out of long-running functions and prefer latency over fan-out?

@style95: Yes, it's worth discussing, and what you have said is correct. When designing the new scheduler, we prioritized latency over resources. That was based on the thought that public clouds like AWS try to minimize latency no matter which type of function is running, and we also wanted to reduce latency as much as we could. But it can lead to too many containers being provisioned at once, and it caused some trouble in our environment too when there were not many invoker nodes. This issue especially sticks out for long-running actions, since container provisioning generally takes more than 100ms. So even while more containers are being provisioned, messages easily become stale: all running containers are already handling activations that take more than 100ms, and container provisioning also takes more than 100ms, so activations in the queue generally wait for more than 100ms. One guard here is that the scheduler does not provision more containers than the number of messages. So when there are 4 waiting messages, it creates at most 4 containers. But if the concurrency limit is big (which is common for public clouds) and a huge number of messages are incoming, it will try to create a huge number of containers at once.

We need more ways to do fine-grained control of provisioning.

@bdoyle0182: On "this issue especially sticks out for long-running actions as container provisioning generally takes more than 100ms": yes, this is exactly what I'm finding. Container provisioning takes anywhere from 500ms to 2 seconds, so when the wait time is 100ms the fan-out of containers can be particularly bad: the scheduler checks every 100ms and provisions more each time, and no activations complete for a couple of seconds.

And creating a huge number of containers at once can slow down the Docker daemon, making provisioning even slower (though with the new scheduler, container provisioning is balanced across hosts, unlike the old scheduler, which is just one of many huge wins for keeping the Docker daemon under control :slightly_smiling_face:).

@style95: Yes. My naive thought is that we need to control the number of concurrent provisioning operations. If it does not impact the whole system, we can still provision many containers for an action. But if the scheduler tries to create so many containers that it is expected to cause issues for the whole system, we can throttle them. I haven't thought this through deeply yet.

@rabbah: How do things look for functions that run for minutes? I will check out the discussion on GitHub. I'm curious whether there should be multiple schedulers that can be tailored to the function modality.

bdoyle0182 commented 2 years ago

@rabbah things fan out dramatically for functions that run for minutes. Long-running functions make intra-container concurrency seem of the utmost importance to me; otherwise you eat away at the memory pool very quickly, where much of the overhead is the fixed cost of running the server / container itself. My guess is that in the vast majority of cases the increase in memory usage per additional activation within a single container is negligible. However, right now only nodejs supports intra-container concurrency, and it requires the user to design their function around it very carefully and to do benchmarking.

@style95: On the topic of 100ms for activation staleness, does that make sense as the default? Cold starts in any system, as far as I know, will not be 100ms; 500ms-1s is more the norm in real-world use, with a bunch of variables that could push it even higher. If the latency added to create a new container is far more than 100ms, does it make sense to wait only 100ms before deciding to spawn a new container? The latency is then 100ms + cold start time. Of course the tradeoff is that if you increased it to, say, 1s, the latency when a cold start is still needed becomes 1s + cold start time, so I'm not sure which is better.

style95 commented 2 years ago

Let me share the current behavior and my opinion.

The new scheduler is designed on the assumption that latency matters most. Some of our downstream users were sensitive to latency; some of them were concerned even about hundreds of milliseconds of wait time, and we didn't want to accept a few seconds of wait time.

Based on this idea, let me share how the new scheduler works. First, the scheduler looks up the average duration of the given action. Once the action has been invoked at least once, there are activations and we can figure out the average duration. This is handled by ElasticSearchDurationChecker. When a memory queue starts up, it tries to get the average duration. Once we have the duration, we can estimate the processing power of one container for the given action.

For example, if the duration is 10ms, one container can theoretically handle 100 activations per second, while for an action with a 1s duration it will handle only 1 activation per second. From this we can easily calculate the required number of containers.
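To make the arithmetic concrete, here is a small illustrative snippet (purely a sketch with made-up names, not the scheduler's actual code):

    // Illustrative only: per-container throughput falls out of the average duration.
    object ThroughputExample extends App {
      val windowMs = 1000.0 // look at one second of traffic

      def activationsPerSecond(avgDurationMs: Double): Double = windowMs / avgDurationMs

      println(activationsPerSecond(10))   // 100.0 -> one container handles ~100 activations/s
      println(activationsPerSecond(1000)) // 1.0   -> one container handles ~1 activation/s

      // With a given arrival rate, the required container count is roughly:
      def requiredContainers(arrivalsPerSecond: Double, avgDurationMs: Double): Int =
        math.ceil(arrivalsPerSecond / activationsPerSecond(avgDurationMs)).toInt

      println(requiredContainers(200, 10))   // 2 containers for a 10ms action
      println(requiredContainers(200, 1000)) // 200 containers for a 1s action
    }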

If an action has never been invoked, we can't figure out the average duration, so the scheduler will just create one container. If the action finishes quickly, then we can figure out the average duration and it works as described above. On the other hand, if it takes more than the scheduling interval (100ms) to finish, then the duration of the action is at least bigger than 100ms; it could be 1s ~ 10s. Since we have no idea yet, the scheduler will add the same number of containers as the number of stale activations in the queue. This is where staleness is introduced.

One more thing to consider is that even for short-running actions, some activations can become stale. We may have properly calculated the required number of containers, but durations vary, so some messages can become stale even while containers are running. That means the existing containers are not enough to handle the existing activations. Say 10 activations arrive every 100 milliseconds and the existing containers can only handle 7 activations in 100ms; then we need to add more containers. So we calculate the required number of additional containers based on the number of stale activations and the average duration:

    val containerThroughput = StaleThreshold / duration
    val num = ceiling(availableMsg.toDouble / containerThroughput)

Also, if the calculated num is 5 while there are only 3 activations in the queue, we don't need to add 5 containers, as 2 of them would be idle, so we only add 3 containers. And because this case can happen repeatedly (container creation generally takes more than 100ms), we also take the number of in-progress (currently being created) containers into account:

    val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress
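Putting the two formulas together, a self-contained sketch might look like the following; this is only an illustration with simplified names and types, not the actual SchedulingDecisionMaker code, and `staleThresholdMs` stands in for the 100ms scheduling interval:

    // Illustrative sketch of the add-container decision.
    object AddContainerSketch extends App {
      val staleThresholdMs = 100.0 // staleness threshold / scheduling interval

      def containersToAdd(avgDurationMs: Double, availableMsg: Int, inProgress: Int): Int = {
        // How many activations one container can finish within one staleness window.
        val containerThroughput = staleThresholdMs / avgDurationMs
        // Containers needed to drain the waiting messages within that window.
        val num = math.ceil(availableMsg / containerThroughput).toInt
        // Never create more containers than waiting messages, and subtract
        // containers that are already being provisioned.
        math.max(0, math.min(num, availableMsg) - inProgress)
      }

      // Example from above: the existing containers fall 3 activations short per
      // 100ms window, so 3 messages are stale and nothing is being provisioned yet.
      println(containersToAdd(avgDurationMs = 100, availableMsg = 3, inProgress = 0)) // 3
    }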

This is basically how the new scheduler works.

Now the issue is that for a long-running action such as one taking 10s, its per-container throughput is 0.01 (100ms / 10s), so the scheduler will try to create the same number of containers as the number of activations. When 10 activations come, it will try to create 10 containers to handle them; when 100 activations come, it will create 100 containers (though at the beginning it will only add one initial container). Since this could end up consuming all resources, we have to throttle it with the namespace limit. If the namespace limit is 30, then only 30 containers will be created and 70 activations will wait in the queue. Only after 40 seconds, i.e. 4 rounds, will all activations have been handled.
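For that example, the arithmetic works out as follows (an illustrative snippet, not scheduler code):

    // Illustrative: how long a backlog drains when container creation is capped by
    // the namespace limit and each activation runs for the full action duration.
    object BacklogExample extends App {
      val activations    = 100
      val namespaceLimit = 30 // at most 30 concurrent containers
      val durationSec    = 10 // each activation runs for 10 seconds

      val rounds   = math.ceil(activations.toDouble / namespaceLimit).toInt // 4 rounds
      val drainSec = rounds * durationSec                                   // 40 seconds

      println(s"$rounds rounds, all activations handled after ${drainSec}s")
    }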

This is just an example, but we thought 40s of wait time was too much, and we wanted to minimize wait time no matter which kind of action is running. The downside is that this can create a huge number of containers within a short period of time, which can overload the Docker engine or the K8S API server. Also, if one action spawns a huge number of containers, it affects other actions too, as the Docker engine will be busy creating them.

Regarding the idea of increasing the staleness threshold, I am not sure. Some users may still want a short wait time even if their actions are long-running. Maybe we can introduce another throttle for container creation, and it should take fairness among actions into account. Also, on the invoker side, it should create containers in batches, with a limit on the number of containers in each batch. (The Docker client already has such batching, but the K8S client doesn't.)
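As a rough illustration of the kind of per-batch limit being suggested (a sketch only; the names are hypothetical and the real Docker/K8S client code paths in OpenWhisk look different):

    // Hypothetical sketch: group container creation requests and cap each batch.
    object BatchedCreationSketch extends App {
      final case class CreateRequest(action: String)

      val maxPerBatch = 10

      def createBatched(requests: Seq[CreateRequest])(create: CreateRequest => Unit): Unit =
        requests.grouped(maxPerBatch).foreach { batch =>
          // Each batch is issued together; a real implementation would also wait for
          // the batch to settle (or rate-limit) before issuing the next one.
          batch.foreach(create)
        }

      createBatched((1 to 25).map(i => CreateRequest(s"action-$i"))) { req =>
        println(s"creating container for ${req.action}")
      }
    }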

And it would be great for OW operators to be able to control the aggressiveness of the scheduler (whether to create containers more or less aggressively).

bdoyle0182 commented 2 years ago

Also, just fyi, I'm working on a change to the scheduling decision maker for the initial case where an action hasn't been seen yet (or, with the no-op duration checker, where the action hasn't been seen in a while and the queue is stopped). It adds a ratio controlling how many containers to bring up, based on the number of stale activations, while waiting for the first activation to return.

So instead of,

          case (Running, None) if staleActivationNum > 0 =>
            // we can safely get the value as we already checked the existence
            val num = ceiling(staleActivationNum - inProgress)

it's now the following, where the default ratio is 1.0 to keep the current behavior, but it can be lowered to decrease aggressiveness (a small sketch of the ratio's effect follows below):

            // we can safely get the value as we already checked the existence
            val num = ceiling(schedulingConfig.initialStaleProvisionRatio * (staleActivationNum - inProgress))

Though, since the checkInterval is 100ms, this check will fire many times and keep spinning up more containers while waiting for the first activation response when an activation takes 1-2 seconds to respond, so the only way for me to test this right now is to raise the checkInterval to a few seconds.
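To illustrate the effect of such a ratio with made-up numbers (only the idea of `initialStaleProvisionRatio` comes from the change above; everything else here is hypothetical):

    // Illustrative only: a ratio below 1.0 spreads initial container creation over
    // several checks instead of creating one container per stale activation in the
    // very first check; 1.0 keeps today's behavior.
    object InitialProvisionSketch extends App {
      val staleActivations = 8
      val ratio            = 0.5 // hypothetical initialStaleProvisionRatio value

      var inProgress = 0
      (1 to 5).foreach { check =>
        val num = math.max(0, math.ceil(ratio * (staleActivations - inProgress)).toInt)
        inProgress += num
        println(s"check $check: created $num containers (in progress: $inProgress)")
      }
      // Prints 4, 2, 1, 1, 0: creation ramps up over a few checks instead of all at once.
    }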

That's just one aspect of this topic, but I thought I'd point it out. I'll come back to your comment separately.