apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Query Processing Resiliency and Workload Isolation #8618

Open vvivekiyer opened 2 years ago

vvivekiyer commented 2 years ago

To make our query processing pipeline resilient to failures (like server slowness, crashes, etc.), we can make some improvements in the broker and server components. I am working on a doc that identifies and expands on these issues with design and implementation details. The identified improvements are called out below:

Server Selection

When a broker receives a query, it identifies the list of servers for each segment associated with the query. Among these candidate servers, one server per segment is chosen using a round-robin approach. This approach is lightweight and works for most cases. However, we can add some intelligence to this layer so that servers are picked more carefully. Some of the parameters that we could use are listed below (a rough sketch of such a selector follows the list):

  1. Server Latency
  2. Number of inflight queries to server
  3. Query cost
  4. Server Load
  5. Server capability (heterogeneous servers).
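
As a rough sketch only (hypothetical class and method names, not the actual Pinot routing code), a selector that weighs per-server latency and in-flight query count could look something like this:

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: tracks a moving latency average and an in-flight count per server,
// then picks the candidate with the lowest combined score instead of pure round-robin.
public class ScoreBasedServerSelector {
  private static final double ALPHA = 0.2; // EWMA smoothing factor (assumed value)

  private final ConcurrentHashMap<String, Double> ewmaLatencyMs = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, AtomicInteger> inflight = new ConcurrentHashMap<>();

  // Called when a server responds, to update the latency estimate and release the slot.
  public void recordResponse(String server, double latencyMs) {
    ewmaLatencyMs.merge(server, latencyMs, (old, cur) -> (1 - ALPHA) * old + ALPHA * cur);
    inflight.computeIfAbsent(server, s -> new AtomicInteger()).decrementAndGet();
  }

  // Picks one server per segment from the (non-empty) candidate replicas.
  public String pickServer(List<String> candidates) {
    String best = candidates.get(0);
    double bestScore = Double.MAX_VALUE;
    for (String server : candidates) {
      double latency = ewmaLatencyMs.getOrDefault(server, 1.0);
      int queued = inflight.computeIfAbsent(server, s -> new AtomicInteger()).get();
      // Score roughly approximates expected wait: queue depth times average latency.
      double score = (queued + 1) * latency;
      if (score < bestScore) {
        bestScore = score;
        best = server;
      }
    }
    inflight.get(best).incrementAndGet();
    return best;
  }
}
```

In practice the score function, smoothing factor, and tie-breaking would need tuning and validation against real workloads.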

Fairness in Query Scheduling (Resource isolation)

Currently, queries are not scheduled at either the broker or the server based on the weight of the query or a priority assigned to it. We can introduce a fair scheduling scheme at the broker and server level. Note that we have a PriorityScheduler that schedules queries based on priority, but it hasn't been tested. We could improve and test it before working on alternatives.

For example, consider three classes of queries:

  * C1: high-cost queries
  * C2: medium-cost queries
  * C3: low-cost queries

Broker: Let's say the broker receives 3 C1 queries, 3 C2 queries, and 3 C3 queries, and there are three available servers that can process these 9 queries. Ideally, we should distribute them to the servers as follows:

  Server 1 = {1 C1, 1 C2, 1 C3}
  Server 2 = {1 C1, 1 C2, 1 C3}
  Server 3 = {1 C1, 1 C2, 1 C3}
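
As a small sketch (hypothetical names, not existing Pinot code), the broker could round-robin within each cost class so that no single server accumulates all of the expensive queries:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only (not thread-safe): round-robins within each cost class so that
// no single server ends up with all of the high-cost (C1) queries.
public class PerClassRoundRobin {
  private final List<String> servers;
  private final Map<String, Integer> nextIndexPerClass = new HashMap<>();

  public PerClassRoundRobin(List<String> servers) {
    this.servers = new ArrayList<>(servers);
  }

  // costClass would be something like "C1", "C2", "C3".
  public String assign(String costClass) {
    int idx = nextIndexPerClass.merge(costClass, 1, Integer::sum) - 1;
    return servers.get(idx % servers.size());
  }
}
```

With three servers and three queries from each class, this yields exactly the per-server split shown above.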

This heavily overlaps with the "Server Selection" section (1) of the document. But it is added as a separate topic because this could be done at each server, as opposed to server selection, which would be done at the broker.

Query Pre-emption at Broker and Server

Currently, the broker pre-empts a query when the timeout is reached. Our pre-emption logic can be extended to both the broker and the servers to pre-empt based on a number of factors, including the following (a sketch of server-side deadline checking follows the list):

  1. Timeout - the broker currently has logic to return queries that have reached their timeout. However, this can be improved at the server.
  2. Memory consumption - to avoid out-of-memory errors. This could be a long-pole item because Java (AFAIK) does not provide a reliable way of measuring heap usage.
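
To illustrate the server-side timeout piece (item 1), a deadline check between units of work might look roughly like the sketch below; the class and exception names are hypothetical, not existing Pinot code:

```java
// Illustrative only: a server-side loop that checks the query deadline between
// units of work and aborts early instead of running to completion.
public class DeadlineAwareExecution {

  public static class QueryPreemptedException extends RuntimeException {
    public QueryPreemptedException(String message) {
      super(message);
    }
  }

  // deadlineMs would be derived from the broker-supplied query timeout.
  public void processSegments(Iterable<Runnable> segmentTasks, long deadlineMs) {
    for (Runnable task : segmentTasks) {
      if (System.currentTimeMillis() >= deadlineMs) {
        // Pre-empt: fail fast so the worker threads can be freed for other queries.
        throw new QueryPreemptedException("Query exceeded its deadline on the server");
      }
      task.run();
    }
  }
}
```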

Rate Limiting

Today, queries are rate-limited based only on QPS. We should improve rate-limiting to be based on a number of factors like load on the server, latency, etc. Rate-limiting (a QoS queue) can be done at the broker, the server, or both.
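
For illustration only, a limiter whose effective QPS shrinks as an external load signal rises could look roughly like this; the names and thresholds are assumptions, not an existing Pinot component:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a QPS limiter whose effective limit shrinks as an external
// load signal (e.g. observed server latency) rises.
public class LoadAwareRateLimiter {
  private final double baseQps;
  private volatile double loadFactor = 1.0; // 1.0 = healthy, >1.0 = loaded
  private final AtomicLong windowStartMs = new AtomicLong(System.currentTimeMillis());
  private final AtomicLong countInWindow = new AtomicLong();

  public LoadAwareRateLimiter(double baseQps) {
    this.baseQps = baseQps;
  }

  // Fed periodically from latency / load metrics gathered by the broker or server.
  public void updateLoadFactor(double factor) {
    this.loadFactor = Math.max(1.0, factor);
  }

  public boolean tryAcquire() {
    long now = System.currentTimeMillis();
    long start = windowStartMs.get();
    // Reset the one-second window (approximate; good enough for a sketch).
    if (now - start >= 1000 && windowStartMs.compareAndSet(start, now)) {
      countInWindow.set(0);
    }
    double effectiveQps = baseQps / loadFactor;
    return countInWindow.incrementAndGet() <= effectiveQps;
  }
}
```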

Server Circuit Breaker

We should take servers out of rotation (and adaptively bring them back) when they are failing (network partition, disk failure, connectivity failures). We currently have a framework built for this in #8490, but we could improve it by adding more trigger points for marking a server unhealthy.
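
For illustration, a per-server breaker that trips after consecutive failures and probes again after a cool-down might look like the sketch below; the thresholds and names are assumptions, and the actual framework from #8490 may differ:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: trips after N consecutive failures and probes the server again
// after a cool-down period, so an unhealthy server is taken out of and put back
// into the routing table adaptively.
public class ServerCircuitBreaker {
  private static final int FAILURE_THRESHOLD = 5;      // assumed value
  private static final long COOL_DOWN_MS = 30_000;      // assumed value

  private final AtomicInteger consecutiveFailures = new AtomicInteger();
  private volatile long trippedAtMs = -1;

  public void onSuccess() {
    consecutiveFailures.set(0);
    trippedAtMs = -1;
  }

  public void onFailure() {
    if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
      trippedAtMs = System.currentTimeMillis(); // (re-)open the breaker
    }
  }

  // Routing should skip this server while the breaker is open; after the cool-down,
  // a probe query is allowed through to test whether the server has recovered.
  public boolean allowRequest() {
    return trippedAtMs < 0 || System.currentTimeMillis() - trippedAtMs >= COOL_DOWN_MS;
  }
}
```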

Workload Management

The noisy-neighbor problem is a serious limitation in shared-tenant and dedicated-tenant Pinot setups. In a shared-tenant setup, expensive queries from one table can negatively impact the performance of other tables. Similarly, in a dedicated-tenant setup, expensive queries from one client/use-case can affect others within the same table. For example, in the T1 table, the latencies of latency-sensitive queries are degraded by the execution of a few, occasional TEXT_MATCH queries.

vvivekiyer commented 2 years ago

cc: @siddharthteotia @Jackie-Jiang @jasperjiaguo

siddharthteotia commented 2 years ago

@vvivekiyer has code changes / a prototype in progress, along with adding the concrete details for server selection to the design doc. Will share for review soon.

vvivekiyer commented 2 years ago

The plan is to address each section one by one. As the first step, I've expanded the "Server Selection" section with a focus on two parameters - "Server Latency" and "Number of inflight queries". Here is the design doc - https://docs.google.com/document/d/1w8YVpKIj0S62NvwDpf1HgruwxJYJ6ODuKQLjGXupH8w/edit?usp=sharing

Please review.

vvivekiyer commented 2 years ago

@Jackie-Jiang please review the design doc above.

mcvsubbu commented 2 years ago

Some thoughts.

The broker could keep track of a "health score" for servers and rank them. The score could include a variety of dimensions like:

The servers could also send a score of their own along with each response to the brokers after monitoring their current health. The brokers could include that information in deciding the "weight" of the server for inclusion in a query.

We will need to ensure that the thing does not oscillate wildly, with all brokers backing off for the same server, and then all of them subjecting the server to an onslaught thinking that it is lightly loaded.
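
To make the smoothing idea concrete, here is a rough sketch (not a concrete proposal or existing code) of combining a broker-observed score with a server-reported score using EWMA smoothing plus a little jitter to dampen oscillation; the weights and smoothing factor are assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: combines a broker-observed score with a server-reported score
// and smooths it over time so routing weights change gradually rather than flapping.
public class ServerHealthScore {
  private static final double SMOOTHING = 0.1; // assumed value

  private volatile double smoothedScore = 1.0; // 1.0 = fully healthy

  // observedScore: derived by the broker (latencies, errors, in-flight queries).
  // reportedScore: piggybacked by the server on each query response.
  public void update(double observedScore, double reportedScore) {
    double combined = 0.5 * observedScore + 0.5 * reportedScore;
    smoothedScore = (1 - SMOOTHING) * smoothedScore + SMOOTHING * combined;
  }

  // A small random jitter keeps every broker from backing off (or piling on)
  // the same server at exactly the same moment.
  public double routingWeight() {
    double jitter = ThreadLocalRandom.current().nextDouble(0.95, 1.05);
    return smoothedScore * jitter;
  }
}
```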

kishoreg commented 2 years ago

Thanks for putting this together. It's very detailed and well explained.

What I am not sure about is whether there is a need for this in Pinot in a real-life use case.

Traditionally, these strategies work well for task distribution, where a task can be executed on any of the worker nodes and there is huge variance among the tasks, something like one task taking 10 minutes vs. 1 minute vs. an hour. Note that I explicitly left out millisecond and second scales because when a task takes such a small amount of time, the feedback loop needs to be really fast.

In the case of Pinot, we don't really have a lot of choice, since each segment is already assigned to a set of nodes, mostly 3 for fault tolerance. There is not much we can do given that we have just 3 options for a given segment.

It would be great to see a real-life example where we could have seen either better latency or better throughput because of this change.

siddharthteotia commented 2 years ago

We have seen the following repeatedly in production:

We mitigated this by temporarily disabling/enabling the instance and forcing the broker to route queries to other replicas. That helped recover from latency spikes that were coming from the slow server. This has happened so many times during oncall, and almost always we circle back to the fact that if there had been single node/server resiliency, some of this could have been avoided, or maybe there would have been fewer alerts and less severe latency spikes.

We have also seen the following (this is the 2nd sub-problem outlined in the issue)

@vvivekiyer - please add other notes on the background of problem from the other concrete production examples.

The current design doc addresses the first problem as of now, and we have code / an implementation in progress that we are planning to test this week; we will also share the PR and discuss feedback on the approach.

PS - @vvivekiyer and I had sync'd up with @Jackie-Jiang sometime back to tell we are working on it.

kishoreg commented 2 years ago

@siddharthteotia I added some comments to the design doc. Overall, I feel that the design assumes that having this will result in better latency for the end user, but the data to back up that assumption is missing.

I don't remember seeing this problem at LinkedIn. Things might have changed, but my concerns expressed in the previous comment still remain the same. I am having a hard time thinking about the practical scenarios where this will be useful.

It will really help if we can see some real production graphs where we can clearly see that this solution will be applicable.

vvivekiyer commented 2 years ago

@kishoreg Thanks for the comments. I'll respond to the comments in the doc. I just want to reiterate the following - the motivation of this project is NOT to improve latency for users (at least when everything is working fine). It is to improve resiliency so that when some servers are affected (CPU/disk slowness, network issues, etc.), users don't see major degradation in performance.

siddharthteotia commented 2 years ago

+1 on what @vvivekiyer mentioned.

I discussed this offline with @kishoreg. I think Kishore thought that the proposed design for the server selection problem and the WIP PR are sort of set in stone. That is not the case. We are testing and validating the approach, and nothing is going to get merged unless we gain confidence in the implementation and get before-vs-after validation on a workload that reproduces the problem.

Having said that, the WIP PR is deployed on our internal perf cluster with prod data, and we are running validation, gathering data points, etc. So we will be sharing all of that.

Things are still up for discussion and brainstorming.

siddharthteotia commented 2 years ago

Update on the server selection subproblem: by next week, we will share some of the results from the POC code we have written, improved, and been testing on our internal perf cluster, where we have tried out a couple of prod use cases.

Along with that, we have also written a simulator in Python to model several different scenarios in theory and have cross-validated it against the new Pinot server selector code.

Will share the combined data points and PR

FYI @Jackie-Jiang

jonminter commented 2 years ago

So my coworker just pointed me to this issue. I'm an engineer at Stripe and we've just started experimenting with some similar ideas for making Pinot more reliable. One thing I had just implemented in our fork and was testing today was a rate limiter for the Pinot broker that measures observed query latency and estimates server queue sizes using the TCP Vegas congestion control algorithm. I'm using this rate limiter implementation from Netflix.
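
For readers unfamiliar with the approach, the core of a Vegas-style limit adjustment can be sketched as follows. This is a simplified illustration, not the Netflix concurrency-limits implementation referenced above, and the alpha/beta thresholds are assumptions:

```java
// Illustrative only: estimates queueing from the gap between the best-observed
// latency and the current latency, then grows or shrinks the concurrency limit.
public class VegasStyleLimit {
  private static final int ALPHA = 3; // grow the limit while estimated queue < ALPHA
  private static final int BETA = 6;  // shrink the limit once estimated queue > BETA

  private double limit = 20;
  private long minRttMs = Long.MAX_VALUE;

  public synchronized void onSample(long rttMs) {
    minRttMs = Math.min(minRttMs, rttMs);
    // Vegas queue estimate: how many requests appear to be waiting rather than being served.
    double queueEstimate = limit * (1.0 - (double) minRttMs / rttMs);
    if (queueEstimate < ALPHA) {
      limit += 1;                      // under-utilized: allow more concurrency
    } else if (queueEstimate > BETA) {
      limit = Math.max(1, limit - 1);  // queue building up: back off
    }
  }

  public synchronized int currentLimit() {
    return (int) limit;
  }
}
```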

I've only just started testing this, but I can provide some data once I've had a chance to do some more comprehensive tests. From initial tests it looks promising.

I had similar thoughts about measuring server latencies and implementing load-balanced routing, and it looks like you have already started implementing that, which is great to see! We have observed similar issues in our clusters with a server that has degraded performance affecting overall latency.

I also had some ideas about how to meter resource usage on the servers during rebalances, as we've noticed this can impact query performance as well.

jonminter commented 2 years ago

> Along with that, we have also written a simulator in Python to model several different scenarios in theory and have cross-validated it against the new Pinot server selector code.

Very curious to hear more about the simulator as well. Is this a simulation of a Pinot cluster? @siddharthteotia

vvivekiyer commented 2 years ago

> I had similar thoughts about measuring server latencies and implementing load-balanced routing, and it looks like you have already started implementing that, which is great to see! We have observed similar issues in our clusters with a server that has degraded performance affecting overall latency.

Thanks. Please feel free to look at the PR and provide your feedback. The results are documented here

> Very curious to hear more about the simulator as well. Is this a simulation of a Pinot cluster? @siddharthteotia

This is a simulation of how the Pinot brokers and servers would behave with the Adaptive Server Selection algorithm. The results of this simulation are documented here. @jasperjiaguo helped write this simulator and will be publishing it in a public GitHub repo sometime next week.

siddharthteotia commented 1 year ago

FYI @npawar