vvivekiyer opened this issue 2 years ago
cc: @siddharthteotia @Jackie-Jiang @jasperjiaguo
@vvivekiyer has code changes / a prototype in progress, along with adding concrete details to the design doc for server selection. Will share for review soon.
The plan is to address each section one by one. As the first step, I've expanded the "Server Selection" section with a focus on two parameters - "Server Latency" and "Number of inflight queries". Here is the design doc - https://docs.google.com/document/d/1w8YVpKIj0S62NvwDpf1HgruwxJYJ6ODuKQLjGXupH8w/edit?usp=sharing
Please review.
@Jackie-Jiang please review the design doc above.
Some thoughts.
The broker could keep track of a "health score" for servers and rank them. The score could include a variety of dimensions, such as observed query latency and the number of in-flight queries per server.
The servers could also send a score of their own along with each response to the brokers after monitoring their current health. The brokers could include that information in deciding the "weight" of the server for inclusion in a query.
We will need to ensure that the feedback loop does not oscillate wildly, with all brokers backing off from the same server and then all of them subjecting it to an onslaught of queries because they think it is lightly loaded.
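Purely for illustration, here is a minimal Java sketch (not a proposal from the design doc) of the kind of broker-side bookkeeping described above: an exponentially weighted latency score combined with an in-flight query count, plus a little random jitter so that all brokers do not converge on the same "best" server and stampede it. All class and field names here are made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical broker-side tracker: lower score == healthier server.
public class ServerHealthTracker {
  private static final double ALPHA = 0.2; // EWMA smoothing factor

  private final Map<String, Double> _ewmaLatencyMs = new ConcurrentHashMap<>();
  private final Map<String, Integer> _inFlight = new ConcurrentHashMap<>();

  public void recordDispatch(String server) {
    _inFlight.merge(server, 1, Integer::sum);
  }

  public void recordResponse(String server, double latencyMs) {
    _ewmaLatencyMs.merge(server, latencyMs,
        (oldVal, newVal) -> ALPHA * newVal + (1 - ALPHA) * oldVal);
    _inFlight.merge(server, -1, Integer::sum);
  }

  // Combine latency and in-flight queries; the small random jitter keeps all
  // brokers from converging on (and then stampeding) the same server.
  public double score(String server) {
    double latency = _ewmaLatencyMs.getOrDefault(server, 1.0);
    int inFlight = _inFlight.getOrDefault(server, 0);
    double jitter = 1.0 + ThreadLocalRandom.current().nextDouble(0.05);
    return latency * (1 + inFlight) * jitter;
  }
}
```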
Thanks for putting this together. It's very detailed and well explained.
What I am not sure about is whether there is a need for this in Pinot in a real-life use case.
Traditionally, these strategies work well for task distribution where a task can be executed on any of the worker nodes and there is huge variance among the tasks: something like one task taking 10 minutes vs. 1 minute vs. an hour, etc. Note that I explicitly left out milliseconds and seconds because when a task takes such a small amount of time, the feedback loop needs to be really fast.
In the case of Pinot, we don't really have a lot of choice since each segment is already assigned to a set of nodes (typically 3) for fault tolerance. There is not much we can do given that we have just 3 options for a given segment.
It will be great to see a real-life example where we would have seen either better latency or better throughput because of this change.
We have seen the following repeatedly in production: a single slow or degraded server causing latency spikes for the queries routed to it. We mitigated this by temporarily disabling/enabling the instance and forcing the broker to route queries to other replicas, which helped recover from the latency spikes coming from the slow server. This has happened so many times during oncall, and almost always we circle back to the fact that if there was single node/server resiliency, some of these incidents could have been avoided, or at least there would have been fewer alerts and less severe latency spikes.
We have also seen the following (this is the 2nd sub-problem outlined in the issue)
@vvivekiyer - please add other notes on the background of problem from the other concrete production examples.
The current design doc addresses the first problem, and we have an implementation in progress that we are planning to test this week. We will also share the PR and discuss feedback on the approach.
PS - @vvivekiyer and I had synced up with @Jackie-Jiang some time back to let him know we are working on it.
@siddharthteotia I added some comments to the design doc. Overall, I feel that the design assumes that having this will result in better latency for the end user, but the data to back up that assumption is missing.
I don't remember seeing this problem at LinkedIn. Things might have changed, but the concerns I expressed in the previous comment remain the same. I am having a hard time thinking of practical scenarios where this will be useful.
It will really help if we can see some real production graphs where we can clearly see that this solution will be applicable.
@kishoreg Thanks for the comments. I'll respond to the comments in the doc. I just want to reiterate the following - the motivation of this project is NOT to improve latency for users (at least when everything is working fine). It is to improve resiliency so that when some servers are affected (CPU/disk slowness, network issues, etc.), users don't see a major degradation in performance.
+1 on what @vvivekiyer mentioned.
I discussed this offline with @kishoreg . I think Kishore thought that the proposed design for the server selection problem and the WIP PR are sort of set in stone. That is not the case. We are testing and validating the approach, and nothing is going to get merged unless we gain confidence in the implementation and validate a before vs. after comparison on a workload that reproduces the problem.
Having said that, the WIP PR is deployed on our internal perf cluster with prod data, and we are running validation, gathering data points, etc. We will be sharing all of that.
Things are still up for discussion and brainstorming.
Update on the server selection subproblem -- by next week, we will share some results from the POC code we have written, improved, and been testing on our internal perf cluster, including a couple of prod use cases we tried out.
Along with that, we have also written a simulator in Python to model several different scenarios in theory and cross-validated it against the new Pinot server selector code.
Will share the combined data points and PR
FYI @Jackie-Jiang
So my coworker just pointed me to this issue. I'm an engineer at Stripe and we've just started experimenting with some similar ideas for making Pinot more reliable. One thing I had just implemented in our fork and was testing today was a rate limiter for the Pinot broker that measures observed query latency and estimates server queue sizes using the TCP Vegas congestion control algorithm. I'm using this rate limiter implementation from Netflix.
I've only just started testing this, but I can provide some data once I've had a chance to do some more comprehensive tests. From initial tests, it looks promising.
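For anyone who wants to see the core of the Vegas idea mentioned above, here is a rough, self-contained sketch of the queue-size estimate that Vegas-style limiters use. This is not the Netflix library's API and not the fork described above, just the formula for illustration; the initial limit and the grow/shrink thresholds are arbitrary assumptions.

```java
// Illustrative only: the core of a Vegas-style adaptive concurrency limit.
// queueEstimate = inFlight * (1 - minRtt / sampleRtt); grow the limit while the
// estimated server queue is small, shrink it when the queue builds up.
public class VegasStyleLimit {
  private double _limit = 20;                   // current concurrency limit (initial guess)
  private double _minRttMs = Double.MAX_VALUE;  // best observed latency (no-load RTT)

  public synchronized void onSample(double rttMs, int inFlight) {
    _minRttMs = Math.min(_minRttMs, rttMs);
    double queueEstimate = inFlight * (1 - _minRttMs / rttMs);
    if (queueEstimate < 3) {
      _limit += 1;                              // under-utilized: probe for more capacity
    } else if (queueEstimate > 6) {
      _limit = Math.max(1, _limit - 1);         // queue building up: back off
    }
  }

  public synchronized int getLimit() {
    return (int) _limit;
  }
}
```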
I had similar thoughts about measuring server latencies and implementing load-balanced routing, and it looks like you have already started implementing that, which is great to see! We have observed similar issues in our clusters where a server with degraded performance affects overall latency.
I also had some ideas about how to meter resource usage on the servers during rebalances, as we've noticed this can impact query performance as well.
Along with that, we have also written a simulator in Python to model several different scenarios in theory and cross-validated it against the new Pinot server selector code.
Very curious to hear more about the simulator as well. Is this a simulation of a Pinot cluster? @siddharthteotia
I had similar thoughts about measuring server latencies and implementing load-balanced routing, and it looks like you have already started implementing that, which is great to see! We have observed similar issues in our clusters where a server with degraded performance affects overall latency.
Thanks. Please feel free to look at the PR and provide your feedback. The results are documented here
Very curious to hear more about the simulator as well. Is this a simulation of a Pinot cluster? @siddharthteotia
This is a simulation of how the Pinot brokers and servers would behave with the Adaptive Server Selection algorithm. The results of this simulation are documented here. @jasperjiaguo helped write this simulator and will be publishing it in a public GitHub repo sometime next week.
FYI @npawar
To make our query processing pipeline resilient to failures (like server slowness, crashes, etc.), we can make some improvements in the broker and server components. I am working on a doc to identify and expand on these areas with design and implementation details. The identified improvements are called out below:
Server Selection
When a broker receives a query, it identifies the list of servers hosting each segment associated with the query. Among these candidates, one server per segment is chosen using a round-robin approach. This approach is lightweight and works for most cases. However, we can make this layer smarter about picking servers. Some of the parameters we could use are server latency and the number of in-flight queries on each server.
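To make the idea concrete, here is a hedged sketch of what selection among a segment's replicas could look like once each server has a score (for example, derived from its latency and in-flight query count). This is only an illustration, not the design in the doc or the WIP PR, and all names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical: pick one server per segment from its (usually 3) replicas by
// preferring the replica with the lowest score (e.g. a latency / in-flight
// score) instead of rotating round-robin blindly.
public class AdaptiveReplicaSelector {
  private final ToDoubleFunction<String> _scoreFn; // lower score == healthier

  public AdaptiveReplicaSelector(ToDoubleFunction<String> scoreFn) {
    _scoreFn = scoreFn;
  }

  public String select(List<String> replicasForSegment) {
    return replicasForSegment.stream()
        .min(Comparator.comparingDouble(_scoreFn))
        .orElseThrow(() -> new IllegalStateException("No replicas available"));
  }
}
```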
Fairness in Query Scheduling (Resource isolation)
Currently, there is no scheduling of queries at either the broker or the server based on the weight of a query or the priority assigned to it. We can introduce a fair scheduling scheme at the broker and server levels. Note that we have a PriorityScheduler that schedules queries based on priority, but it hasn't been tested. We could improve and test it before working on alternatives.
For example, consider three classes of queries: C1 (high-cost queries), C2 (medium-cost queries), and C3 (low-cost queries).
Broker: Let's say the broker receives 3 C1 queries, 3 C2 queries, and 3 C3 queries, and there are three available servers that can process these 9 queries. Ideally, we should distribute them to the servers as follows:
Server 1 = {1 C1, 1 C2, 1 C3}
Server 2 = {1 C1, 1 C2, 1 C3}
Server 3 = {1 C1, 1 C2, 1 C3}
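As an illustration of the distribution above, a trivial per-class round-robin assigner (hypothetical names; this is not the PriorityScheduler mentioned earlier) would give each server an even mix of C1/C2/C3 queries:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: round-robin within each cost class (C1/C2/C3) so every
// server receives an even mix of heavy, medium, and light queries.
public class PerClassRoundRobin {
  private final List<String> _servers;
  private final Map<String, AtomicInteger> _counterPerClass = new HashMap<>();

  public PerClassRoundRobin(List<String> servers) {
    _servers = servers;
  }

  public String assign(String costClass) {
    AtomicInteger counter =
        _counterPerClass.computeIfAbsent(costClass, k -> new AtomicInteger());
    int idx = Math.floorMod(counter.getAndIncrement(), _servers.size());
    return _servers.get(idx);
  }
}
```

With servers {S1, S2, S3} and 3 queries of each class, every server ends up with exactly {1 C1, 1 C2, 1 C3}, matching the distribution above.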
This heavily overlaps with the "Server Selection" section (1) of the document, but I am adding it as a separate topic because this could be done at each server, as opposed to server selection, which is done at the broker.
Query Pre-emption at Broker and Server
Currently, the broker pre-empts a query only when the timeout is reached. Our pre-emption logic can be extended to both the broker and the servers to pre-empt based on a number of additional factors.
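As a toy example only, a pre-emption check could combine the existing timeout with additional signals. The per-query memory budget below is an assumed, illustrative signal, not something from the design.

```java
// Toy pre-emption check: cancel a query when it exceeds its deadline or,
// as an illustrative extra factor, an assumed per-query memory budget.
public class PreemptionPolicy {
  private final long _timeoutMs;
  private final long _memoryBudgetBytes; // hypothetical extra signal

  public PreemptionPolicy(long timeoutMs, long memoryBudgetBytes) {
    _timeoutMs = timeoutMs;
    _memoryBudgetBytes = memoryBudgetBytes;
  }

  public boolean shouldPreempt(long startTimeMs, long bytesAllocated) {
    boolean timedOut = System.currentTimeMillis() - startTimeMs > _timeoutMs;
    boolean overBudget = bytesAllocated > _memoryBudgetBytes;
    return timedOut || overBudget;
  }
}
```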
Rate Limiting
Today, queries are rate-limited based only on QPS. We should improve rate limiting to take into account a number of factors like server load, latency, etc. Rate limiting (a QoS queue) can be done at the broker, the server, or both.
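For illustration, here is a sketch of a rate limiter whose allowed QPS adapts to observed latency relative to a target, instead of using a fixed QPS cap. The thresholds and multipliers are arbitrary assumptions, not part of any proposal.

```java
// Illustrative only: adapt the allowed QPS based on observed latency relative
// to a latency target, instead of enforcing a fixed QPS cap.
public class AdaptiveQpsLimiter {
  private final double _targetLatencyMs;
  private double _allowedQps;

  public AdaptiveQpsLimiter(double targetLatencyMs, double initialQps) {
    _targetLatencyMs = targetLatencyMs;
    _allowedQps = initialQps;
  }

  public synchronized void onLatencySample(double observedLatencyMs) {
    if (observedLatencyMs > _targetLatencyMs) {
      _allowedQps = Math.max(1, _allowedQps * 0.9); // shed load when slow
    } else {
      _allowedQps = _allowedQps * 1.02;             // slowly recover capacity
    }
  }

  public synchronized double getAllowedQps() {
    return _allowedQps;
  }
}
```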
Server Circuit Breaker
We should take servers out of rotation (and adaptively bring them back) when they are failing (network partitions, disk failures, connectivity failures). We currently have a framework for this in #8490, but we could improve it by adding more trigger points for marking a server unhealthy.
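As a rough sketch of the "take out of rotation and adaptively bring back" behavior (this is not the framework in #8490), a simple per-server circuit breaker might look like the following; the failure threshold and cool-down period are illustrative.

```java
// Hypothetical circuit breaker: take a server out of rotation after consecutive
// failures and probe it again after a cool-down period.
public class ServerCircuitBreaker {
  private static final int FAILURE_THRESHOLD = 5;
  private static final long COOL_DOWN_MS = 30_000;

  private int _consecutiveFailures = 0;
  private long _openedAtMs = -1;

  public synchronized void onResult(boolean success) {
    if (success) {
      _consecutiveFailures = 0;
      _openedAtMs = -1;
    } else if (++_consecutiveFailures >= FAILURE_THRESHOLD && _openedAtMs < 0) {
      _openedAtMs = System.currentTimeMillis();
    }
  }

  // A server is routable if the breaker is closed, or if the cool-down has
  // elapsed and we are willing to probe it with a trial query.
  public synchronized boolean isRoutable() {
    return _openedAtMs < 0
        || System.currentTimeMillis() - _openedAtMs > COOL_DOWN_MS;
  }
}
```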
Workload Management
The noisy-neighbor problem is a serious limitation in both shared-tenant and dedicated-tenant Pinot setups. In a shared-tenant setup, expensive queries against one table can negatively impact the performance of other tables. Similarly, in a dedicated-tenant setup, expensive queries from one client/use case can affect others within the same table. For example, in the T1 table, the latencies of highly latency-sensitive queries are degraded by the execution of a few occasional TEXT_MATCH queries.