As the Double 11 (November 11) shopping festival approached, almost all business units carried out load testing and drills on their services in advance to prepare for the related promotional activities, and adjusted their capacity based on the drill results. In addition, this year the group's product launch event was scheduled close to Double 11, which drew even more attention to the event. As a result, larger buffers were allocated and the infrastructure was expanded for a safer and more substantial capacity increase. Everything was in place, waiting for the launch event to begin.
After thorough preparation, the entire group entered the live broadcast segment of the launch event. During the live sales and promotion phase, traffic surged, data volume rose sharply, and error messages began to grow exponentially. oz-hera, the group's primary tool for monitoring application performance, started showing abnormal service metrics. As traffic poured in, the number of error records kept climbing, and the amount of error data being queried grew rapidly. Because the Prometheus API does not support pagination or limit restrictions, the backend attempted to fetch all metric data at once, causing a data overload that exceeded the page's loading capacity and ultimately crashed the page.
Investigation Data:
Upon investigation, it was found that pagination was missing, the backend imposed no limit on the overall data size, and the frontend had no segmented-loading design. As a result, a single request could load an excessive amount of data, ranging from 6,000 to 10,000+ records, and a rough estimate suggests the data sent to the frontend could reach gigabytes in size, ultimately causing long waiting times and page crashes. Conclusion: the page crash during queries was caused mainly by the missing pagination, the absence of a backend limit on the overall result size, and the frontend loading all of the data in a single request.
Based on the analysis in section III, the goal is to improve the speed of metric queries and reduce the size of query results, which will effectively improve page loading speed.
Introduction of the Problematic Metric: To improve query speed, let's first look at the problematic metric, "dubboProviderSLA", which records the volume of abnormal data in server-side dubbo service calls. To understand the purpose of this metric, note that the service quality of a dubbo service cannot be measured from server-side errors alone: under high load, client requests may fail before they ever reach the server, or they may reach the server but time out before a response is received. In such cases the server's own metrics are not enough to reflect the actual problem, so a metric is needed that measures the availability of dubbo services from the client's perspective. We define this metric as "dubboProviderSLA".
The characteristics of "dubboProviderSLA" are as follows: as an objective measure of dubbo service availability from the client's perspective, "dubboProviderSLA" naturally needs to record client-related information, such as the client application's ID, name, environment, and the client instance IP.
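For illustration only, here is a minimal sketch of how such an error counter could carry those client-side labels, assuming the standard Prometheus Java client (simpleclient); oz-hera's actual instrumentation API may differ, and the label names simply mirror those that appear in the queries later in this section:

import io.prometheus.client.Counter;

public class DubboProviderSlaMetrics {

    // Hypothetical counter; every distinct combination of these label values becomes a separate time series.
    private static final Counter DUBBO_PROVIDER_SLA_ERROR = Counter.build()
            .name("dubboProviderSLAError")
            .help("Failed dubbo calls observed from the client side")
            .labelNames("application", "serviceName", "methodName",
                        "serverEnv", "serverZone",
                        "clientProjectId", "clientProjectName", "clientEnv", "clientIp")
            .register();

    // Record one failed call as seen by a specific client instance.
    public static void recordError(String application, String serviceName, String methodName,
                                   String serverEnv, String serverZone,
                                   String clientProjectId, String clientProjectName,
                                   String clientEnv, String clientIp) {
        DUBBO_PROVIDER_SLA_ERROR
                .labels(application, serviceName, methodName, serverEnv, serverZone,
                        clientProjectId, clientProjectName, clientEnv, clientIp)
                .inc();
    }
}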
Root Cause Analysis of the Problem: Under normal circumstances, if the server has errors in 5 methods, and the service is a commonly used foundational service called by 100 client services, each with 50 instances, then based on Prometheus's label data format the total comes to 5 × 100 × 50 = 25,000 records. During major promotional events, however, the dominant factor driving data growth is the number of service instances, i.e. the client IPs. To withstand the surge of external traffic during these peaks, services are frequently scaled out to dozens or even more than 100 instances, multiplying the data volume accordingly.
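A quick back-of-the-envelope calculation of the series count under the assumptions above, and how it grows when client instances are scaled out during a promotion (the peak instance count below is an assumed figure):

public class SeriesCountEstimate {
    public static void main(String[] args) {
        int errorMethods = 5;      // server methods reporting errors
        int clientServices = 100;  // services that call this foundational service
        int instancesNormal = 50;  // client instances per service in daily operation
        int instancesPeak = 120;   // client instances per service after scale-out (assumed)

        // Each (method, client service, client instance IP) combination is a separate record.
        System.out.println("normal: " + errorMethods * clientServices * instancesNormal); // 25000
        System.out.println("peak:   " + errorMethods * clientServices * instancesPeak);   // 60000
    }
}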
Optimization Strategy: Based on the analysis above and after team discussion, it was agreed that the primary factor behind the data surge is the "client instance IP" label; this is the first element to optimize. To go further, we also changed how server IPs are handled: by default, server IPs are no longer used for grouping, which narrows the data down to a smaller dimension and effectively reduces the total volume. Finally, as an extra precaution, we imposed a cap on the data returned to the frontend, sending at most 1,000 records, which prevents extreme cases where an oversized payload could crash the browser. The optimization strategy is summarized below.
Optimization 1: Merge the "client IP" dimension and no longer display client IP on the frontend; the specific information can still be viewed in the details.
Optimization 2: Merge the "server IP" dimension and do not display server IPs by default. The frontend page is redesigned, and the backend provides a server IP list as a query filter, with at most 10 selections allowed.
Optimization 3: For metric queries, perform precise queries with per-metric grouping instead of reusing one label set across all metric queries. The previous "sumBy" code used a single grouping that had to accommodate HTTP, Dubbo, Apus, Thrift, Redis, DB, and other metrics:
private String sumSumOverTimeFunc(String source) {
    StringBuilder sb = new StringBuilder();
    sb.append("sum(sum_over_time(");
    sb.append(source);
    // One shared grouping for every metric type: the labels of all metrics are listed together.
    sb.append(")) by (serverIp,job,application,methodName,serviceName,dataSource,sqlMethod,sql,serverEnv,serverZone,containerName,method,clientProjectId,clientProjectName,clientEnv,clientIp) ");
    return sb.toString();
}
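For illustration, calling the previous method with a hypothetical source expression shows the query it generated; every label, including clientIp and serverIp, is a grouping dimension:

// Hypothetical usage; the source expression is only an example.
String source = "dubboProviderSLAError{application=\"demo-app\"}[1m]";
String query = sumSumOverTimeFunc(source);
// query (wrapped here for readability) =>
// sum(sum_over_time(dubboProviderSLAError{application="demo-app"}[1m]))
//   by (serverIp,job,application,methodName,serviceName,dataSource,sqlMethod,sql,serverEnv,
//       serverZone,containerName,method,clientProjectId,clientProjectName,clientEnv,clientIp)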
The optimized "sumBy" code is shown below; each metric is now grouped precisely by its own labels via "sumBy", which improves the speed of the Prometheus queries:
private String sumSumOverTimeFunc(String source, String metric, String sumBy) {
    StringBuilder sb = new StringBuilder();
    sb.append("sum(sum_over_time(");
    sb.append(source);
    sb.append(")) ");
    if (StringUtils.isNotBlank(sumBy)) {
        // A caller-supplied grouping takes precedence.
        sb.append(" by (").append(sumBy).append(")");
    } else {
        // Otherwise each metric is grouped only by the labels it actually carries.
        switch (metric) {
            case "dubboProviderSLAError":
                // clientIp is merged away and serverIp is not a grouping dimension for this metric.
                sb.append(" by (application,methodName,serviceName,serverEnv,serverZone,clientProjectName,clientEnv) ");
                break;
            case "dubboConsumerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "dubboProviderError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "httpError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "httpClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "redisError":
                sb.append(" by (serverIp,application,method,serverEnv,serverZone) ");
                break;
            case "dbError":
                sb.append(" by (serverIp,application,dataSource,sqlMethod,sql,serverEnv,serverZone) ");
                break;
            case "grpcClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "grpcServerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "thriftServerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "thriftClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "apusServerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "apusClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "oracleError":
                sb.append(" by (serverIp,application,dataSource,sqlMethod,sql,serverEnv,serverZone) ");
                break;
            case "elasticsearchClientError":
                sb.append(" by (serverIp,application,dataSource,sqlMethod,sql,serverEnv,serverZone) ");
                break;
            default:
                sb.append(" by (serverIp,application,methodName,serviceName,dataSource,sqlMethod,sql,serverEnv,serverZone,containerName,method,clientProjectId,clientProjectName,clientEnv) ");
        }
    }
    return sb.toString();
}
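As a quick check, calling the optimized method for the problematic metric (again with a hypothetical source expression) now yields a much smaller grouping, with neither clientIp nor serverIp:

// Hypothetical usage; passing null for sumBy falls through to the per-metric grouping.
String source = "dubboProviderSLAError{application=\"demo-app\"}[1m]";
String query = sumSumOverTimeFunc(source, "dubboProviderSLAError", null);
// query (wrapped here for readability) =>
// sum(sum_over_time(dubboProviderSLAError{application="demo-app"}[1m]))
//   by (application,methodName,serviceName,serverEnv,serverZone,clientProjectName,clientEnv)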
Optimization 4: Frontend Support for Server IP Filtering
Optimization 5: Limiting Maximum Data Load
This is a fallback, lossy optimization: if the error volume remains enormous even after merging clientIp and serverIp, we cap the data sent to the frontend to prevent the page from crashing. The frontend receives at most 1,000 records; anything beyond that is not loaded onto the page. Since this limit only takes effect after the dimension merges have already shrunk the result set, it serves purely as a safety net for extreme cases.
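A minimal sketch of this cap, assuming the backend assembles the query results into a list before returning them to the frontend (the class and method names here are hypothetical, not the actual oz-hera implementation):

import java.util.List;

public class MetricResultLimiter {

    // Hypothetical cap matching the 1,000-record limit described above.
    private static final int MAX_FRONTEND_RECORDS = 1000;

    // Truncate the result list before it is serialized and sent to the page.
    public static <T> List<T> limitForFrontend(List<T> results) {
        if (results == null || results.size() <= MAX_FRONTEND_RECORDS) {
            return results;
        }
        return results.subList(0, MAX_FRONTEND_RECORDS);
    }
}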
Adjustments Involved in this Optimization: After the page optimization, the serverIp dimension is no longer displayed by default. When navigating from an alert to the metric page, the frontend needs to carry the server IP list, the relevant checkboxes must be selected, and queries are then performed based on the selected serverIp values.
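For illustration, one way the backend could turn the selected server IPs (at most 10) into a Prometheus label matcher is a regex match on serverIp; the helper below is a hypothetical sketch rather than the actual oz-hera code, and it leaves regex characters such as dots unescaped for brevity:

import java.util.List;

public class ServerIpFilterBuilder {

    // Build a PromQL label matcher such as serverIp=~"10.0.0.1|10.0.0.2" from the selected IPs.
    public static String buildServerIpMatcher(List<String> selectedIps) {
        if (selectedIps == null || selectedIps.isEmpty()) {
            return ""; // no filter: serverIp stays merged by default
        }
        if (selectedIps.size() > 10) {
            throw new IllegalArgumentException("At most 10 server IPs may be selected");
        }
        return "serverIp=~\"" + String.join("|", selectedIps) + "\"";
    }
}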
After the optimization, the data loading volume on the page has been significantly reduced. Normal data queries are now completed within 1 second, and even when dealing with larger data volumes, loading typically remains under 3 seconds.
During periods of particularly high traffic we may still encounter performance issues, and we have made these optimizations and adjustments to make the program more robust.