apache / ozhera

Application Observable Platform in the Cloud Native Era
https://ozhera.m.one.mi.com
Apache License 2.0

Optimization during OzHera's major promotion period #125

Closed wtt40122 closed 4 months ago

wtt40122 commented 10 months ago

During periods of particularly high traffic we may encounter performance issues; we have made some optimizations and adjustments to make the program more robust.

gaoxh commented 10 months ago

Background of the Issue:

As the Double 11 (November 11th) shopping festival approached, almost all business units carried out load testing and drills on their services in advance in order to better execute the related promotional activities, and adjusted their capacity based on the drill results. In addition, this year the group's product launch event was scheduled close to Double 11, which drew even more attention to the event. As a result, teams reserved a larger buffer and scaled out their infrastructure to be on the safe side. Everything was in place, waiting for the launch event to begin.

The Scene of the Incident:

After thorough preparations, the entire group eagerly entered the live broadcast segment of the launch event. During the live sales and promotional phase, a sudden surge in traffic and a sharp increase in data volume caused error messages to grow exponentially. The group's primary application performance monitoring tool, "oz-hera", began to show abnormal service metrics. As traffic poured in, the number of error messages kept climbing and the amount of error data being queried grew rapidly as well. Because the Prometheus API supports neither pagination nor result limits, all metric data was fetched at once; the resulting payload exceeded what the page could load and ultimately caused the page to crash.

Investigation Data:

  1. Prometheus raw data query time - approximately 50 seconds.
  2. Backend data query time exceeded 1 minute in total.
  3. Page loading took more than 2 minutes, and the page failed to load properly, resulting in a crash. Because the data took so long to appear, multiple users kept refreshing the page, which increased the query load on Prometheus, slowed queries down further, and made the wait even longer.

Problem Analysis

Observation showed that, because there was no pagination, the backend imposed no overall limit on the result size and the frontend had no segmented-loading design. A single request therefore loaded an excessive amount of data, ranging from 6,000 to more than 10,000 records; a rough estimate suggests the payload sent to the frontend could reach gigabytes, ultimately leading to long waits and page crashes. Conclusion: the main reasons the page crashed during queries are as follows:

  1. The excessive volume of retrieved data made the Prometheus query itself slow, and converting the returned metric data into a format the frontend can display took additional time, further extending the wait.
  2. The query results were too large, leading to long frontend parsing and loading times, excessive memory consumption, and browser crashes.

Discussion on Problem Solving Approach

Based on the problem analysis above, the goal is to speed up metric queries and shrink the query results, which will effectively improve page loading speed.

Introduction of the Problematic Metric: To improve query speed, let's first look at the problematic metric, "dubboProviderSLA", which records the volume of abnormal data in calls to Dubbo service providers. To understand what this metric means, note that the service quality of a Dubbo service cannot be judged from server-side errors alone: under high load, a client request may fail before it ever reaches the server, or it may reach the server but time out before a response comes back. In such cases the server's own metrics are not enough to reflect the actual problem, so a metric is needed that measures the availability of Dubbo services from the client's perspective. We define this metric as "dubboProviderSLA".

The characteristics of "dubboProviderSLA" are as follows: as an objective measure of the availability of Dubbo services from the client's perspective, it naturally has to record client-related information, such as the client application's ID, name, environment, and the instance IP of the client application.

Root Cause Analysis of the Problem: Under normal circumstances, suppose the server reports errors in 5 methods and the service is a commonly used foundational service called by 100 client services, each with 50 instances. Given Prometheus's label-based data model, that alone produces 5 × 100 × 50 = 25,000 records. During major promotional events, the main driver of data growth is the number of service instances, i.e. the client IPs: to withstand the impact of high external traffic, services are scaled up, frequently to dozens or even more than 100 instances, multiplying the data volume accordingly.
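As a quick sanity check of that arithmetic, here is a minimal sketch using only the illustrative figures from the paragraph above (not measured values):

    // Rough series-count estimate for dubboProviderSLA; all figures are illustrative.
    int failingMethods  = 5;    // provider methods reporting errors
    int clientServices  = 100;  // distinct client applications calling the service
    int instancesNormal = 50;   // client instances per application under normal load
    int instancesPromo  = 100;  // client instances per application after promotional scale-up

    int seriesNormal = failingMethods * clientServices * instancesNormal;  // 25,000
    int seriesPromo  = failingMethods * clientServices * instancesPromo;   // 50,000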

Optimization Plan:

Optimization Strategy: Based on the analysis above and after discussion within the team, we agreed that the primary factor behind the data surge is the "client instance IP" label, so that is the first thing to optimize. To go further, we also changed how server IPs are handled: by default, results are no longer grouped by server IP, which narrows the data down to a smaller dimension and effectively reduces the total volume. Finally, as a precaution, we cap the data returned to the frontend at 1000 records, which guards against extreme cases where an excessively large payload could crash the browser. The optimization strategy is summarized as follows:

Optimization 1: Merge the "client IP" dimension and no longer display the client IP in the frontend; the specific information can still be viewed in the details.

Optimization 2: Merge the "server IP" dimension and do not display server IP addresses by default. The frontend page will be redesigned, and the backend will provide a server IP list as a query filter, allowing a maximum of 10 selections.

Optimization 3: For metric queries, perform precise queries with per-metric grouping instead of reusing one label set across all metric queries. The previous "sumBy" code used a single grouping that had to accommodate every metric query, including HTTP, Dubbo, Apus, Thrift, Redis, DB, and others:

private String sumSumOverTimeFunc(String source) {
    // Wrap the source expression in sum(sum_over_time(...)) and group by one
    // label set shared by every metric type (HTTP, Dubbo, Redis, DB, ...).
    StringBuilder sb = new StringBuilder();
    sb.append("sum(sum_over_time(");
    sb.append(source);
    sb.append(")) by (serverIp,job,application,methodName,serviceName,dataSource,sqlMethod,sql,serverEnv,serverZone,containerName,method,clientProjectId,clientProjectName,clientEnv,clientIp) ");
    return sb.toString();
}
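For illustration, a call with a hypothetical source selector (the metric name and time range below are made up) produces a single query grouped by that entire shared label set, whether or not the metric actually carries all of those labels:

    // Hypothetical usage; only the output shape is taken from the method above.
    String q = sumSumOverTimeFunc("httpError_total{application=\"shop-order\"}[30s]");
    // q == "sum(sum_over_time(httpError_total{application=\"shop-order\"}[30s]))
    //        by (serverIp,job,application,methodName,serviceName,dataSource,sqlMethod,sql,serverEnv,
    //            serverZone,containerName,method,clientProjectId,clientProjectName,clientEnv,clientIp) "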

The optimized "sumBy" code is shown below. Each metric is now grouped precisely by its own label set via "sumBy", with the aim of speeding up the Prometheus query:

private String sumSumOverTimeFunc(String source, String metric, String sumBy) {
    // Wrap the source expression in sum(sum_over_time(...)); group either by the
    // caller-supplied sumBy labels or, when none is given, by a label set chosen
    // per metric so that each query only carries the labels it actually needs.
    StringBuilder sb = new StringBuilder();
    sb.append("sum(sum_over_time(");
    sb.append(source);
    sb.append(")) ");
    if (StringUtils.isNotBlank(sumBy)) {
        sb.append(" by (").append(sumBy).append(")");
    } else {
        switch (metric) {
            // dubboProviderSLA: clientIp and serverIp are dropped to cut cardinality.
            case "dubboProviderSLAError":
                sb.append(" by (application,methodName,serviceName,serverEnv,serverZone,clientProjectName,clientEnv) ");
                break;
            case "dubboConsumerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "dubboProviderError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "httpError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "httpClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "redisError":
                sb.append(" by (serverIp,application,method,serverEnv,serverZone) ");
                break;
            case "dbError":
                sb.append(" by (serverIp,application,dataSource,sqlMethod,sql,serverEnv,serverZone) ");
                break;
            case "grpcClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "grpcServerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "thriftServerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "thriftClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "apusServerError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "apusClientError":
                sb.append(" by (serverIp,application,methodName,serviceName,serverEnv,serverZone) ");
                break;
            case "oracleError":
                sb.append(" by (serverIp,application,dataSource,sqlMethod,sql,serverEnv,serverZone) ");
                break;
            case "elasticsearchClientError":
                sb.append(" by (serverIp,application,dataSource,sqlMethod,sql,serverEnv,serverZone) ");
                break;
            // Fallback: the old shared label set (minus clientIp) for any other metric.
            default:
                sb.append(" by (serverIp,application,methodName,serviceName,dataSource,sqlMethod,sql,serverEnv,serverZone,containerName,method,clientProjectId,clientProjectName,clientEnv) ");
        }
    }
    return sb.toString();
}
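For illustration, a hypothetical call for the dubboProviderSLA error metric (the selector string is made up and is not OzHera's actual query source); since no explicit sumBy is passed, the switch above selects the reduced label set without clientIp or serverIp:

    // Hypothetical usage; only the grouping behaviour is taken from the method above.
    String q = sumSumOverTimeFunc(
            "dubboProviderSLAError_total{application=\"shop-order\"}[30s]",  // made-up selector
            "dubboProviderSLAError",
            null);                                                           // no explicit sumBy
    // q == "sum(sum_over_time(dubboProviderSLAError_total{application=\"shop-order\"}[30s]))
    //        by (application,methodName,serviceName,serverEnv,serverZone,clientProjectName,clientEnv) "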

Optimization 4: Frontend Support for Server IP Filtering

  1. First, the backend provides a list of the IPs that reported errors within the specified time range, which the frontend can use as multi-select query conditions for filtering (see the sketch after this list).
  2. On the frontend, when querying "dubboProviderSLA" metrics, the server IP field is not loaded by default; once server IPs are selected from the list, the results page displays the server IP column again.
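A minimal sketch of how the backend could obtain that error-IP list, assuming it uses Prometheus's standard label-values HTTP API (/api/v1/label/<name>/values with match[], start, and end parameters); the Prometheus address and the metric selector are assumptions for illustration, not OzHera's actual implementation:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class ServerIpLister {

        // Returns Prometheus's raw JSON answer, e.g. {"status":"success","data":["10.0.3.21", ...]};
        // JSON parsing and error handling are omitted from this sketch.
        public static String fetchErrorServerIps(String prometheusBase, long startEpochSec, long endEpochSec)
                throws Exception {
            String selector = URLEncoder.encode("dubboProviderSLAError_total", StandardCharsets.UTF_8); // made-up metric name
            String url = prometheusBase + "/api/v1/label/serverIp/values"
                    + "?match%5B%5D=" + selector      // match[] (URL-encoded) restricts label values to that metric
                    + "&start=" + startEpochSec
                    + "&end=" + endEpochSec;

            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }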

Optimization 5: Limiting Maximum Data Load

This is a fallback, lossy optimization. If, even after merging clientIp and serverIp, the volume of error data remains enormous, we limit the data sent to the frontend to prevent page crashes: at most 1000 records are returned, and anything beyond 1000 records is not loaded onto the page (a minimal sketch follows the list below). The feasibility of this approach rests on the following reasons:

  1. Currently, it is estimated that after merging clientIp and serverIp, the normal number of error records falls in the range of tens to hundreds. If it exceeds 1000, that is mainly because IP labels are still present; since the same error is merely scattered across different IPs, a single IP is enough to analyze and locate the problem, so losing some IP instances is acceptable.
  2. If, even after removing clientIp and serverIp, the error volume still exceeds 1000, the accumulation of repeated data lies in the downstream-service dimension, because the number of methods on the server side will not normally reach 1000. In that case, the lost records do not reduce the value of the data for troubleshooting on the server side.
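A minimal sketch of the fallback cap (the class and method names are illustrative, not actual OzHera code):

    import java.util.ArrayList;
    import java.util.List;

    class FrontendLimiter {
        // Anything beyond the first MAX_ROWS records is dropped before the response is sent to the page.
        private static final int MAX_ROWS = 1000;

        static <T> List<T> capForFrontend(List<T> rows) {
            return rows.size() <= MAX_ROWS ? rows : new ArrayList<>(rows.subList(0, MAX_ROWS));
        }
    }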

Adjustments Involved in this Optimization: After the page optimization, the serverIp dimension is no longer displayed by default. When navigating from an alert to the metric page, the frontend needs to handle the server IP list and present it as checkboxes to be selected; queries are then performed based on the selected serverIp values.

Optimization Results:

After the optimization, the data loading volume on the page has been significantly reduced. Normal data queries are now completed within 1 second, and even when dealing with larger data volumes, loading typically remains under 3 seconds.