alibaba / Sentinel

A powerful flow control component enabling reliability, resilience and monitoring for microservices. (A high-availability flow control and protection component for cloud-native microservices.)
https://sentinelguard.io/
Apache License 2.0

Under concurrent traffic, (possibly) 1. the rate limit is inaccurate; 2. all requests are blocked #957

Open srctar opened 5 years ago

srctar commented 5 years ago

Under concurrent traffic, the rate limit may be inaccurate (more traffic passes than allowed; reproduces consistently)

Under heavy concurrent load, rate limiting may be inaccurate. The inaccuracy comes from the entry slot chain: StatisticSlot records the statistics and FlowSlot reads them, but the two steps are not mutually exclusive. On my machine (i5 8400, 16 GB, macOS 10.14.5) there is a gap of roughly 2 ms between them, and that gap is enough for subsequent threads to get past the FlowSlot limit.

In other words, FlowSlot uses StatisticSlot's data without waiting for a lock, so it may act on stale data and let a lot of extra traffic through.
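
The gap described above is a check-then-act race. The following is a minimal sketch of that pattern in plain Java, not Sentinel's code: `CheckThenActRace`, `passedInWindow` and `run` are hypothetical names, and a single never-resetting counter stands in for the real sliding-window statistics. Each task first checks the counter against the limit and only afterwards records its pass, so several threads can see the same stale value.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CheckThenActRace {
    static final int LIMIT = 5;

    // Submits `requests` tasks to a pool of `threads` workers and returns how many were admitted.
    static int run(int requests, int threads) {
        AtomicInteger passedInWindow = new AtomicInteger(); // hypothetical stand-in for the statistics
        AtomicInteger admitted = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers at once to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (passedInWindow.get() < LIMIT) {   // check: may read a stale value
                    Thread.yield();                   // widen the gap between check and record
                    passedInWindow.incrementAndGet(); // act: recorded only after the check
                    admitted.incrementAndGet();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return admitted.get();
    }

    public static void main(String[] args) {
        System.out.println("limit=" + LIMIT + ", admitted=" + run(1000, 50));
    }
}
```

The admitted count is never below `LIMIT` and, on a multi-core machine, is typically above it, which matches the "more traffic than the limit" symptom in spirit only; the real slot chain is more involved.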

Under concurrent traffic, requests may all be blocked (the traffic that passes is below the configured threshold; this occurs only when tasks are scheduled through a thread pool)

This problem reproduces only in a thread pool environment, and there it reproduces consistently. I have looked at it for a long time without finding the cause. Symptom: for a period that can last more than 5 s, traffic keeps arriving and the limit is at least 1, yet every request is blocked.

The two problems usually appear together: the first shows up, and the second follows about two or three seconds later. The first lets through more traffic than the limit; the second may block all traffic (and appears only in a thread pool environment).

How to reproduce

  1. In the main thread, keep creating child threads; each child thread uses the coding style Sentinel recommends (this simulates a Tomcat environment).
    public static void main(String[] xxx) throws Exception {
        XXX x = new XXX();
        int j = 99999999;
        while (j-- > 0) {
            try {
                final int av = j;
                // 打印一个SystemOut: a method that prints one line via System.out
                executor.execute(() -> x.打印一个SystemOut(av));
            } catch (Exception E) {
                E.printStackTrace();
            }
            // With this sleep enabled, the problem no longer reproduces
            /*TimeUnit.MILLISECONDS.sleep(20L);*/
        }
        System.out.println("shut down");
    }
  2. Have each thread call the method using the coding style Sentinel recommends.
  3. While the program is running, enable rate limiting (turn on the limit in sentinel-dashboard).

Tell us your environment

mac os x 10.14.5, jdk8u225, eclipse

Anything else we need to know?

Synchronizing at the point where the rate limiter reads the current QPS resolves both problems.

Full code to reproduce the problem:
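
One way to read this suggestion, as a sketch only: make "read the current count" and "record the pass" a single step under one lock. `SynchronizedLimiter` and `tryPass` are hypothetical names, the counter never resets, and this is not a patch against Sentinel's slot chain.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SynchronizedLimiter {
    private final int limit;
    private int passedInWindow; // guarded by the intrinsic lock on "this"

    SynchronizedLimiter(int limit) {
        this.limit = limit;
    }

    // Check and record happen under the same lock, so no thread can act on a stale count.
    synchronized boolean tryPass() {
        if (passedInWindow >= limit) {
            return false;
        }
        passedInWindow++;
        return true;
    }

    // Fires `requests` tasks from a pool of `threads` workers and returns how many were admitted.
    static int run(int limit, int requests, int threads) {
        SynchronizedLimiter limiter = new SynchronizedLimiter(limit);
        AtomicInteger admitted = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                if (limiter.tryPass()) {
                    admitted.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return admitted.get();
    }

    public static void main(String[] args) {
        System.out.println("admitted=" + run(5, 1000, 50));
    }
}
```

With the check and the increment under one lock, exactly `limit` requests are admitted in this single-window sketch. The cost is that every request serializes on the lock, which may matter on a hot path.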

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class ZZZ {

    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    private static Executor executor = Executors.newFixedThreadPool(50);

    private static String pre;

    private static int size = 99990000;

    public void justPrint(int i) {
        try (Entry entry = SphU.entry("XXX.sysGood11")) {
            String print = sdf.format(new Date());
            if (!print.equalsIgnoreCase(pre)) {
                System.out.println();
                pre = print;
            }
            System.out.println(print + "\t" + i);
        } catch (Throwable e) {
            // Note: a BlockException thrown by SphU.entry (request rejected) is silently swallowed here.
        } finally {
        }
    }

    public static void main(String[] xxx) throws Exception {
        ZZZ x = new ZZZ();
        int j = size;

        initFlowQpsRule();

        while (j-- > 0) {
            try {
                final int p = j;
                executor.execute(() -> x.justPrint(p));
                /*new Thread(() -> x.justPrint(p)).start()*/
            } catch (Exception E) {
                E.printStackTrace();
            }
            /*TimeUnit.MILLISECONDS.sleep(20L);*/
        }
        System.out.println("shut down");
    }

    private static void initFlowQpsRule() {
        List<FlowRule> rules = new ArrayList<>();
        FlowRule rule = new FlowRule("XXX.sysGood11");
        // set limit qps to 5
        rule.setCount(5);
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setLimitApp("default");
        rules.add(rule);
        FlowRuleManager.loadRules(rules);
    }
}
linlinisme commented 5 years ago

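Judging from the results, extra traffic does get through, but this has little to do with StatisticSlot recording data and FlowSlot reading it "not being preemptive". In the default slot chain, FlowSlot runs first and StatisticSlot afterwards, so StatisticSlot counts after the fact; one runs before the other and the two do not compete. Another point is that FlowSlot reads averaged statistics from StatisticSlot, and that read rounds down, so the traffic is undercounted. The effect is very small, though, and depends only on the size of the time window; see (int)(node.passQps()) in the code below:

    private int avgUsedTokens(Node node) {
        if (node == null) {
            return DEFAULT_AVG_USED_TOKENS;
        }
        return grade == RuleConstant.FLOW_GRADE_THREAD ? node.curThreadNum() : (int)(node.passQps());
    }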
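
To make the rounding-down point concrete, here is a small worked sketch. `truncatedQps` is a hypothetical helper that mirrors only the `(int)` cast on an averaged rate; the pass counts and window lengths are made-up inputs, not values taken from Sentinel.

```java
public class TruncationExample {
    // Hypothetical helper: average rate over a window, truncated the way an (int) cast truncates.
    static int truncatedQps(long passCount, double intervalSec) {
        return (int) (passCount / intervalSec);
    }

    public static void main(String[] args) {
        // 11 passes over a 2-second window average 5.5 per second; the cast drops the .5.
        System.out.println(truncatedQps(11, 2.0)); // prints 5
        // With a 1-second window the quotient is already a whole number, so nothing is lost.
        System.out.println(truncatedQps(5, 1.0));  // prints 5
    }
}
```

The undercount is always less than one request per second, and it disappears when the window is exactly one second long, which is consistent with the remark that the effect depends only on the window size.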
srctar commented 5 years ago

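Thanks for the reply.

We are describing similar things, but perhaps from different angles. The two slots do run one after the other, and indeed they do not compete. However, the earlier one, FlowSlot, relies on data recorded by the later one, StatisticSlot, so under high concurrency the data the earlier one reads may be inaccurate. That is why I used the word "preemptive".

Regarding the time window: unless I have missed something, the per-second implementation is a sliding window made of two consecutive 500 ms buckets, which is a fixed size. So the quotient rollingCounterInSecond.pass() / rollingCounterInSecond.getWindowIntervalInSec() is the QPS that has already passed over that fixed interval. This supports the point above that some traffic may slip through under high concurrency.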
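
The two-bucket window described here can be sketched as follows. This is a simplified model with hypothetical names (`TwoBucketWindow`, `addPass`, `pass`, `passQps`) and an injected clock so the behaviour is deterministic; it assumes two 500 ms buckets as stated above and is not Sentinel's actual implementation.

```java
import java.util.Arrays;

public class TwoBucketWindow {
    static final int BUCKET_MS = 500;
    static final int BUCKETS = 2;
    static final int INTERVAL_MS = BUCKET_MS * BUCKETS;

    private final long[] bucketStart = new long[BUCKETS];
    private final long[] passCount = new long[BUCKETS];

    TwoBucketWindow() {
        Arrays.fill(bucketStart, -1L); // -1 marks a bucket that has never been used
    }

    // Records one passed request at time nowMs, resetting the bucket if it belongs to an older period.
    synchronized void addPass(long nowMs) {
        int idx = (int) ((nowMs / BUCKET_MS) % BUCKETS);
        long start = nowMs - nowMs % BUCKET_MS;
        if (bucketStart[idx] != start) {
            bucketStart[idx] = start;
            passCount[idx] = 0;
        }
        passCount[idx]++;
    }

    // Sums the buckets that started less than one full interval before nowMs.
    synchronized long pass(long nowMs) {
        long sum = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketStart[i] >= 0 && nowMs - bucketStart[i] < INTERVAL_MS) {
                sum += passCount[i];
            }
        }
        return sum;
    }

    // Passed count divided by the fixed interval length in seconds.
    double passQps(long nowMs) {
        return pass(nowMs) / (INTERVAL_MS / 1000.0);
    }

    public static void main(String[] args) {
        TwoBucketWindow w = new TwoBucketWindow();
        w.addPass(100);
        w.addPass(200);
        w.addPass(600);
        System.out.println(w.passQps(700));  // both buckets are current: 3.0
        System.out.println(w.passQps(1100)); // the first bucket has expired: 1.0
    }
}
```

Because the interval is fixed at one second, the quotient is simply the number of passes already recorded in the two current buckets. A request that has been checked but not yet recorded is not part of that number.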


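The second problem: in a thread pool environment under highly concurrent traffic, all requests end up being blocked. I have considered many possible causes and still cannot explain it (the ZZZ code I posted reproduces it). If any expert has time, I would appreciate some guidance.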