ZLMediaKit / ZLToolKit

A lightweight C++11-based network framework; it uses a thread-pool design to achieve highly concurrent network IO.
MIT License

About the bug of Socket::enableRecv used in epoll edge-triggered mode #237

Closed ss002012 closed 4 months ago

ss002012 commented 4 months ago

First, let's talk about the usage scenario:

I'm using ZLToolKit to implement a TCP proxy, i.e. an intermediate TCP server that, based on the client's request, forwards the request content (acting as an intermediate TCP client) to the real backend TCP service, and then relays the backend's response back to the requesting client. The flow is: tcp client----(server)tcp proxy(client)-----tcpserver.

Sometimes the backend tcpserver sends data much faster than the client can process it, so forwarding each chunk directly with send would make memory usage balloon. My approach is to track how much data has not yet been sent to the client: once it exceeds a threshold, I call Socket::enableRecv(false) to stop listening for reads from the tcpserver, and once it drops back below a threshold I enable reading again. In practice, however, after a few Socket::enableRecv(false)/Socket::enableRecv(true) cycles I stop receiving any further data from the backend tcpserver.
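For illustration, here is a minimal sketch of this kind of backpressure scheme, assuming a ZLToolKit-style Socket API (Socket::Ptr, setOnRead, send, enableRecv, setOnFlush); the pending-byte accounting and the HIGH_WATER threshold are hypothetical glue code, not part of the library:

```cpp
// Sketch only: Socket names follow ZLToolKit, but the pending-byte
// accounting (pending, HIGH_WATER) is hypothetical glue code.
#include <atomic>
#include <memory>
#include "Network/Socket.h"

using toolkit::Buffer;
using toolkit::Socket;

static constexpr size_t HIGH_WATER = 4 * 1024 * 1024; // pause reads above this

void bridge(const Socket::Ptr &upstream /* backend tcpserver */,
            const Socket::Ptr &client   /* requesting tcp client */) {
    auto pending = std::make_shared<std::atomic<size_t>>(0);

    // Forward backend data to the client; pause reading from the backend
    // when too much is still queued for the slow client.
    upstream->setOnRead([=](const Buffer::Ptr &buf, struct sockaddr *, int) {
        *pending += buf->size();
        client->send(buf);
        if (*pending >= HIGH_WATER) {
            upstream->enableRecv(false); // stop epoll read events on upstream
        }
    });

    // When the client's send queue drains, resume reading from the backend.
    // This enableRecv(true) is exactly where the edge-triggered bug bites:
    // data that arrived while reads were disabled produces no new edge.
    client->setOnFlush([=]() {
        *pending = 0;
        upstream->enableRecv(true);
        return true; // keep this flush callback installed
    });
}
```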

I'm now fairly certain this is an epoll edge-triggered issue. First, the problem disappears when I disable epoll and fall back to select. I then tried switching the affected fd to level-triggered mode, but found that in EventPoller::addEvent level-triggered mode cannot coexist with the EPOLLEXCLUSIVE flag: the writable event fires continuously and cannot be switched off, pegging the CPU. Disabling EPOLLEXCLUSIVE in level-triggered mode finally gave me a preliminary fix.
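For reference, at the raw epoll level the two registration modes in question look roughly like this (plain Linux C++, independent of ZLToolKit; the epfd and fd are assumed to exist):

```cpp
#include <sys/epoll.h>

// Register fd in the mode at issue here: edge-triggered plus EPOLLEXCLUSIVE.
// Readiness is reported only on state changes, so a disable/enable cycle on
// the fd can miss data that arrived in between.
void add_edge_exclusive(int epfd, int fd) {
    struct epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET | EPOLLEXCLUSIVE;
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

// Level-triggered registration: EPOLLIN stays pending while unread data
// exists (no lost wakeups), but a level-triggered EPOLLOUT fires on every
// poll while the socket is writable. Since an fd added with EPOLLEXCLUSIVE
// cannot be changed later via EPOLL_CTL_MOD, the writable event cannot be
// switched off again, which is the CPU spin described above; hence
// EPOLLEXCLUSIVE has to be dropped in this mode.
void add_level(int epfd, int fd) {
    struct epoll_event ev{};
    ev.events = EPOLLIN | EPOLLOUT; // no EPOLLET, no EPOLLEXCLUSIVE
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```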

That workaround is not elegant, though, and level-triggered mode performs worse than edge-triggered mode, but I haven't been able to solve the problem in edge-triggered mode. Any guidance would be appreciated.

xia-chu commented 4 months ago

This issue does indeed exist. You can work around it by manually triggering Socket::onRead when receiving is re-enabled.

A PR to fix this would be welcome.
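A minimal sketch of that suggestion (the member names _enable_recv, updateEventInPoller, and the manual onRead call are illustrative approximations, not the actual Socket source):

```cpp
// Illustrative only: approximates how Socket::enableRecv could apply the fix.
void Socket::enableRecv(bool enabled) {
    if (_enable_recv == enabled) {
        return;
    }
    _enable_recv = enabled;
    // Re-register the read event with the poller (details elided).
    updateEventInPoller();
    if (enabled) {
        // Edge-triggered epoll only reports new arrivals. Bytes that landed
        // while reads were disabled generated no fresh edge, so drain the
        // socket once by hand; this also re-arms the edge trigger.
        onRead(_sock_fd);
    }
}
```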

ss002012 commented 4 months ago

Thanks, I'll try your suggested solution first.

ss002012 commented 4 months ago

We tested this solution in our cloud environment and it does work, but our code is more than a year old and differs from the latest implementation. The corresponding change to the latest code should be as follows, though I still need some time to verify it. [screenshot of the proposed patch]

PioLing commented 4 months ago

https://github.com/xia-chu/TcpProxy

xia-chu commented 4 months ago

> We tested this solution in our cloud environment and it does work, but our code is more than a year old and differs from the latest implementation. The corresponding change to the latest code should be as follows, though I still need some time to verify it.

I think this modification is fine, but KQUEUE seems to be edge-triggered as well.

ss002012 commented 4 months ago

An interesting observation:
Code from after the commit that prevents errors when modifying fd events no longer exhibits the problem we described, where reading is re-enabled but never triggers. Code from before that commit, patched according to the scheme above, works under epoll (I have no way to test KQUEUE). So with the latest code no further change is needed, unless we add EPOLLEXCLUSIVE back later.
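A plausible explanation for why those two changes interact (this follows from the epoll_ctl man page rather than from the commit itself): an fd registered with EPOLLEXCLUSIVE cannot have its event mask changed afterwards, so any pause/resume scheme built on EPOLL_CTL_MOD breaks once that flag is present. A self-contained demonstration:

```cpp
#include <cerrno>
#include <cstdio>
#include <sys/epoll.h>
#include <unistd.h>

int main() {
    int epfd = epoll_create1(0);
    int fds[2];
    pipe(fds); // any pollable fd works for this demonstration

    struct epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET | EPOLLEXCLUSIVE;
    ev.data.fd = fds[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

    // The kernel rejects EPOLL_CTL_MOD on an fd added with EPOLLEXCLUSIVE,
    // so read interest cannot be toggled on it after registration.
    ev.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, fds[0], &ev) == -1 && errno == EINVAL) {
        std::puts("EPOLL_CTL_MOD on an EPOLLEXCLUSIVE fd fails with EINVAL");
    }

    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return 0;
}
```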

ss002012 commented 4 months ago

> https://github.com/xia-chu/TcpProxy

Thanks for the pointer; I'll take a look at it.

xia-chu commented 4 months ago

> [...] patched according to the scheme above, works under epoll (I have no way to test KQUEUE) [...]

Then I won't merge your changes for now.

ss002012 commented 4 months ago

"> > 改方案测试epoll下是可用的(KQUEUE我没有条件 " "> " "> 那你这个修改我先不合入了 "

OK, no need to merge it for now; testing shows no problems.
