apache / rocketmq

Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
https://rocketmq.apache.org/
Apache License 2.0
21.24k stars 11.69k forks source link

[Bug] Response should not be discarded even the channel is unwritable #8159

Open redlsz opened 6 months ago

redlsz commented 6 months ago

Before Creating the Bug Report

Runtime platform environment

centos

RocketMQ version

develop

JDK Version

1.8

Describe the Bug

org.apache.rocketmq.remoting.netty.NettyRemotingServer.NettyServerHandler

image

org.apache.rocketmq.proxy.remoting.activity.AbstractRemotingActivity#writeResponse

image

Response should not be discarded even the channel is temporarily unwritable. Instead, we should make sure to write response to buffer and pause channel reading.

Steps to Reproduce

/

What Did You Expect to See?

/

What Did You See Instead?

/

Additional Context

No response

drpmma commented 5 months ago

I understand that the purpose of isWriteable is primarily to protect the server from memory overflow and to do backpressure, acting as a sort of fail-fast mechanism. It serves a definite purpose, and I am more inclined towards keeping this piece of logic.

lizhimins commented 5 months ago

Not responding mainly aims to prevent backpressure from causing server-side OOM errors. Furthermore, there’s a safeguard in NettyServerHandler#channelWritabilityChanged to ensure that an excess of responses isn’t dispatched (a similar approach is adopted by other products). In most cases, connections originate from the client. By not reading and processing requests from the client, the server can remove the channel through an application layer schedule task scan. From the above issue, it's not entirely clear what problem this is intended to solve. Is it possible to adjust the server watermark to address this?

不写回 response 主要就是为了防止反压导致服务端 oom,然后 NettyServerHandler#channelWritabilityChanged 这里还有个保护(其他产品也是类似的做法)保证不会丢到过多的 response。大多数情况下,连接来自客户端,不处理来自客户端的请求之后,服务端可以通过应用层 scan channel 来移除 channel,同时释放内存。从上述 issue 没太理解是为了解决什么问题,是否可以调整 server watermakr 来解决?

redlsz commented 5 months ago

@drpmma @lizhimins I think that the custom NettyServerHandler#channelWritabilityChanged can achieve the purpose of back-pressure. Once the high watermark of write buffer is reached, reading data from this channel will be suspended, which is expected to prevent the buffer from growing indefinitely.

lizhanhui commented 5 months ago

This is a place where we can further refine instead of simple keep-or-discard binary decision:

  1. Consider the request timestamp and deadline as we may have suspend read for some time: if the request has missed its deadline, request and response should be discarded accordingly;
  2. Channel has exposed a method Channel#bytesBeforeWritable(), if the bytes is just over the watermark and less than configurable max-bytes-per-connection, we should keep the response to improve QoS in case of temporary client-side busy, GC for instance;
  3. If the outbound inflight bytes within the Channel has exceeds hard limit of the channel buffer, treat it as Amplifciation DDoS Attack and discard the response;