apache / rocketmq

Apache RocketMQ is a cloud native messaging and streaming platform, making it simple to build event-driven applications.
https://rocketmq.apache.org/
Apache License 2.0
21.05k stars 11.61k forks

[Enhancement] Find ways to reduce tiered storage module's GC pressure #8408

Open bxfjb opened 1 month ago

bxfjb commented 1 month ago

Summary

When tiered storage is enabled, we observe high GC pressure at tens of thousands of pub/sub TPS, which occasionally triggers a full GC.

Motivation

Uncontrollable, unpredictable full GCs cause stop-the-world (STW) pauses that can make the service unavailable.

Describe the Solution You'd Like

  1. Use an off-heap cache instead of an on-heap cache as the read-ahead cache. Our GC logs show that the current cache eviction strategy leaves a large number of objects in the old generation. According to the discussion in this link, Ben Manes himself says that Caffeine cache is not designed for such high-load scenarios.
  2. Another possible optimization lies in the indexfile upload process. In the current design, each compressed index file is about 570MB. If the object storage SDK uploads that much data in a single call, the data will inevitably be copied into the heap, which also puts great pressure on the JVM. It might be a good idea to upload the file in multiple parts.
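As a rough illustration of item 2, the following is a minimal sketch of part-by-part uploading. It assumes the object storage SDK offers some form of multipart API. The `PartUploader` interface, the `uploadInParts` method, and the part size are hypothetical names for illustration, not RocketMQ or SDK code. The file is read through one reusable direct buffer, so only one part is buffered at a time on the reading side. Whether the SDK then copies each part onto the heap depends on the SDK.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedUploadSketch {

    /** Hypothetical stand-in for an object storage SDK's multipart upload call. */
    interface PartUploader {
        void uploadPart(int partNumber, ByteBuffer part) throws IOException;
    }

    /**
     * Reads the file in parts of at most partSize bytes through a single
     * reusable direct buffer and hands each part to the uploader.
     * Returns the number of parts uploaded.
     */
    static int uploadInParts(Path file, int partSize, PartUploader uploader) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(partSize);
        int partNumber = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            boolean eof = false;
            while (!eof) {
                buffer.clear();
                // Fill the buffer until it is full or the file ends.
                while (buffer.hasRemaining()) {
                    if (channel.read(buffer) < 0) {
                        eof = true;
                        break;
                    }
                }
                buffer.flip();
                // Skip the empty read that occurs when the file size is an exact multiple of partSize.
                if (buffer.hasRemaining()) {
                    uploader.uploadPart(++partNumber, buffer);
                }
            }
        }
        return partNumber;
    }
}
```

The uploader must consume each part before returning, because the same buffer is reused for the next part.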

Describe Alternatives You've Considered

In fact, the old generation does not look healthy even during the normal upload process. Its size fluctuates frequently, presumably because of mixed GC, so there may be something to improve in the producing process as well. [image]

Additional Context

No response

ben-manes commented 1 month ago

It looks like your cache is holding SelectBufferResult, which wraps a ByteBuffer. From your description, it sounds like it’s non-direct (on-heap). Since the buffers are long-lived, you might allocate them off-heap; the cache itself being on-heap shouldn’t be an issue. There are quirks with direct byte buffers, so they’re avoided until appropriate.

bxfjb commented 1 month ago

> It looks like your cache is holding SelectBufferResult, which wraps a ByteBuffer. From your description, it sounds like it’s non-direct (on-heap). Since they are long lived you might allocate them off heap as the cache itself being on heap shouldn’t be an issue. There’s quirks with direct byte buffers so they’re avoided until appropriate.

You're right, the key is where the ByteBuffer lives. The problem is that the data source is the object storage SDK, so placing the data on the heap first seems inevitable. Perhaps the data still needs to be copied off the heap afterwards?
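One way the copy could look is sketched below, assuming the SDK exposes the downloaded object as an `InputStream` (an assumption, since SDKs differ). `OffHeapCopySketch` and `copyToDirect` are hypothetical names. Only a small, short-lived heap chunk is used as a staging area, while the long-lived cached bytes end up in a direct buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

final class OffHeapCopySketch {

    /**
     * Copies up to length bytes from the stream into a newly allocated direct
     * (off-heap) buffer, staging them through a small heap chunk.
     * The returned buffer is flipped and ready for reading.
     */
    static ByteBuffer copyToDirect(InputStream in, int length) throws IOException {
        ByteBuffer direct = ByteBuffer.allocateDirect(length);
        // Short-lived staging array; it should die young instead of being promoted.
        byte[] chunk = new byte[Math.min(8192, Math.max(1, length))];
        int n;
        while (direct.hasRemaining()
                && (n = in.read(chunk, 0, Math.min(chunk.length, direct.remaining()))) != -1) {
            direct.put(chunk, 0, n);
        }
        direct.flip();
        return direct;
    }
}
```

Direct buffers are released only when their owning `ByteBuffer` object is collected, so a real cache would probably want to pool or explicitly manage them rather than allocate one per entry.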