Open bzed opened 2 months ago
hi there, this is a tricky issue. I don't really see a positive outcome here other than using a rate limiter for your inserter and handling the back pressure at the data producer level.
Another option would be to add much more memory to your chproxy, to bypass chproxy for data insertion, or to make ClickHouse faster 😅
No miracle will happen here.
Yes, it is indeed a tricky issue. Rate limiting in front of chproxy is in place now (and much stricter). Still, imho a program running into OOM is a bug :) Adding more resources just moves the point where the OOM happens. To solve this bug I think a completely different memory management would be needed, but yes, it's not trivial, as not all connections need the same amount of memory.
Unfortunately, we (contentsquare) don't use chproxy to insert data. This feature was built by the previous maintainers (Vertamedia) and we don't maintain it anymore.
If it were happening on select queries, we might do something (but from what I remember, the query results are either streamed or put in temporary files to avoid an OOM in such situations). But since it's about insert queries, feel free to make a PR to fix the issue. As Vianney said, it will be tricky to solve, and you should use a rate limiter to make sure it can't happen, for example by using the max_concurrent_queries parameter.
@mga-chka - I've experienced the same issue the author described: chproxy gets OOM-killed under heavy INSERT load with large batches. I ran some tests and can shed some light on the nature of this bug. It seems the issue was introduced by changes in the 1.22.0 release: 1.21.0 is stable in our environment, but 1.22 is OOM-killed about 10-20 seconds after the workload starts. At least two changes probably introduced this bug: #299 and #296. To test this, I built a custom binary from the 1.22 sources with those changes reverted, and it is stable under our load. The original 1.22 binary and the latest release binary are both OOM-killed.
One possible root cause: it may be inefficient to load every incoming request body into memory for a possible retry, because the body can be very large for INSERT-like workloads.
Describe the bug We regularly see the following issue:
To Reproduce
Expected behavior No OOM. Better memory handling. Cancel connections or let them wait before running OOM.
Environment information chproxy v1.26.2, ClickHouse 24.3.2.23