[BUG] OOM when clickhouse is slow and a lot of insert queries are sent

ContentSquare / chproxy

Open-Source ClickHouse http proxy and load balancer

https://www.chproxy.org/

MIT License

[BUG] OOM when clickhouse is slow and a lot of insert queries are sent #428

Open bzed opened 2 months ago

bzed commented 2 months ago

Describe the bug

We regularly see the following sequence of events:

ClickHouse is not as fast as expected due to high load (not nice, and there are things to optimize, but this can happen).
chproxy receives lots of inserts from applications without being able to forward them in time. It happily accepts hundreds of connections in parallel...
chproxy gets OOM killed.
chproxy is restarted, putting even more load on the ClickHouse server due to new connections...

To Reproduce

Make ClickHouse slow.
Put chproxy under load with lots of big parallel inserts.

Expected behavior

No OOM. Better memory handling. Cancel connections or make them wait before running out of memory.

Environment information

chproxy v1.26.2, ClickHouse 24.3.2.23

vfoucault commented 2 months ago

Hi there, this is a tricky issue. I don't really see a good solution here other than using a rate limiter for your inserter and handling the back pressure at the data producer level.

Another option would be to add way more memory to your chproxy, to bypass chproxy for data insertion, or to make ClickHouse faster 😅

No miracle would happen here.

bzed commented 2 months ago

Yes, indeed a tricky issue. Much stricter rate limiting in front of chproxy is in place now. But still, imho a program running into OOM is a bug :) Adding more resources will just move the point where the OOM happens. To solve this bug I think a completely different approach to memory management would be needed, but yes, it's not trivial, as not all connections need the same amount of memory.
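
As a rough illustration of the kind of memory management meant here, the following Go sketch admits requests against a shared byte budget instead of a connection count, so a large insert reserves more than a small one. Everything in it is an assumption for illustration: `byteBudget` and its methods are hypothetical, nothing like this exists in chproxy as far as this thread shows, and it presumes the proxy knows a request's size up front (for example from `Content-Length`), which is not always true.

```go
package main

import (
	"errors"
	"sync"
)

// errTooLarge is returned when a single request can never fit the budget.
var errTooLarge = errors.New("request larger than total memory budget")

// byteBudget is a hypothetical weighted admission gate: each request
// reserves the number of bytes it expects to hold in memory.
type byteBudget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	free  int64
	total int64
}

func newByteBudget(total int64) *byteBudget {
	b := &byteBudget{free: total, total: total}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks until n bytes are free ("let them wait").
func (b *byteBudget) acquire(n int64) error {
	if n > b.total {
		return errTooLarge
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.free < n {
		b.cond.Wait()
	}
	b.free -= n
	return nil
}

// tryAcquire never blocks; a false result means the caller should
// reject the request ("cancel connections"), e.g. with an HTTP 503.
func (b *byteBudget) tryAcquire(n int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if n > b.free {
		return false
	}
	b.free -= n
	return true
}

// release returns n bytes to the budget and wakes waiting requests.
func (b *byteBudget) release(n int64) {
	b.mu.Lock()
	b.free += n
	b.mu.Unlock()
	b.cond.Broadcast()
}
```

A handler would call `acquire` or `tryAcquire` before reading the body and `release` once the insert has been forwarded. Whether waiting or rejecting is the better policy depends on how the producers react to each.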

mga-chka commented 2 months ago

Unfortunately, we (Contentsquare) don't use chproxy to insert data. This feature was built by the previous maintainers (Vertamedia) and we don't maintain it anymore. If it were happening on select queries, we might do something (but from what I remember, the query results are either streamed or put in temporary files to avoid an OOM in such a situation). But since it's about insert queries, feel free to make a PR to fix the issue. As Vianney said, it will be tricky to solve, and you should use a rate limiter to make sure it can't happen, for example by using the max_concurrent_queries parameter.

Frank030366 commented 2 days ago

@mga-chka - I've experienced the same issue the author described: chproxy gets OOM killed under heavy INSERT load with large batches. So I ran some tests and can shed some light on the nature of this bug. It seems that the issue was introduced by changes in the 1.22.0 release, because 1.21.0 is stable in our environment but 1.22 is OOM killed ~10-20 seconds after the workload starts. At least two changes probably introduced this bug: #299 and #296. To test this, I built a custom binary from the 1.22 sources with those changes reverted, and it is stable under our load. But the original 1.22 binary and the latest version's binary are OOM killed.

One possible root cause: it may be inefficient to load every incoming request body for a possible retry, because the body can be very large for INSERT-like workloads.