Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse® clusters running on Kubernetes
https://altinity.com
Apache License 2.0

TCP Reset connection when replicating data due to full tcp_rmem buffer #748

Open ansou-naboty opened 3 years ago

ansou-naboty commented 3 years ago

Hi, my name is Ansou FALL and I'm working for opensee.io as a DevOps engineer. I'm running clickhouse-operator in AKS (Azure Kubernetes Service). We are facing a tricky issue in ClickHouse when data is replicated between the replicas of a shard. Here is the configuration we have:

clusterName: standard
ClickHouseInstallationName: statefulset
Shards: 8
Replicas: 2
Node count: 16

That is 16 nodes in total, each with 64 GB of RAM and 16 CPUs. We have 48 instances, and each of them inserts 500K rows every 1-2 seconds.

During ingestion, the tcp_rmem buffers fill up and the TCP connections used for replication are closed.
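For readers who want to check the receive buffer limits on a node, here is a minimal sketch. It assumes a Linux host where the setting is exposed at /proc/sys/net/ipv4/tcp_rmem as three integers (minimum, default and maximum size in bytes). The helper names and the sample value are hypothetical and are not taken from this cluster.

```python
from pathlib import Path
from typing import NamedTuple


class TcpRmem(NamedTuple):
    """The three tcp_rmem fields, all in bytes."""
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse the whitespace-separated 'min default max' triple."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return TcpRmem(*(int(field) for field in fields))


def read_tcp_rmem(path: str = "/proc/sys/net/ipv4/tcp_rmem") -> TcpRmem:
    """Read the current setting from procfs (Linux only)."""
    return parse_tcp_rmem(Path(path).read_text())


if __name__ == "__main__":
    # Illustrative value only, not measured on the cluster in this issue.
    sample = "4096 131072 6291456"
    limits = parse_tcp_rmem(sample)
    print(f"max receive buffer: {limits.max_bytes / 1024 / 1024:.1f} MiB")
```

Running `read_tcp_rmem()` inside a pod and on the node itself would show whether both see the same limits.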

The image below shows the RST flags that close the TCP connections.

Reset_all

This image shows the RST flags seen when replicating data between replicas in another shard.

Reset_interserver_9009

Reset_native_port

This image shows some of the TCP window full events during ingestion.

window_full_window_1

Slach commented 3 years ago

@ansou-naboty clickhouse-operator doesn't replicate data itself, so it looks like the issue is not related to clickhouse-operator but to ClickHouse itself.

Could you look at /var/log/clickhouse-server/clickhouse-server.err.log inside the relevant clickhouse-server pods for the time period covered by the network dump you shared?

ansou-naboty commented 3 years ago

I know that clickhouse-operator doesn't replicate data itself and that ClickHouse does. Should I post this issue on the ClickHouse GitHub instead?

Slach commented 3 years ago

@ansou-naboty do you see anything related to your RST in /var/log/clickhouse-server/clickhouse-server.err.log?

ansou-naboty commented 3 years ago

In /var/log/clickhouse-server/clickhouse-server.err.log, I see "connection reset by peer while reading or writing one socket xxxxxx". In tcpdump, it corresponds to a TCP window full condition on the read buffer.
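To line these log entries up with the capture, a rough sketch like the following could pull the matching lines out of the error log. It only does a case-insensitive substring match on "connection reset by peer". The exact wording of the ClickHouse log message is an assumption here, and the helper name and sample lines are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: the message contains this phrase; the match ignores case.
RESET_PATTERN = re.compile(r"connection reset by peer", re.IGNORECASE)


def find_resets(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a connection reset."""
    return [line.rstrip("\n") for line in lines if RESET_PATTERN.search(line)]


if __name__ == "__main__":
    # Path as quoted in this thread; run inside the clickhouse-server pod.
    log_path = "/var/log/clickhouse-server/clickhouse-server.err.log"
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for entry in find_resets(handle):
            print(entry)
```

The timestamps at the start of the printed lines can then be compared with the RST packets in the tcpdump output.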