It's crashing because your connection is breaking: 'Connection broken: IncompleteRead(134 bytes read)'. This could be caused by something like an idle timeout anywhere in your chain. How much time is processing taking per chunk?
The average processing time for a single block is ~10 seconds. The connection breaks around 4-5 minutes after the read starts. The server and the client are on the same machine.
After digging into this, I believe the problem is that the ClickHouse server times out while pushing more data because the client has not read the already-sent data off the socket. When trying to reproduce this I get the following error in the ClickHouse logs:
DynamicQueryHandler: Code: 209. DB::Exception: Timeout exceeded while writing to socket ([::1]:63309, 30000 ms): While executing Native. (SOCKET_TIMEOUT)
The socket is still busy/full when ClickHouse tries to send the error:
DynamicQueryHandler: Cannot send exception to client: Code: 209. DB::NetException: Timeout exceeded while writing to socket ([::1]:63309, 30000 ms). (SOCKET_TIMEOUT)
The end result is that if reading data falls more than 30 seconds behind ClickHouse sending the data, ClickHouse will close the connection, causing the error you see.
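For reference, a minimal sketch that reproduces this behaviour (the table name `big_table` and the connection parameters are placeholders, and the 10-second sleep stands in for real processing):

```python
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')  # placeholder connection

# Stream a large result and deliberately process each block slower than
# ClickHouse can send data. Once the client falls more than ~30 seconds
# behind, the server hits its socket write timeout, closes the connection,
# and the client raises "Connection broken: IncompleteRead(...)".
with client.query_row_block_stream('SELECT * FROM big_table') as stream:
    for block in stream:
        time.sleep(10)  # stand-in for slow per-block processing
```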
There's not an easy fix directly in clickhouse-connect, because the point of streaming is to avoid reading all of the data into memory at once -- but if your processing falls behind, the data sent from ClickHouse has no place to go. So the short-term answer to your problem is to only query as much data from ClickHouse as you can process without falling more than 30 seconds behind. Unfortunately, the HTTP interface in ClickHouse is not "smart" enough to keep the connection open and stream the data only as it is being consumed.
However, in the next release I'm looking at adding an intermediate buffer with a configurable size to temporarily store the HTTP data until requested by the stream processing. So if your total query size is something like 50MB, and the new intermediate buffer is configured at 100MB, you should not have this issue. But there will definitely be a tradeoff between using the additional memory and ensuring that your connection isn't closed while processing.
@genzgd thank you for the investigation. I understand the problem more precisely now. So, the short-term fix for me would be something like this:
```python
client = clickhouse_connect.get_client(
    ...,  # other connection parameters elided
    settings={"max_block_size": int(30 / seconds_per_row)}
)
```
max_block_size isn't going to reduce the total amount of data you're consuming, and you'll still fall behind as ClickHouse tries to push data faster than you can process it. You either need faster processing or to read less data in the same amount of time by using queries that return less data.
I see. Unfortunately, I don't see an easy way to make processing faster without going into buffering and multithreading, which IMO will probably cause more trouble in the future. So the only option I see for now is to save the data to disk and then read it from there. AFAIK ClickHouse doesn't have an analog of Postgres's server-side cursors...
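As a rough sketch of that save-to-disk idea (the table name, column names, and file path below are placeholders), the stream could be drained to a local CSV as fast as it arrives and processed from the file afterwards:

```python
import csv
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')  # placeholder connection

# Drain the stream to a local file as fast as the data arrives, so the
# client never falls behind the server; the slow processing then reads
# from disk at its own pace.
with open('dump.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    with client.query_row_block_stream('SELECT id, value FROM big_table') as stream:
        for block in stream:
            writer.writerows(block)
```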
Unfortunately, no, there is no server-side cursor. If you can break your query up into chunks based on the primary key, you could read each chunk into memory (using the plain client query method instead of streaming), process that chunk, and then fetch the next one. Some version of that is going to be necessary unless you do something to match your processing speed to the rate at which ClickHouse sends the data.
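A sketch of that chunked approach, assuming a monotonically increasing numeric key column `id` (the table name, column names, chunk size, and `process` function are all placeholders):

```python
import clickhouse_connect

def process(rows):
    """Placeholder for the slow per-chunk processing."""
    pass

client = clickhouse_connect.get_client(host='localhost')  # placeholder connection

chunk_size = 100_000  # keep chunks small enough to process well under 30 seconds
last_id = 0

while True:
    # Each chunk is read fully into memory with a plain query, so no HTTP
    # connection is held open while the chunk is being processed.
    result = client.query(
        'SELECT id, value FROM big_table WHERE id > {last_id:UInt64} '
        'ORDER BY id LIMIT {limit:UInt64}',
        parameters={'last_id': last_id, 'limit': chunk_size})
    rows = result.result_rows
    if not rows:
        break
    process(rows)
    last_id = rows[-1][0]  # advance the key cursor past the last row read
```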
Describe the bug
The streaming read crashes after some time if there is any processing between reads.
Steps to reproduce
Expected behaviour
I expect it to read the whole dataset. If I disable the processing, the dataset is read fine (40 million records). This leads me to think that the issue is not related to the actual data or the response, but to something inside the implementation.
Code example
clickhouse-connect and/or ClickHouse server logs
Configuration
Environment
ClickHouse server
CREATE TABLE statements for tables involved: