ClickHouse / clickhouse-connect

Python driver/sqlalchemy/superset connectors
Apache License 2.0

Error decoding buffer data when querying a large amount of data #347

Closed thanhdanh1803 closed 4 weeks ago

thanhdanh1803 commented 5 months ago

Describe the bug

When querying a large amount of data using client.query_df, I got this error:

 337 def query_df(self,
    338              query: Optional[str] = None,
    339              parameters: Optional[Union[Sequence, Dict[str, Any]]] = None,
   (...)
    350              external_data: Optional[ExternalData] = None,
    351              use_extended_dtypes: Optional[bool] = None):
    352     """
    353     Query method that results the results as a pandas dataframe.  For parameter values, see the
    354     create_query_context method
    355     :return: Pandas dataframe representing the result set
    356     """
--> 357     return self._context_query(locals(), use_numpy=True, as_pandas=True).df_result

File ~/.local/lib/python3.8/site-packages/clickhouse_connect/driver/client.py:781, in Client._context_query(self, lcls, **overrides)
    779 kwargs.pop('self')
    780 kwargs.update(overrides)
--> 781 return self._query_with_context((self.create_query_context(**kwargs)))

File ~/.local/lib/python3.8/site-packages/clickhouse_connect/driver/httpclient.py:223, in HttpClient._query_with_context(self, context)
    221 byte_source = RespBuffCls(ResponseSource(response))  # pylint: disable=not-callable
    222 context.set_response_tz(self._check_tz_change(response.headers.get('X-ClickHouse-Timezone')))
--> 223 query_result = self._transform.parse_response(byte_source, context)
    224 query_result.summary = self._summary(response)
    225 return query_result

File ~/.local/lib/python3.8/site-packages/clickhouse_connect/driver/transform.py:69, in NativeTransform.parse_response(source, context)
     66     block_num += 1
     67     return result_block
---> 69 first_block = get_block()
     70 if first_block is None:
     71     return NumpyResult() if context.use_numpy else QueryResult([])

File ~/.local/lib/python3.8/site-packages/clickhouse_connect/driver/transform.py:40, in NativeTransform.parse_response.<locals>.get_block()
     37 for col_num in range(num_cols):
     39     name = source.read_leb128_str()
---> 40     type_name = source.read_leb128_str()
     41     if block_num == 0:
     42         names.append(name)

File clickhouse_connect/driverc/buffer.pyx:200, in clickhouse_connect.driverc.buffer.ResponseBuffer.read_leb128_str()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 1: invalid start byte

Using the same query but with a smaller number of rows, it works properly.

When printing out the first 50 bytes in the buffer, I got:

I found that there were some additional bytes at the beginning of the data that the library does not handle, which might be causing this error.
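
For context, this is roughly the kind of call that triggers the error -- a minimal sketch in which the host, credentials, and table name are placeholders rather than the real ones from this report:

import clickhouse_connect

# Placeholder connection details.
client = clickhouse_connect.get_client(
    host='xxx',
    port=8123,
    username='xxx',
    password='xxx',
    compress='lz4',  # lz4 is the default when compression is enabled
)

# Works for smaller result sets; fails with the UnicodeDecodeError above on the very large one.
df = client.query_df('SELECT * FROM my_large_table')
print(df.shape)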

Configuration

Environment

ClickHouse server

genzgd commented 5 months ago

Thanks for the detailed report. How large is the dataset you're testing with? Is it possible to query the same data on an older ClickHouse version? That initial binary response definitely looks too long for what I'd expect for the native format.

genzgd commented 5 months ago

Also, do you get the same error when just calling query instead of query_df?

thanhdanh1803 commented 5 months ago

Thanks for the detailed report. How large is the dataset you're testing with? Is it possible to query the same data on an older ClickHouse version? That initial binary response definitely looks too long for what I'd expect for the native format.

Hi Geoff, the large dataset has over 17M rows and 116 columns; the smaller dataset has around 700k rows. And I get the same error when calling client.query. We ran this query before with ClickHouse server version 23.2.5.64 and clickhouse-connect version 0.5.24 and it worked correctly. Hope you have more information to fix this issue. Thanks

genzgd commented 5 months ago

Thanks for the additional information. I'm guessing this is some change in Native format related to the very large dataset and the later ClickHouse version, so I'll check with the main ClickHouse team to see if they can help track it down.

genzgd commented 5 months ago

Can you try one more thing? I want to ensure that the error is not compression related, so if you create the client with compression: None, does it still fail?

genzgd commented 5 months ago

You can also try compression: zstd (lz4 is the default if nothing else is specified).
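
In clickhouse-connect terms, those two suggestions look roughly like this (a sketch; the host is a placeholder):

import clickhouse_connect

# Disable response compression entirely.
plain_client = clickhouse_connect.get_client(host='xxx', compress=False)

# Or force zstd instead of the default lz4.
zstd_client = clickhouse_connect.get_client(host='xxx', compress='zstd')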

thanhdanh1803 commented 5 months ago

Hi Geoff, thanks for your suggestions. Due to the above error, we are using no-compression mode; it works well, but without compression the server consumes more CPU than with compression enabled. With zstd, I got the same error. Hope the above information helps you figure out what the problem is.

genzgd commented 5 months ago

I have not been able to reproduce this locally despite very large datasets. There was a significant refactor of the ClickHouse HTTP server code a few months ago, so that could be related, but without a reproducible example the core ClickHouse team has not been able to provide any insight.

What compression library versions do you have installed? In my local tests I'm using lz4 4.3.3 (https://github.com/python-lz4/python-lz4) and zstandard 0.22.0. I use a Mac with an ARM processor.
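
A quick way to report the installed versions of those libraries, assuming the PyPI package names lz4 and zstandard:

from importlib.metadata import PackageNotFoundError, version

# Print the installed compression library versions, if present.
for pkg in ('lz4', 'zstandard'):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, 'not installed')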

Is it possible for you to test the query on a newer/different operating system and/or Python version? Given that it seems to be compression related, this could be an issue in an underlying compression library.

Unfortunately I'm otherwise out of ideas for debugging the problem.

thanhdanh1803 commented 5 months ago

Thanks for your additional information. I have debugged deeper into the response message. I found that the client checks the "content-encoding" header to select the decompression type, but my response header did not contain this key. This is my response header:

HTTPHeaderDict({'Date': 'Wed, 22 May 2024 07:33:00 GMT', 'Connection': 'Keep-Alive', 'Content-Type': 'application/octet-stream', 'X-ClickHouse-Server-Display-Name': 'xxx', 'Transfer-Encoding': 'chunked', 'X-ClickHouse-Query-Id': '6956337e-7da9-44f5-a0a9-3b25f7dcfde5', 'X-ClickHouse-Format': 'Native', 'X-ClickHouse-Timezone': 'Asia/Ho_Chi_Minh', 'Keep-Alive': 'timeout=3', 'X-ClickHouse-Progress': '{"read_rows":"813667693","read_bytes":"335632151771","written_rows":"0","written_bytes":"0","total_rows_to_read":"1817404885","result_rows":"0","result_bytes":"0","elapsed_ns":"120000918848"}, {"read_rows":"1676029161","read_bytes":"691891474011","written_rows":"0","written_bytes":"0","total_rows_to_read":"1817404885","result_rows":"0","result_bytes":"0","elapsed_ns":"240001289418"}', 'X-ClickHouse-Summary': '{"read_rows":"1817404885","read_bytes":"750221547755","written_rows":"0","written_bytes":"0","total_rows_to_read":"1817404885","result_rows":"17370406","result_bytes":"9343901668","elapsed_ns":"298767677461"}'})

Because this header was missing, the raw (still compressed) data in the response was passed straight through to the parse_response transform function and the error appeared.
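
A minimal sketch of that kind of manual check, assuming python-lz4 is installed: an LZ4 frame starts with the magic number 0x184D2204, so a compressed payload can be detected and decompressed by hand even when the Content-Encoding header is missing.

import lz4.frame

LZ4_FRAME_MAGIC = b'\x04"M\x18'  # 0x184D2204, little-endian

def decode_body(raw: bytes) -> bytes:
    # Decompress the HTTP body by hand if it looks like an LZ4 frame.
    if raw[:4] == LZ4_FRAME_MAGIC:
        return lz4.frame.decompress(raw)
    return raw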

If I manually add the compression header to the response, I get the expected data.

This is the first 50 bytes of the raw response:

b'\x04"M\x18@Pw\xb1$\x02\x00\xf2 \x01\x00\x02\xff\xff\xff\xff\x00t\xbf\xf4\x03\x11serving_timestamp\x06UInt64'

This is the first 40 bytes after calling the lz4 decompress function (the same as the data returned without compression):

b'\x01\x00\x02\xff\xff\xff\xff\x00t\xbf\xf4\x03\x11serving_timestamp\x06UInt64#\x98\xe5r\x8f\x01\x00\x00\xcc\x99\xe5r\x8f'

Do you have any idea about this error? I think it may be related to the response format from the ClickHouse server. This is how I initialize my clickhouse client:

client = get_client(
    host="xxx",
    port="8123",
    database="xxx",
    username="xxx",
    password="xxx",
    compress="lz4",
)

This is the compression setting on the server:

<!--
 Default transport compression type (can be overridden by the client; see the transport_compression_type field in QueryInfo).
 Supported algorithms: none, deflate, gzip, stream_gzip
-->
<transport_compression_type>none</transport_compression_type>

thanhdanh1803 commented 5 months ago

Oh, I think I have figured out the problem: enable_http_compression is missing on the server, which is why the client never adds Accept-Encoding to the request header. Thanks for your support. But the next question is why the server still compresses the returned data when the dataset is very large :D that is so weird.

genzgd commented 5 months ago

Again thanks for the detailed investigation.

If I understand this right, you are getting compressed data even when there is no Accept-Encoding header in the request and no Content-Encoding header in the response? That would be a significant ClickHouse bug.

Regarding the server "missing enable_http_compression": clickhouse-connect tries to send that setting with every request if compression is enabled. Does your user not have permission to change that setting?
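
One way to check that, sketched with placeholder credentials: connect as the same user the failing queries run as, send the setting explicitly, and look at system.settings to see whether it is applied or marked readonly.

import clickhouse_connect

# Placeholder connection details; use the same user as the failing queries.
client = clickhouse_connect.get_client(
    host='xxx',
    username='xxx',
    password='xxx',
    compress='lz4',
    settings={'enable_http_compression': 1},
)

print(client.command(
    "SELECT value, readonly FROM system.settings WHERE name = 'enable_http_compression'"
))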

thanhdanh1803 commented 5 months ago

Again thanks for the detailed investigation.

If I understand this right, you are getting compressed data even when there is no Accept-Encoding header in the request and no Content-Encoding header in the response? That would be a significant ClickHouse bug.

-> That's true. Because enable_http_compression is missing from the server config, Accept-Encoding never appears in the request header.

Regarding the server "missing enable_http_compression": clickhouse-connect tries to send that setting with every request if compression is enabled. Does your user not have permission to change that setting?

-> This was our oversight when setting up the new server. I'm asking the DevOps team to enable that config and will update with the result as soon as it is set.

genzgd commented 5 months ago

I'm still confused by one thing -- you said this worked if you used "no-compression" mode. What did you mean by that? If you set compression=None in the client that should be identical to the situation where enable_http_compression is disabled.

thanhdanh1803 commented 5 months ago

I'm still confused by one thing -- you said this worked if you used "no-compression" mode. What did you mean by that? If you set compression=None in the client that should be identical to the situation where enable_http_compression is disabled.

I set the compress param to False in this case. I will clarify what I have tested: without enable_http_compression in the server config

genzgd commented 5 months ago

Thanks -- Just to clarify, was Accept-Encoding specified in the request header for case 2? "Client with compress='lz4' and enable_http_compression set to 1 in the server config"?

The last case is pretty clearly a ClickHouse bug but ideally before opening an issue I would have a reproducible example.

thanhdanh1803 commented 5 months ago

Thanks -- Just to clarify, was Accept-Encoding specified in the request header for case 2? "Client with compress='lz4' and enable_http_compression set to 1 in the server config"?

The last case is pretty clearly a ClickHouse bug but ideally before opening an issue I would have a reproducible example.

Yes, I added a debugger breakpoint at that line and confirmed that Accept-Encoding was lz4.

genzgd commented 5 months ago

Okay, I'll try to create a reproducible example. I could not do that on a large dataset against 24.3.2 using a fairly simple query -- are there any special or complex columns or calculations in the "large" query that breaks it? In any case I'll try some different combinations over the weekend, and maybe look into the new ClickHouse C++ HTTP code to determine why the Content-Encoding header is being missed.

den-crane commented 4 months ago

try to disable send_progress_in_http_headers
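
In clickhouse-connect that setting can be disabled per client, for example (a sketch with a placeholder host):

import clickhouse_connect

# Send send_progress_in_http_headers=0 with every query from this client.
client = clickhouse_connect.get_client(
    host='xxx',
    compress='lz4',
    settings={'send_progress_in_http_headers': 0},
)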

genzgd commented 4 weeks ago

This has been fixed in the main ClickHouse project. See the discussion in the issue Denny referenced, https://github.com/ClickHouse/ClickHouse/issues/64802, and the recent PR https://github.com/ClickHouse/ClickHouse/pull/68975.

praveenNference commented 2 weeks ago

@thanhdanh1803 Try updating your clickhouse-connect version to 0.8.2. We had the same issue up until 0.8.1. This worked for me when I used 0.8.2.

There is no need to disable compress; it can remain at the default (enabled).

P.S. This is a relatively newer release. Release notes: link

@den-crane send_progress_in_http_headers is by default disabled. (#Ref)
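
To confirm which clickhouse-connect version is actually installed, a quick standard-library check:

from importlib.metadata import version

# Expect 0.8.2 or later.
print(version('clickhouse-connect'))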

genzgd commented 2 weeks ago

@praveenNference Thanks for reporting this! I didn't think about this issue in connection with the fix in 0.8.2, but I could see how it could be related.