ClickHouse / clickhouse-java

ClickHouse Java Clients & JDBC Driver
https://clickhouse.com
Apache License 2.0
1.45k stars 537 forks source link

ResultSet executeQuery(Map<ClickHouseQueryParam, String> additionalDBParams, List<ClickHouseExternalData> externalData) throws SQLException; #1370

Closed rhy429081492 closed 1 year ago

rhy429081492 commented 1 year ago

Describe the bug

I occasionally report an error when I use this function in JDBC to add external data in batches. Each error is caused by the format of the data sample. The sample position is different each time the error is reported. When I reduce the amount of data, it works. I have tried to use Wireshark to capture packets. It is found that the request data packets received by the server when the request fails are the same as those received by the server when the request succeeds.

Code example

try (PreparedStatement statement = prepareStatement(dataSource, sql)) { List useData = new ArrayList<>(); for (int i = 0; i < externalData.size(); i++) { ClickHouseExternalData tmpData = externalData.get(i); useData.add(new ClickHouseExternalData(tmpData.getName(), new ByteArrayInputStream(input.get(i).getBytes(StandardCharsets.UTF_8))) .withStructure(tmpData.getStructure()).withFormat(tmpData.getFormat())); } ClickHousePreparedStatement clickHousePreparedStatement = (ClickHousePreparedStatement) statement; try (ResultSet rs = clickHousePreparedStatement.executeQuery(null, useData)){ LOGGER.debug("ExecuteShardQuery: {},Use DataSource Url: [{}]", exeMethod, dataSource.getAllClickhouseUrls()); return callBack.call(rs); } }

Error log

First error log 2023.05.18 10:19:55.190332 [ 27874 ] {} DynamicQueryHandler: Code: 27, e.displayText() = DB::ParsingException: Cannot parse input: expected '\t' before: '\nsi\tslot1202\nsi\tslot1203\nsi\tslot1204\nsi\tslot1205\nsi\tslot1206\nsi\tslot1207\nsi\tslot1208\nsi\tslot1209\nsi\tslot1210\nsi\tslot1211\nsi\tslot1212\nsi\tslot1213\nsi\tslot1214\nsi\t': Row 1203: Column 0, name: tagname, type: String, parsed text: "si" Column 1, name: tagvalue, type: String, parsed text: "slot1201"

Row 1204: Column 0, name: tagname, type: String, parsed text: "201" ERROR: Line feed found where tab is expected. It's like your file has less columns than expected. And if your file have right number of columns, maybe it have unescaped backslash in value before tab, which cause tab has escaped.

: While executing SourceFromInputStream: (at row 1204) , Stack trace (when copying this message, always include the lines below):

  1. DB::throwAtAssertionFailed(char const*, DB::ReadBuffer&) @ 0x8667d25 in /usr/bin/clickhouse
  2. DB::TabSeparatedRowInputFormat::readRow(std::__1::vector<COW::mutable_ptr, std::__1::allocator<COW::mutable_ptr > >&, DB::RowReadExtension&) @ 0xfa43b52 in /usr/bin/clickhouse
  3. DB::IRowInputFormat::generate() @ 0xf970c58 in /usr/bin/clickhouse
  4. DB::ISource::tryGenerate() @ 0xf8ff205 in /usr/bin/clickhouse
  5. DB::ISource::work() @ 0xf8fedfa in /usr/bin/clickhouse
  6. DB::ParallelParsingInputFormat::InternalParser::getChunk() @ 0xf9c7aae in /usr/bin/clickhouse
  7. DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr, unsigned long) @ 0xf9c710e in /usr/bin/clickhouse
  8. ThreadPoolImpl::worker(std::1::list_iterator<ThreadFromGlobalPool, void*>) @ 0x8654088 in /usr/bin/clickhouse
  9. ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl::scheduleImpl(std::1::function<void ()>, int, std::1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::1::function<void ()>, int, std::1::optional)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x865603f in /usr/bin/clickhouse
  10. ThreadPoolImpl::worker(std::1::list_iterator<std::__1::thread, void*>) @ 0x865165f in /usr/bin/clickhouse
  11. ? @ 0x86550f3 in /usr/bin/clickhouse
  12. start_thread @ 0x7dc5 in /usr/lib64/libpthread-2.17.so
  13. __clone @ 0xf9d1d in /usr/lib64/libc-2.17.so (version 21.3.4.25 (official build))

Second error log 2023.05.18 10:26:41.058246 [ 2668 ] {} void DB::ParallelParsingInputFormat::onBackgroundException(size_t): Code: 27, e.displayText() = DB::ParsingException: Cannot parse input: expected ',' before: '\nsi,slot1706\nsi,slot1707\nsi,slot1708\nsi,slot1709\nsi,slot1710\nsi,slot1711\nsi,slot1712\nsi,slot1713\nsi,slot1714\nsi,slot1715\nsi,slot1716\nsi,slot1717\nsi,slot1718\nsi,': Row 1707: Column 0, name: tagname, type: String, parsed text: "si" Column 1, name: tagvalue, type: String, parsed text: "slot1705"

Row 1708: Column 0, name: tagname, type: String, parsed text: "5" ERROR: Line feed found where delimiter (,) is expected. It's like your file has less columns than expected. And if your file have right number of columns, maybe it have unescaped quotes in values.

: (at row 1708) , Stack trace (when copying this message, always include the lines below):

  1. DB::throwAtAssertionFailed(char const*, DB::ReadBuffer&) @ 0x8667d25 in /usr/bin/clickhouse
  2. ? @ 0xf991adb in /usr/bin/clickhouse
  3. DB::CSVRowInputFormat::readRow(std::__1::vector<COW::mutable_ptr, std::__1::allocator<COW::mutable_ptr > >&, DB::RowReadExtension&) @ 0xf991c2e in /usr/bin/clickhouse
  4. DB::IRowInputFormat::generate() @ 0xf970c58 in /usr/bin/clickhouse
  5. DB::ISource::tryGenerate() @ 0xf8ff205 in /usr/bin/clickhouse
  6. DB::ISource::work() @ 0xf8fedfa in /usr/bin/clickhouse
  7. DB::ParallelParsingInputFormat::InternalParser::getChunk() @ 0xf9c7aae in /usr/bin/clickhouse
  8. DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr, unsigned long) @ 0xf9c710e in /usr/bin/clickhouse
  9. ThreadPoolImpl::worker(std::1::list_iterator<ThreadFromGlobalPool, void*>) @ 0x8654088 in /usr/bin/clickhouse
  10. ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl::scheduleImpl(std::1::function<void ()>, int, std::1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::1::function<void ()>, int, std::1::optional)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x865603f in /usr/bin/clickhouse
  11. ThreadPoolImpl::worker(std::1::list_iterator<std::__1::thread, void*>) @ 0x865165f in /usr/bin/clickhouse
  12. ? @ 0x86550f3 in /usr/bin/clickhouse
  13. start_thread @ 0x7dc5 in /usr/lib64/libpthread-2.17.so
  14. __clone @ 0xf9d1d in /usr/lib64/libc-2.17.so (version 21.3.4.25 (official build))

Configuration

Environment

ClickHouse server

zhicwu commented 1 year ago

Hi @rhy429081492, have you validated data captured by Wireshack? Taking the first error log as an example, did you see si\tslot1201\n201\nsi\tslot1202 in Wireshack? Is that same as in the input.get(i)? If you're certain that you passed valid data to the JDBC driver, then it could be a bug.

Lastly, please pay attention that you're using an obsoleted driver which is no longer supported. I'd suggest you upgrade to a more recent version like 0.4.6.

rhy429081492 commented 1 year ago

Hi @rhy429081492, have you validated data captured by Wireshack? Taking the first error log as an example, did you see si\tslot1201\n201\nsi\tslot1202 in Wireshack? Is that same as in the input.get(i)? If you're certain that you passed valid data to the JDBC driver, then it could be a bug.

Lastly, please pay attention that you're using an obsoleted driver which is no longer supported. I'd suggest you upgrade to a more recent version like 0.4.6. I confirm that the packets captured by Wireshark contain this information. The data is valid. This problem occurs only when the data volume increases. Initially, my main purpose is to solve the problem that the latency increases due to excessive data after the in statement. In the latest version of the JDBC client, the interface that I use is no longer available. Use the new interface to solve this problem by creating temporary tables. However, the way temporary tables are created is not ideal for performance optimization. Does not perform as well as the interface I raised the issue with. When creating a temporary table, the new interface communicates with the server for multiple times. The packet of the new interface is completely different from that of the interface used by me. I'd rather do something similar to this in clickhouse docs. https://clickhouse.com/docs/en/engines/table-engines/special/external-data

zhicwu commented 1 year ago

Use the new interface to solve this problem by creating temporary tables. However, the way temporary tables are created is not ideal for performance optimization. Does not perform as well as the interface I raised the issue with.

Are you referring to the following example? The name might indeed be somewhat confusing. Internally, it utilizes an external table within the same http request. Moreover, you can invoke the ClickHouseExternalData.asTempTable() function if you prefer to employ a temporary table instead of an external table. This can be particularly useful for handling complex data types, as mentioned in ClickHouse/ClickHouse#32171.

https://github.com/ClickHouse/clickhouse-java/blob/c25ffc62c886c0cdc1fcf8d8bf2234690c131dca/examples/jdbc/src/main/java/com/clickhouse/examples/jdbc/Advanced.java#L63-L83

In conclusion, if your application demands high performance, I strongly recommend upgrading to 0.3.2-patch11(use com.clickhouse.jdbc.ClickHouseDriver since both old and new drivers co-exist in 0.3.2). This version offers a significant speed increase compared to its predecessors, as detailed in #768. Additionally, version 0.4 doubles the performance of nested type serialization. Furthermore, you might consider utilizing the Java client, which provides a lower-level API with minimal overhead, typically resulting in superior performance.