Closed igolka97 closed 7 months ago
We need to check if it will work with streaming; IIRC, we have no VALUES clause because of that.
@igolka97, can you please provide samples of your code that you use as a workaround and describe what you are trying to achieve with more details? I am trying to understand the use-case better. Maybe it could be solved without modifications to the client, and in that case, we will just add a doc entry and an example.
Regarding the "collectors" of insert queries: does async insert (maybe with tweaked async_insert_max_data_size and async_insert_busy_timeout_ms) solve your problem without using an external "insert collector"?
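For illustration, a minimal sketch of what tweaking those settings could look like with `@clickhouse/client` (the values, table name, and `rows` variable are placeholders, not recommendations):

```typescript
// Sketch: server-side async insert settings for use with @clickhouse/client.
// The concrete values are illustrative; tune them for your workload.
const asyncInsertSettings = {
  async_insert: 1,                       // buffer inserts on the server side
  wait_for_async_insert: 0,              // acknowledge before the buffer is flushed
  async_insert_max_data_size: '1000000', // flush the buffer after ~1 MB...
  async_insert_busy_timeout_ms: 1000,    // ...or after 1 second, whichever comes first
};

// Usage (assuming a configured client instance and an array of row objects):
// await client.insert({
//   table: 'my_table',
//   values: rows,
//   format: 'JSONEachRow',
//   clickhouse_settings: asyncInsertSettings,
// });
```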
> @igolka97, can you please provide samples of your code that you use as a workaround and describe what you are trying to achieve with more details? I am trying to understand the use-case better. Maybe it could be solved without modifications to the client, and in that case, we will just add a doc entry and an example.
```js
const keys = Object.keys(batch[0]);
const values = batch
  .map((obj) => `(${keys.map((key) => this.formatSQLValue(obj[key])).join(', ')})`)
  .join(', ');
const queryCommand = `INSERT INTO ${this.tableName} (${keys.join(', ')}) VALUES ${values}`;
return this.clickhouseClient.exec({
  query: queryCommand,
});
```
> Regarding the "collectors" of insert queries: does async insert (maybe with tweaked async_insert_max_data_size and async_insert_busy_timeout_ms) solve your problem without using an external "insert collector"?
I have already considered this option, but such "collectors" can buffer data to disk if one of the shards is unavailable, which is a big advantage.
I largely share your view that we should find a solution without changing the client code, and I will be very happy if we can find one.
Perhaps, in an ideal world, the async_insert "component" would be a separate piece, either bundled with ClickHouse or standalone like clickhouse-keeper, with the ability to buffer raw data to disk when one of the shards is unavailable. This seems like an interesting thought. Something similar can probably be achieved today by running ClickHouse exclusively with Distributed engine tables, but that approach looks ugly. I would be very interested in your opinion on this; perhaps I am completely wrong.
@mshustov, wdyt?
> but such "collectors" can buffer data to disk if one of the shards is unavailable
It opens the door to many other problems: lack of redundancy in the setup, a single point of failure in the "collector", the need to maintain and monitor third-party components, etc. I think you should consider a proper replication setup + async_insert, as recommended above.
If you want to combine queue and storage functionality in a 3rd party component, consider using Kafka or other alternatives.
@slvrtrn @mshustov
I migrated to async_insert with chproxy in front of the cluster, using the JSONEachRow insert format. Each shard now consumes considerably more CPU: about 35-50% instead of the 2-10% before. I am still pre-batching data on the client replicas. Any thoughts?
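(For context: JSONEachRow is just newline-delimited JSON, one object per row. A minimal sketch of the serialization, which the client performs internally when you pass `format: 'JSONEachRow'`:)

```typescript
// Sketch: serialize a batch of row objects to the JSONEachRow (NDJSON) wire
// format: one JSON object per line, separated by newlines.
function toJSONEachRow(rows: Record<string, unknown>[]): string {
  return rows.map((row) => JSON.stringify(row)).join('\n');
}
```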
The community would greatly benefit if we ensured that the async insert works for all use cases instead of implementing a workaround for each ecosystem. @igolka97, could you help us with the investigation? We'd appreciate if you could create an issue in the ClickHouse repository (or add more details to https://github.com/ClickHouse/ClickHouse/issues/50113). What additional information might be needed:
I appreciate your cooperation!
Use case
There are many collectors of insert queries (such as KittenHouse or clickhouse-bulk) that work properly only with the VALUES format of inserts.
Describe the solution you'd like
Add a `Values` format to the `format` option of insert queries.
Describe the alternatives you've considered
Right now I am using the `.exec` method and generating the insert string myself, but `exec` exists for other purposes, not for inserts.
I am going to look into the source code to introduce this feature myself, but no promises.