apla / node-clickhouse

Yandex ClickHouse driver for nodejs
MIT License

Stream large data #65

Open calebeaires opened 3 years ago

calebeaires commented 3 years ago

A topic open to the community of this module

This topic is just for a discussion about the stream flow, so that we users can better understand how this plugin handles streaming with ClickHouse.

Consider an amount of data of about 10 GB. Creating a read stream and then streaming it is quite easy with node-clickhouse. About that:

  1. Can I consider that this flow does not overload the ClickHouse database?
  2. Looking at the documentation, what is the difference between the following two approaches, and is there a difference in terms of performance?

A. Insert with stream

const writableStream = ch.query('INSERT INTO table FORMAT CSV', (err, result) => {})

B. Insert large data (without callback)

const fs = require('fs')
const tsvStream = fs.createReadStream('data.tsv') // any readable TSV source
const clickhouseStream = ch.query('INSERT INTO table FORMAT TSV')
tsvStream.pipe(clickhouseStream)
  3. I read the ClickHouse docs, and this setting makes things go right when it is set well. How can I use insert_quorum to make stream writes faster, considering a single server (without replicas)?

  4. With the node-clickhouse write stream, do I have to make my code take care of garbage collection itself, i.e. must I make use of pause/resume/drain? (See the sketch below.)
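
A minimal sketch of the streaming insert end to end, assuming the package is required as @apla/clickhouse and a local server on port 8123 (both assumptions, not taken from this issue). Node's pipeline() (like .pipe()) propagates backpressure on its own, so the file read stream is paused while ClickHouse is slow to consume, and no manual pause/resume/drain handling is needed:

const fs = require('fs')
const { pipeline } = require('stream')
const ClickHouse = require('@apla/clickhouse')

const ch = new ClickHouse({ host: 'localhost', port: 8123 })

// As in example B above, ch.query('INSERT ... FORMAT TSV') returns a writable stream.
const clickhouseStream = ch.query('INSERT INTO table FORMAT TSV', (err) => {
  if (err) console.error('insert failed:', err)
})

// pipeline() handles backpressure and cleanup for us: the read stream is
// paused whenever clickhouseStream's internal buffer is full.
pipeline(
  fs.createReadStream('data.tsv'),
  clickhouseStream,
  (err) => {
    if (err) console.error('stream failed:', err)
    else console.log('insert finished')
  }
)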

KrishnaPG commented 3 years ago

For large files, the streams should also support failure handling and pause/resume options in case of connection problems or other network errors. It is not clear whether this package handles those checkpoints.
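
A hedged sketch of one way to handle such failures (the insertWithRetry wrapper and the file path are illustrative, not part of this package): replay the whole file when the pipeline errors out. Note that a blind replay can insert duplicate rows unless the target table deduplicates, so a real checkpoint would track how far the previous attempt got.

const fs = require('fs')
const { pipeline } = require('stream')

// Hypothetical retry wrapper around the streaming insert.
function insertWithRetry (ch, filePath, attemptsLeft) {
  const sink = ch.query('INSERT INTO table FORMAT TSV')
  pipeline(fs.createReadStream(filePath), sink, (err) => {
    if (!err) return console.log('insert finished')
    if (attemptsLeft > 0) {
      console.warn('insert failed, retrying:', err.message)
      insertWithRetry(ch, filePath, attemptsLeft - 1)
    } else {
      console.error('giving up after repeated failures:', err)
    }
  })
}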

yi commented 3 years ago

In real-world production, I found it fragile to hold a write stream open for large or long-running insertions. So I wrote a wrapper based on @apla/node-clickhouse that supports: 1. failure retry, 2. restoring data segments after a process crash, and 3. a single write process in Node cluster mode. Hope it will be helpful: https://www.npmjs.com/package/clickhouse-cargo

KrishnaPG commented 3 years ago

Thank you @yi. A quick look at your package looks great. Will try to switch to it.