Hello, thanks for reaching out.
Are you experiencing a memory increase in your program or the ClickHouse server?
When we were doing internal testing for clickhouse-js, we ran something like this for days, and our Node memory consumption stayed the same (and quite low).
Maybe this example will help in your case.
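Roughly, it looks like this (a simplified sketch rather than the exact test script; the table name and row shape are illustrative):

import { Readable } from 'stream';
import { createClient } from '@clickhouse/client';

const client = createClient(); // connection settings omitted

// start the insert first so the stream already has a consumer attached
const stream = new Readable({ objectMode: true, read: () => { } });
const insertPromise = client.insert({
  table: 'test', // illustrative table name
  values: stream,
  format: 'JSONEachRow',
});

// push small batches periodically; in our runs this kept going for days
// with flat Node memory usage
let id = 0;
const timer = setInterval(() => {
  for (let i = 0; i < 1000; i++) {
    stream.push({ id: id++, value: Math.random() });
  }
}, 100);

// to finish: stop the producer, signal end of input, await the insert
// clearInterval(timer); stream.push(null); await insertPromise;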
Thanks for the reply and the code.
The memory increase is in my program, not on the ClickHouse server. I looked into the code, and the only difference is:
// Mine
const stream = new Readable({ objectMode: true, read: () => { } });
// read records from local graph database
read().then(record => stream.push(record)).then(() => stream.push(null));
await client.insert({ table: 'test', values: stream, format: 'JSONEachRow' });
// Yours
const stream = new Readable({ objectMode: true, read: () => { } });
const insertPromise = client.insert({ table: 'test', values: stream, format: 'JSONEachRow' });
// read records from local graph database
read().then(record => stream.push(record)).then(() => stream.push(null));
await insertPromise;
So you called client.insert before pushing objects into the stream, and then awaited the insert at the end. But I think the two are effectively the same; I will try your way and see the result.
After changing the order of the statements, the records are inserted in batches of about 1.1 million.
But there is definitely a memory leak in my code and I cannot find out where, so I will take a heap dump to inspect my code.
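For reference, I plan to capture the dumps with Node's built-in v8 module, along the lines of this minimal sketch (the delays are just examples):

import { writeHeapSnapshot } from 'v8';

// write two .heapsnapshot files a few seconds apart while the insert runs,
// then compare object counts in Chrome DevTools (Memory tab)
setTimeout(() => writeHeapSnapshot(), 3000);
setTimeout(() => writeHeapSnapshot(), 10000);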
Please share the results; if it's in the driver, we will fix it.
const stream = new Readable({ objectMode: true, read: () => { } });
runQueryStream(`MATCH (n:Repo) RETURN n`, async node => {
  const repo_id = parseInt(node.n.id);
  const repo_name = node.n.name;
  const org_id = parseInt(node.n.org_id);
  const org_login = node.n.org_login;
  for (const p of Object.keys(node.n)) {
    if (p.startsWith('open_rank_')) {
      const time = p.substring('open_rank_'.length);
      const openrank = parseFloat(node.n[p]);
      const year = parseInt(time.substring(0, 4));
      const month = parseInt(time.substring(4));
      const created_at = `${year}-${month.toString().padStart(2, '0')}-01`;
      stream.push([repo_id, repo_name, org_id, org_login, created_at, openrank]);
    }
  }
}).then(() => stream.push(null));
await client.insert({
  table: 'repo_openrank',
  values: stream,
  format: 'JSONCompactEachRow',
});
stream.destroy();
Here is my code. The runQueryStream function wraps the Neo4j graph database driver: it runs a query and consumes the results through the callback function.
I cannot find where the leak is, but memory keeps increasing until OOM, even when I give 8GB to my script.
I dumped the memory heap twice, at 3 seconds and 10 seconds after the insertion started (screenshots of the two heap snapshots).
The object count grows from 428650 in the first snapshot to 1610822 in the second, so I think the leak is here. The retained objects (array objects rather than single variables or strings) are what I pushed into the stream, so I think the stream somehow holds on to them. But I looked into the client code and still cannot find anywhere that could hold a reference to the objects.
More information: I am not very familiar with memory heap analysis, so here is the full 10s heap snapshot file: https://blog.frankzhao.cn/Heap.20221209.215843.18916.0.002.heapsnapshot.zip . You can download it if you want to inspect it further.
@frank-zsy thanks for the example. Is it possible to reduce the code into a single file with a minimal reproducible scenario without external libraries/calls? It will greatly help with the investigation on our end.
@slvrtrn I wrote an example and I think I know what the problem is: the input records cannot be consumed in time, and that is what causes the memory increase.
If I push 10,000 records every 1 millisecond, the memory quickly exceeds 2GB, but if I push 1,000 records every 10 milliseconds, everything works fine. Records are inserted into the server about 1 million at a time, and the memory usage drops after each batch insertion. So I think there is no memory leak in the program, just too much data on the input side.
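The test script was roughly along these lines (a sketch rather than the exact code; the table name and row shape are illustrative, only the push interval and batch size change between the two cases):

import { Readable } from 'stream';
import { createClient } from '@clickhouse/client';

const client = createClient(); // connection settings omitted

const stream = new Readable({ objectMode: true, read: () => { } });
const insertPromise = client.insert({
  table: 'test', // illustrative table name
  values: stream,
  format: 'JSONCompactEachRow',
});

// case 1: 10,000 rows every 1 ms  -> memory quickly exceeds 2GB
// case 2: 1,000 rows every 10 ms  -> memory stays bounded
const ROWS_PER_TICK = 10_000;
const TICK_MS = 1;

let pushed = 0;
const timer = setInterval(() => {
  for (let i = 0; i < ROWS_PER_TICK; i++) {
    stream.push([pushed++, 'repo', 0, 'org', '2022-12-01', Math.random()]);
  }
  if (pushed >= 100_000_000) {
    clearInterval(timer);
    stream.push(null); // end of input, lets the insert finish
  }
}, TICK_MS);

await insertPromise;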
What I am not sure about is why the data consumption is so slow. I only started to use this client after our ClickHouse server was upgraded to 22.8; before that we were on CH 20.8 with node-clickhouse as our client. With it I also streamed the inserts, and everything worked fine with the same input code.
The code with node-clickhouse looks like this (I used JSONEachRow in the old code, so there is much more network traffic than with JSONCompactEachRow, I think):
const stream = client.query('INSERT INTO repo_openrank', { format: 'JSONEachRow' });
stream.on('error', e => { this.logger.error(e); });
await runQueryStream('MATCH (n:Repo) RETURN n', async node => {
  const repo_id = parseInt(node.n.id);
  const repo_name = node.n.name;
  const org_id = parseInt(node.n.org_id);
  const org_login = node.n.org_login;
  for (const p of Object.keys(node.n)) {
    if (p.startsWith('open_rank_')) {
      const time = p.substring('open_rank_'.length);
      const openrank = parseFloat(node.n[p]);
      const year = parseInt(time.substring(0, 4));
      const month = parseInt(time.substring(4));
      const created_at = new Date(`${year}-${month.toString().padStart(2, '0')}-01`);
      stream.write({ repo_id, repo_name, org_id, org_login, created_at, openrank });
    }
  }
});
stream.end();
And for inserting about 5 billion records into a table (GHArchive data, split into files of roughly 2 million records each), the old client finishes in about 36 hours while the new one now takes about 50 hours, so it is slower than the old one. I can slow down the insertion process, but I also think the consumption side could be sped up.
Thanks for the input @frank-zsy, we will look into it.
Thanks, I will close this issue; slowing down the input process fixes my problem for now.
@frank-zsy
If I push 10,000 records every 1 millisecond, the memory quickly exceeds 2GB, but if I push 1,000 records every 10 milliseconds, everything works fine. Records are inserted into the server about 1 million at a time, and the memory usage drops after each batch insertion. So I think there is no memory leak in the program, just too much data on the input side.
I just realized that we've got another similar report recently (see https://github.com/ClickHouse/clickhouse-js/issues/135), and I think that example might help you speed up the insertion rate without running out of memory by taking backpressure into consideration. Please have a look: https://github.com/ClickHouse/clickhouse-js/issues/135#issuecomment-1435671544
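For illustration, one backpressure-aware pattern (a sketch in the same spirit, not necessarily the exact code from that comment) is to feed the insert from an async generator via Readable.from, so the source is only pulled as fast as the insert can consume it:

import { Readable } from 'stream';
import { createClient } from '@clickhouse/client';

const client = createClient(); // connection settings omitted

// an async generator is pulled on demand, so reading from the source
// (e.g. the Neo4j result stream) naturally slows down to the insert rate
async function* rows() {
  // replace with the real record source; these rows are illustrative
  for (let i = 0; i < 10_000_000; i++) {
    yield [i, 'repo', 0, 'org', '2022-12-01', Math.random()];
  }
}

await client.insert({
  table: 'repo_openrank',
  values: Readable.from(rows()),
  format: 'JSONCompactEachRow',
});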
Hi, I am using this client in my project and I would like to use the insert stream mode.
I initialize my own Readable stream and pass it into the insert function. I push millions of records into the stream, but I found that the data is not actually inserted into the database until the stream is ended with null. This leads to a memory increase, so I have to split the insertion into smaller pieces myself.
Is there any option I can set to insert the records in batches and decrease the memory use of the program?
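For context, the pattern I am describing looks roughly like this (a simplified sketch; the table and row shape are just examples):

import { Readable } from 'stream';
import { createClient } from '@clickhouse/client';

const client = createClient(); // connection settings omitted

const stream = new Readable({ objectMode: true, read: () => { } });
const insertPromise = client.insert({
  table: 'test', // illustrative table name
  values: stream,
  format: 'JSONEachRow',
});

// millions of records are pushed here, but the rows only seem to land in
// the database once the stream is ended with push(null)
for (let i = 0; i < 5_000_000; i++) {
  stream.push({ id: i, value: Math.random() });
}
stream.push(null);

await insertPromise;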