DarkWanderer / ClickHouse.Client

.NET client for ClickHouse
MIT License

Expose API to support binary payload for bulkcopy externally #60

Open i-sinister opened 3 years ago

i-sinister commented 3 years ago

The current 'bulk copy' API expects IEnumerable<object[]> as input. For performance reasons this API does not suit my needs - I have to insert a large number of "small" objects, which implies a large number of small array allocations plus boxing of primitive values, and that causes excessive memory allocations (and GC pressure). So I need to "format the data" into the ClickHouse binary format outside the BulkCopy class and pass it in somehow, together with the schema. I cannot use the overload that accepts a DataReader, because internally it still allocates object arrays. Batching might also be necessary, so the simplest version would look like:

```csharp
interface IBulkCopyData
{
    // this property could also be passed as a separate argument
    ColumnsAndType[] Schema { get; }

    /// <summary>Returns true if there is another batch that should be formatted.</summary>
    Task<bool> Next(BinaryStreamWriter writer);
}

// usage:
var myBulkCopyData = new MyBulkCopyData(source);
using var bulkCopy = new ClickHouseBulkCopy(clickHouseConnection);
// BatchSize and MaxDegreeOfParallelism are irrelevant here, because batching is driven
// by IBulkCopyData and the degree of parallelism is limited to 1
await bulkCopy.WriteToServerAsync(myBulkCopyData, cancellationToken);
```

Currently this is not possible because BinaryStreamWriter and the related "type API" are internal. Also, connection.PostStreamAsync(query, stream, true, token) is internal, so even if I rewrote BinaryStreamWriter I still could not use the connection to send the data.
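For illustration, if PostStreamAsync were exposed with the internal signature quoted above, the calling side could look roughly like this (a sketch; WriteRowBinary and rows are hypothetical, and I am assuming the boolean argument is the compression flag):

```csharp
using System.IO;
using System.IO.Compression;

// Format the payload on the caller side, then hand the finished stream to the connection.
using var payload = new MemoryStream();
using (var gzip = new GZipStream(payload, CompressionLevel.Fastest, leaveOpen: true))
{
    WriteRowBinary(gzip, rows); // hypothetical caller-side RowBinary formatter
}
payload.Position = 0;
await connection.PostStreamAsync(
    "INSERT INTO my_table (a, b) FORMAT RowBinary", payload, true, cancellationToken);
```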

Also, it would be nice to be able to write directly into the network stream (like the "json formatters" do) instead of writing to a memory stream, because growing the memory stream is another potential bottleneck - the initially allocated memory stream is 0.5 MB, and if I need to send 1 GB of data it will be resized and copied 11 times, consuming 2 GB in total and polluting the LOH. For this, the connection API would need to be extended.
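A quick back-of-the-envelope check of those numbers, assuming the usual doubling growth strategy of MemoryStream:

```csharp
using System;

long capacity = 512 * 1024;               // 0.5 MB initial buffer
long target = 1L << 30;                   // 1 GB of data to send
long totalAllocated = capacity;
int resizes = 0;
while (capacity < target)
{
    capacity *= 2;                        // each growth doubles the buffer...
    totalAllocated += capacity;           // ...allocating a fresh LOH-sized array and copying into it
    resizes++;
}
Console.WriteLine($"{resizes} resizes, {totalAllocated / (double)(1L << 30):F1} GB allocated in total");
// prints: 11 resizes, 2.0 GB allocated in total
```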

i-sinister commented 3 years ago

Just to make it clear - IEnumerable<object[]> itself can be cheap, but the "whole pipeline" is slow: data is read from an "external source", and for each row some kind of DTO object is constructed, which is then converted to an object[] to adapt to the BulkCopy API. If it were possible to produce the "data in binary format for bulk copy" directly, I could get rid of the intermediate objects, which would reduce memory and CPU usage.
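In code, the per-row shape of the current pipeline is roughly this (all names are made up for illustration):

```csharp
using System.Collections.Generic;

// Each row costs one DTO allocation, one object[] allocation and one box per
// value-type column - none of which the binary formatter actually needs.
IEnumerable<object[]> Adapt(IEnumerable<SourceRecord> source)
{
    foreach (var record in source)
    {
        var dto = new MyDto(record);                      // intermediate DTO
        yield return new object[] { dto.Id, dto.Value };  // array allocation + boxing
    }
}
```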

i-sinister commented 3 years ago

Digging deeper, I've found that the current batching implementation also causes problems, because it stores the rows' object arrays in a temporary array, preventing them from being collected in generation 0.

I've made a small benchmark that reads ~30M integer tuples from a file and stores them in a memory stream with gzip compression, the way BulkCopy does. Here is the summary:

| Method | Duration (s) | GC0 | GC1 | GC2 | Allocated bytes |
| --- | --- | --- | --- | --- | --- |
| direct write | 12.1415900 | 1117 | 8 | 8 | 5,180,987,472 |
| no buffering | 14.5800722 | 2071 | 10 | 8 | 9,170,904,968 |
| 1M buffering | 38.2794354 | 1459 | 427 | 83 | 9,178,938,824 |

Even though the 'Allocated bytes' figure for 'no buffering' is nearly twice as high as for 'direct write', that does not affect performance much - I just got a ~2-second penalty for gen0 object allocation and GC, and the observed memory usage is about the same, around 280 MB, which is close to the resulting memory stream size. However, with buffering, memory usage goes over 500 MB and overall it runs 3 times slower, because it triggers 10 times as many expensive gen2 collections.
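For reference, the two extremes compared above look roughly like this (a sketch, not the exact benchmark code; ReadTuples stands in for the reader of the ~30M integer tuples):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// "direct write": each tuple goes straight into the gzipped stream,
// so every allocation dies young and gen2 stays quiet.
static void DirectWrite(IEnumerable<(int A, int B)> tuples, Stream output)
{
    using var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true);
    using var writer = new BinaryWriter(gzip);
    foreach (var (a, b) in tuples)
    {
        writer.Write(a);
        writer.Write(b);
    }
}

// "1M buffering": tuples are first materialized as object[] rows in a
// 1M-element batch (as the current BulkCopy batching does), so the row
// arrays survive several gen0 collections and get promoted to gen1/gen2.
static void BufferedWrite(IEnumerable<(int A, int B)> tuples, Stream output)
{
    const int BatchSize = 1_000_000;
    var batch = new List<object[]>(BatchSize);
    using var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true);
    using var writer = new BinaryWriter(gzip);
    foreach (var tuple in tuples)
    {
        batch.Add(new object[] { tuple.A, tuple.B }); // array allocation + boxing per row
        if (batch.Count == BatchSize)
        {
            foreach (var row in batch)
            {
                writer.Write((int)row[0]);
                writer.Write((int)row[1]);
            }
            batch.Clear();
        }
    }
    // a real implementation would also flush the final partial batch
}
```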

So adding an overload that ignores batching and just writes from IEnumerable<object[]> to the gzipped memory stream might solve my problem.
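One possible shape for such an overload (the name and exact signature are just an illustration, not an agreed API):

```csharp
// Streams rows straight into the gzipped request body without first
// collecting them into intermediate batch arrays.
public Task WriteToServerUnbatchedAsync(IEnumerable<object[]> rows, CancellationToken token);
```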

DarkWanderer commented 3 years ago

Hi. Sorry for the long delay - needed a break to sit down and digest this thoroughly.

You make some good points here. Another user has already asked to make PostStreamAsync public (#61) - I think that should not cause issues for the public API, and it will answer part of your query.

For the other part, I will explore the possibility of getting rid of the intermediate batching arrays - BulkCopy was written under the assumption that operations with object[] are cheaper than writing to a MemoryStream, but that might actually not be the case.