i-sinister opened 3 years ago
Just to make it clear - IEnumerable<object[]> itself can be cheap, however the "whole pipeline" is slow - data is read from an "external source" and for each row some kind of DTO object is constructed, which is then converted to object[] to adapt to the BulkCopy API. If it were possible to produce the "data in binary format for bulk copy" directly, I could get rid of the intermediate objects, which would reduce memory and CPU usage.
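For illustration, the current pipeline is roughly the following (the DTO type and field names are made up for the example; only the IEnumerable<object[]> contract is from the actual BulkCopy API):

```csharp
using System.Collections.Generic;

// Hypothetical DTO produced from the external source (names are placeholders).
public record RowDto(int Id, int Value);

public static class BulkCopyAdapter
{
    // Adapts DTOs to what BulkCopy expects: one object[] allocation plus
    // one boxing allocation per primitive value, for every single row.
    public static IEnumerable<object[]> ToObjectRows(IEnumerable<RowDto> source)
    {
        foreach (var dto in source)
            yield return new object[] { dto.Id, dto.Value };
    }
}
```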
Digging deeper, I've found that the current batching implementation also causes problems, because it stores the object arrays for the rows in a temporary array, preventing them from being collected in generation 0.
I've made a small benchmark that reads ~30M integer tuples from a file and stores them in a memory stream with gzip compression, the way BulkCopy does. Here is the summary:

method | duration (s) | GC0 | GC1 | GC2 | Allocated bytes |
---|---|---|---|---|---|
direct write | 12.1415900 | 1117 | 8 | 8 | 5180987472 |
no buffering | 14.5800722 | 2071 | 10 | 8 | 9170904968 |
1M buffering | 38.2794354 | 1459 | 427 | 83 | 9178938824 |
Even though with 'no buffering' the 'Allocated bytes' figure is twice as high as with 'direct write', it does not affect performance much - I just got a ~2 second penalty for the extra gen0 object allocation and GC, and the observed memory usage is about the same - around 280MB, which is close to the resulting memory stream size. However, with buffering, memory usage goes over 500MB and overall it works 3 times slower, because it does roughly 10 times more of the expensive Gen2 collections.
So adding an overload that ignores batching and just writes from IEnumerable<object[]> to the gzipped memory stream might solve my problem.
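For reference, the three write paths compared above look roughly like this (a simplified sketch of the benchmark, not the actual library code; the (int, int) row type and the batch size mirror the ~30M integer tuples mentioned above):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class WritePathSketch
{
    // "direct write": values go straight into the compressed stream, no object[] per row.
    public static void DirectWrite(IEnumerable<(int A, int B)> rows, MemoryStream output)
    {
        using var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true);
        using var writer = new BinaryWriter(gzip);
        foreach (var (a, b) in rows)
        {
            writer.Write(a);
            writer.Write(b);
        }
    }

    // "no buffering": an object[] (plus boxed ints) per row, but each row is written
    // immediately, so the garbage dies young in Gen0.
    public static void NoBuffering(IEnumerable<object[]> rows, MemoryStream output)
    {
        using var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true);
        using var writer = new BinaryWriter(gzip);
        foreach (var row in rows)
        {
            writer.Write((int)row[0]);
            writer.Write((int)row[1]);
        }
    }

    // "1M buffering": rows are held in a temporary list until the batch is full,
    // so the object[]s live long enough to be promoted to Gen1/Gen2.
    public static void BufferedWrite(IEnumerable<object[]> rows, MemoryStream output, int batchSize = 1_000_000)
    {
        using var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true);
        using var writer = new BinaryWriter(gzip);
        var batch = new List<object[]>(batchSize);
        foreach (var row in rows)
        {
            batch.Add(row);
            if (batch.Count < batchSize) continue;
            foreach (var r in batch) { writer.Write((int)r[0]); writer.Write((int)r[1]); }
            batch.Clear();
        }
        foreach (var r in batch) { writer.Write((int)r[0]); writer.Write((int)r[1]); }
    }
}
```

The buffered variant mirrors what the current batching implementation does with its temporary arrays (see above), which is where the extra Gen1/Gen2 collections come from.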
Hi. Sorry for the long delay - needed a break to sit and digest this thoroughly.

You make some good points here. Another user already asked to make PostStreamAsync public (#61) - I think that should not cause issues for the public API, and it will answer part of your query. For the other part, I will explore the possibility of getting rid of the intermediate batching arrays - BulkCopy was written under the assumption that operations with object[] are cheaper than writing to a MemoryStream, but that might actually not be the case.
The current 'bulk copy' API expects IEnumerable<object[]> as an input. For performance reasons this API does not suit my needs - I have to insert a large number of "small" objects, which implies a large number of small array allocations and primitive value boxing, and that causes excessive memory allocations (and GC). So I need to "format data" into the ClickHouse binary format outside the BulkCopy class and pass it in somehow, together with the schema. I cannot use the overload that accepts a data reader, because internally it still allocates object arrays. Batching might also be necessary, so the simplest version would look something like the sketch below.
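A minimal sketch of the shape I have in mind - every name below (RowSerializer, IBinaryBulkCopy, WriteToServerAsync, the string-based schema) is hypothetical and not part of the library; BinaryStreamWriter refers to the internal writer mentioned below:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape only - these types do not exist in the library.
// BinaryStreamWriter is the library's (currently internal) binary writer.
public delegate void RowSerializer(BinaryStreamWriter writer);

public interface IBinaryBulkCopy
{
    // The caller serializes each row into the ClickHouse binary format itself;
    // the implementation only batches rows, compresses them and posts each batch to the server.
    Task WriteToServerAsync(
        string destinationTable,
        IReadOnlyList<string> columnTypes,   // schema; the representation here is a placeholder
        IEnumerable<RowSerializer> rows,
        int batchSize,
        CancellationToken token);
}
```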
Currently this is not possible, because BinaryStreamWriter and the related "type API" are internal. Also, connection.PostStreamAsync(query, stream, true, token) is internal, so even if I rewrote BinaryStreamWriter I still could not use the connection to send the data. It would also be nice to be able to write directly into the network stream (like the "json formatters" do) instead of writing to a memory stream, because memory stream growth is another potential bottleneck - the memory stream is currently allocated at 0.5MB, so if I need to send 1GB of data it would be resized and copied 11 times, consuming 2GB in total and polluting the LOH. For this, the connection API would have to be extended.
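A hypothetical connection-level extension along those lines - PostStreamAsync(query, stream, true, token) is the internal method mentioned above, but this callback-based overload is purely my assumption of what "write directly into the network stream" could look like:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical overload: instead of taking a fully materialized (memory) stream,
// the connection hands the caller the outgoing request stream and lets it write
// the payload directly, so no MemoryStream ever has to grow to the full payload size.
public interface IClickHouseTransportExtension
{
    Task PostStreamAsync(
        string query,
        Func<Stream, CancellationToken, Task> writePayload,   // caller streams the binary data
        bool useCompression,
        CancellationToken token);
}
```

One way to implement this on top of HttpClient is a custom HttpContent that overrides SerializeToStreamAsync and invokes the callback, so the request body is pushed to the server without being buffered in full.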