DarkWanderer / ClickHouse.Client

.NET client for ClickHouse
MIT License
302 stars 58 forks source link

BulkCopy not raising exception on big datasets #475

Closed marcinatpayholdingcom closed 1 month ago

marcinatpayholdingcom commented 1 month ago

I was copying data from MSSQL to ClickHouse. My table was 2M rows. Each time I was running the copy I was missing some rows. I was not getting any exception during the copy, everything was running nice and smooth (batch size 10k), but the data was missing, I reduced the data to 1k rows executed and this is what I got:

ClickHouse.Client.Copy.ClickHouseBulkCopySerializationException: Error when serializing data
 ---> System.InvalidCastException: Object cannot be cast from DBNull to other types.
   at System.DBNull.System.IConvertible.ToInt16(IFormatProvider provider)
   at ClickHouse.Client.Types.Int16Type.Write(ExtendedBinaryWriter writer, Object value)
   at ClickHouse.Client.Copy.ClickHouseBulkCopy.BulkCopyHttpContent.SerializeBatchAsync(Stream stream)
   --- End of inner exception stack trace ---
   at ClickHouse.Client.Copy.ClickHouseBulkCopy.BulkCopyHttpContent.SerializeBatchAsync(Stream stream)
   at ClickHouse.Client.Copy.ClickHouseBulkCopy.BulkCopyHttpContent.SerializeToStreamAsync(Stream stream, TransportContext context)
   at System.Net.Http.HttpContent.<CopyToAsync>g__WaitAsync|56_0(ValueTask copyTask)
   at System.Net.Http.HttpConnection.SendRequestContentAsync(HttpRequestMessage request, HttpContentWriteStream stream, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at ClickHouse.Client.ADO.ClickHouseConnection.PostContentAsync(String sql, HttpContent httpData, CancellationToken token)
   at ClickHouse.Client.Copy.ClickHouseBulkCopy.SendBatchAsync(BulkCopyHttpContent batchContent, CancellationToken token)
   at ClickHouse.Client.Copy.ClickHouseBulkCopy.WriteToServerAsync(IEnumerable`1 rows, CancellationToken token)
   at DataSync.Data.Clients.Clickhouse.ChClient.<>c__DisplayClass4_0.<<BulkInsertAsync>b__0>d.MoveNext()

Which explained me what is the problem. Grate, but it took me a long time to realize that I have wrong column definition in CH. But the exception was not thrown on big data copy, it was just silent.

DarkWanderer commented 1 month ago

Hi Thank you for your report, that would be a serious bug. Can you clarify a bit more on how this appeared - was the exception happening somewhere near the beginning of dataset or near the end?

Also, if you would be able to reproduce it - does the issue reproduce on previous versions, e.g. 6.8.1?

DarkWanderer commented 1 month ago

I have tried to reproduce the issue in a test and it doesn't seem to appear with a synthetic test - a serialization exception is always correctly propagated. Can you provide more context please?

marcinatpayholdingcom commented 1 month ago

I will try to build demo for you.

On 9 May 2024, at 12:11, Oleg V. Kozlyuk @.***> wrote:



I have tried to reproduce the issue in a test and it doesn't seem to appear with a synthetic test - a serialization exception is always correctly propagated. Can you provide more context please?

— Reply to this email directly, view it on GitHubhttps://github.com/DarkWanderer/ClickHouse.Client/issues/475#issuecomment-2102366337, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AYPRFDYB2YAGH2KUWALCNJTZBNDUBAVCNFSM6AAAAABHL2EAQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBSGM3DMMZTG4. You are receiving this because you authored the thread.Message ID: @.***>

DarkWanderer commented 1 month ago

Please try version 7.5.0 to see if it fixes the issue for you

marcinatpayholdingcom commented 1 month ago

Nop, still not working correctly, here is a demo to replicate this problem. Let me know, maybe I am doing something wrong. https://github.com/marcinatpayholdingcom/clickhouseclient_bug_replication

DarkWanderer commented 1 month ago

Sorry - took me a while to get my hands on it. However, for me it works perfectly:

Hello, World!

O:\Projects\clickhouseclient_bug_replication\ClickHouse.ClientNotSerializing\bin\Debug\net8.0\ClickHouse.ClientNotSerializing.exe (process 21308) exited with code 0.
Press any key to close this window . . .

Any additional details I am missing? Do you use any special flags or parameters (e.g. UseSessions)? What OS was this reproed on?

Here are specific exceptions which are shown in the debugger: image

image

marcinatpayholdingcom commented 1 month ago

I am running this on Windows this part should throw the serialization exception but it does not: image

I do not have any special setting, I just run it as it is in the project. So the problem is, it does not throw, so only part of the data is inserted and chunks with broken data are skipped.

DarkWanderer commented 1 month ago

For me it is triggered quite perfectly every time image

Just to double check, when you say "exception is not thrown", have you checked it in the debugger by putting a breakpoint on the logger._LogError line, or did you check only by looking at the logs? Because your test project does not output any logs even in case of error

marcinatpayholdingcom commented 1 month ago

by setting the brekpoin at the log line and by seeing what data got inserted into Clickhouse. When you get the serialization exception that means that no data got inserted into Clickhouse in the first attempt, am I right?

In my case there is some data inserted.

I think you can close this issue as not able to replicate. I have no clue what setup causes this situation on my machine. First I thought that it was because of the original project I am working on, maybe I did something wrong. But when I created this demo and things are still failing incorrectly then I give up.

DarkWanderer commented 1 month ago

I think there may be just a misunderstanding here, which means I'll need to make docs more clear

When you get the serialization exception that means that no data got inserted into Clickhouse in the first attempt, am I right?

ClickHouse is not transactional (other than experimental features) - and hence BulkCopy adapter isn't as well. If you have a large set of data and the serialization (or insertion) has failed for one of the batches, it is expected that some of the batches may have made it through already, and it falls on the higher-level code to handles the situation. I.e. in this particular case, 20k of rows get into database before exception - and that's by design.

If you need to have some semblance of atomicity, the data needs to go as one 'batch' - so you can try setting BatchSize to Int32.MaxValue (reference). It would be slow and memory-consuming, but you would get a binary result.