Giorgi / DuckDB.NET

Bindings and ADO.NET Provider for DuckDB
https://duckdb.net
MIT License
343 stars 62 forks source link

DuckDBException when inserting some Unicode characters using appender #80

Closed JackMyers001 closed 1 year ago

JackMyers001 commented 1 year ago

When using the DuckDBAppender, I am receiving exceptions when inserting some Unicode characters. I believe this is a DuckDB.NET issue, as similar code works fine with Python.

Example code (uses the new .NET6 console application template):

using DuckDB.NET.Data;

var USE_APPENDER = true;

var DB_FILE = "csharptest.duckdb";
var TABLE_NAME = "TestTable";

File.Delete(DB_FILE);
File.Delete($"{DB_FILE}.wal");

using var db = new DuckDBConnection($"DataSource={DB_FILE}");
db.Open();

var dbCommand = db.CreateCommand();

dbCommand.CommandText = $"CREATE TABLE {TABLE_NAME}(index INTEGER, words VARCHAR);";
dbCommand.ExecuteNonQuery();

var words = new List<string> { "hello", "안녕하세요", "Ø3mm CHAIN" };

if (USE_APPENDER) {
    using var appender = db.CreateAppender(TABLE_NAME);

    for (int i = 0; i < words.Count; i++) {
        Console.WriteLine($"Creating row {i}...");

        var row = appender.CreateRow();
        row.AppendValue(i).AppendValue(words[i]);

        row.EndRow();
    }

    appender.Close();
} else {
    for (int i = 0; i < words.Count; i++) {
        Console.WriteLine($"Creating row {i}...");

        dbCommand.CommandText = $"INSERT INTO {TABLE_NAME} VALUES ({i}, '{words[i]}');";
        dbCommand.ExecuteNonQuery();
    }
}

db.Close();

If USE_APPENDER = true, the program will crash when inserting the 3rd word with the following log:

Creating row 0...
Creating row 1...
Creating row 2...
Unhandled exception. DuckDB.NET.Data.DuckDBException (0x80004005): INTERNAL Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
   at DuckDB.NET.Data.DuckDBAppender.ThrowLastError(DuckDBAppender appender)
   at DuckDB.NET.Data.DuckDBAppender.Close()
   at Program.<Main>$(String[] args) in C:\Users\jmyers\projects\duckdb-test\Program.cs:line 33

If USE_APPENDER = false, the program will complete successfully. Opening up the database with Tad reveals that all data was inserted successfully, with no encoding issues.

The only way I have found to insert strings containing Ø has been to convert the string to ASCII like Encoding.ASCII.GetString(Encoding.UTF8.GetBytes(s)), however this results in the Ø being replaced with ?? - not exactly desirable.

Other information:

Giorgi commented 1 year ago

This should now be fixed in develop branch.

Giorgi commented 1 year ago

@JackMyers001 You should be able to get the fix from GitHub Packages

JackMyers001 commented 1 year ago

I can confirm this patch has fixed my issue! Thanks so much for fixing it so quickly :)

The only oddity I have noticed (and I need to do more testing to confirm if it's related to this patch) is it appears that insert time has taken a hit; inserting some 1.3 million rows used to take ~280 seconds, but now takes more like 370. Again, need to do more testing to confirm.

Giorgi commented 1 year ago

I made some changes to how strings are marshaled to native code. Can you check if the latest preview package is faster?

JackMyers001 commented 1 year ago

Apologies for the slow response.

I had run some testing a week or so ago with DuckDB.NET.Data.Full version 0.6.1-alpha.9 and I could have sworn that I remember seeing no performance change (i.e. still ~370 seconds). However, I have just run two runs over 1.3 million rows (the same data as before) and they both completed in ~255 seconds. I also tried out 0.6.1-alpha.10, and that appears to have within margin-of-error performance gains.

The difference in my memory vs testing now could be a slight coincidence as I have changed a couple of settings for my test VM, however I didn't change the amount of RAM / CPU cores... Maybe I'm just losing it 🤣

Anyhow, it does appear that the second patch has fixed that performance degradation. I really appreciate your help!

Giorgi commented 1 year ago

Nice to hear that it runs fast again. As 0.7.0 was released yesterday, I'll push a new build in a day or two for these fixes and include the 0.7.0 build too.

Consider Sponsoring the Project if you want to support it.