Giorgi / DuckDB.NET

Bindings and ADO.NET Provider for DuckDB
https://duckdb.net
MIT License
409 stars 69 forks source link

DuckDBAppender: Add enum support #210

Closed eschgi closed 1 month ago

eschgi commented 1 month ago

Hello,

the DuckDBAppender class now supports adding enum values.

I want to get some feedback from you @Giorgi, if the code is ok and if i should improve the error handling in the EnumVectorDataWriter class?

I can also add additional tests if needed.

Issue: https://github.com/Giorgi/DuckDB.NET/issues/209

Kind regards Stefan

coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 10453973952

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
DuckDB.NET.Data/Internal/Writer/ListVectorDataWriter.cs 6 7 85.71%
DuckDB.NET.Data/Internal/Writer/VectorDataWriterBase.cs 2 3 66.67%
DuckDB.NET.Data/Internal/Writer/EnumVectorDataWriter.cs 38 48 79.17%
<!-- Total: 48 60 80.0% -->
Files with Coverage Reduction New Missed Lines %
DuckDB.NET.Data/Internal/Writer/VectorDataWriterBase.cs 1 81.67%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 10090837361: -0.4%
Covered Lines: 1757
Relevant Lines: 1926

💛 - Coveralls
Giorgi commented 1 month ago

Thanks for the PR, looks good!

Some comments:

eschgi commented 1 month ago

Hi @Giorgi,

thanks for the feedback.

I will add the missing tests and also check what happens if an enum value is out of range.

Regarding your second point:

It is currently possible with the latest released version of DuckDB.NET to use string parameters to set enum fields. It would make sense to me if the DuckDBAppender class also supports this. At my work, we may be using DuckDB.NET in the future, so this feature would help me. I will improve the performance so that developers who don't need this don't have a disadvantage. Also other libraries like Npgsql support the same feature (https://www.npgsql.org/doc/types/enums_and_composites.html#reading-and-writing-unmapped-enums).

Here is my example code where I use string parameters to set an enum field:

using System.Data;
using DuckDB.NET.Data;

File.Delete("file.db");

using var duckDBConnection = new DuckDBConnection("Data Source=file.db");
duckDBConnection.Open();

using var command = duckDBConnection.CreateCommand();

command.CommandText = "CREATE TYPE test_enum AS ENUM ('test1', 'test2', 'test3')";
command.ExecuteNonQuery();

command.CommandText = "CREATE TABLE test (a test_enum not null);";
command.ExecuteNonQuery();

command.CommandText = "INSERT INTO test (a) VALUES ($a);";
foreach (var enumValue in new string[] { "test1", "test2", "test3" })
{
    command.Parameters.Clear();
    command.Parameters.Add(new DuckDBParameter("a", enumValue));
    command.ExecuteNonQuery();
}

command.CommandText = "SELECT a FROM test;";
var reader = command.ExecuteReader();
while (reader.Read())
{
    string enumValue = reader.GetString("a");
    Console.WriteLine(enumValue);
}

@Giorgi With the above explanation, is it ok for you if we support writing strings to an Enum column?

Kind regards Stefan

eschgi commented 1 month ago

Hi @Giorgi,

something related.

As far as I understand it, in DuckDB.NET the enum values are mapped via the numeric value and not according to the enum value names.

In the Npgsql library, for example, the enum value name is used for the mapping. I would prefer such an approach. See the following example code:

using System.Data;
using Npgsql;
using NpgsqlTypes;

var connectionString = "Host=127.0.0.1;Username=postgres;Password=xxx;Database=test_npgsql";

var dataSourceBuilder = new NpgsqlDataSourceBuilder(connectionString);
dataSourceBuilder.MapEnum<TestEnum>();
var dataSource = dataSourceBuilder.Build();

var npgsqlConnection = dataSource.OpenConnection();

using var command = npgsqlConnection.CreateCommand();

command.CommandText = "DROP TABLE IF EXISTS test;";
command.ExecuteNonQuery();

command.CommandText = "DROP TYPE IF EXISTS test_enum;";
command.ExecuteNonQuery();

command.CommandText = "CREATE TYPE test_enum AS ENUM ('test1', 'test2', 'test3')";
command.ExecuteNonQuery();

command.CommandText = "CREATE TABLE test (a test_enum not null);";
command.ExecuteNonQuery();

npgsqlConnection.ReloadTypes();

command.CommandText = "INSERT INTO test (a) VALUES (:a);";
foreach (var enumValue in new string[] { "test1", "test2", "test3" })
{
    command.Parameters.Clear();
    command.Parameters.Add(new NpgsqlParameter("a", enumValue) { DataTypeName = "test_enum" });
    command.ExecuteNonQuery();
}

command.CommandText = "SELECT a FROM test ORDER BY a ASC;";
var reader = command.ExecuteReader();
while (reader.Read())
{
    TestEnum enumValue = reader.GetFieldValue<TestEnum>("a");
    Console.WriteLine(enumValue);
}

public enum TestEnum
{
    [PgName("test3")]
    Test_3,
    [PgName("test2")]
    Test_2,
    [PgName("test1")]
    Test_1,
}

@Giorgi Do we want to change this?

Kind regards Stefan

Giorgi commented 1 month ago

Ok, let's keep strings too. In such case can you move enumValueIndexDictionary initialization from ctor to the AppendString method? But make sure to initialize it only once.

Also, can you update the test to read the Enum that was appended as string in the Enum type and vice-versa to test all the combinations?

Thanks!

Giorgi commented 1 month ago

Hi @Giorgi,

something related.

As far as I understand it, in DuckDB.NET the enum values are mapped via the numeric value and not according to the enum value names.

In the Npgsql library, for example, the enum value name is used for the mapping. I would prefer such an approach. See the following example code:

using System.Data;
using Npgsql;
using NpgsqlTypes;

var connectionString = "Host=127.0.0.1;Username=postgres;Password=xxx;Database=test_npgsql";

var dataSourceBuilder = new NpgsqlDataSourceBuilder(connectionString);
dataSourceBuilder.MapEnum<TestEnum>();
var dataSource = dataSourceBuilder.Build();

var npgsqlConnection = dataSource.OpenConnection();

using var command = npgsqlConnection.CreateCommand();

command.CommandText = "DROP TABLE IF EXISTS test;";
command.ExecuteNonQuery();

command.CommandText = "DROP TYPE IF EXISTS test_enum;";
command.ExecuteNonQuery();

command.CommandText = "CREATE TYPE test_enum AS ENUM ('test1', 'test2', 'test3')";
command.ExecuteNonQuery();

command.CommandText = "CREATE TABLE test (a test_enum not null);";
command.ExecuteNonQuery();

npgsqlConnection.ReloadTypes();

command.CommandText = "INSERT INTO test (a) VALUES (:a);";
foreach (var enumValue in new string[] { "test1", "test2", "test3" })
{
    command.Parameters.Clear();
    command.Parameters.Add(new NpgsqlParameter("a", enumValue) { DataTypeName = "test_enum" });
    command.ExecuteNonQuery();
}

command.CommandText = "SELECT a FROM test ORDER BY a ASC;";
var reader = command.ExecuteReader();
while (reader.Read())
{
    TestEnum enumValue = reader.GetFieldValue<TestEnum>("a");
    Console.WriteLine(enumValue);
}

public enum TestEnum
{
    [PgName("test3")]
    Test_3,
    [PgName("test2")]
    Test_2,
    [PgName("test1")]
    Test_1,
}

@Giorgi Do we want to change this?

Kind regards Stefan

I think this won't work with Appender because when appending string as Enum we don't know the underlying C# Enum.

eschgi commented 1 month ago

Hi @Giorgi,

i think i could get the mapping working, my question is more, if mapping enums by name would be also a better approach for you? :) In my opinion, this approach is less error-prone.

Giorgi commented 1 month ago

Let's keep it as it is for now.

eschgi commented 1 month ago

Ok, I'll make the necessary changes as discussed.

eschgi commented 1 month ago

Hi @Giorgi,

I implemented all the discussed changes, except the lazy initialization of the enumValueIndexDictionary / enumValues dictionary.

During writing the list tests I noticed that the logical type handle is released earlier than expected, so a lazy initialization is currently not so easy to implement (at least for me).

Can you please review the PR again?

Kind regards Stefan

Giorgi commented 1 month ago

I'll review it later today. Why did you need to add WriteItemsFallback ?

eschgi commented 1 month ago

I'll review it later today. Why did you need to add WriteItemsFallback ?

The cast to IEnumerable<object> does not work when you have a list with enums.