aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 140 forks source link

[BUG]: Using Nuget Package with dotnet 8 doesn't support DateOnly type. #472

Open mateli opened 5 months ago

mateli commented 5 months ago

Library Version

4.23.3

OS

Windows 11/Debian

OS Architecture

64 bit

How to reproduce?

Load any parquet file with Date fields, and they get translated to DateTime.

if NET6_0_OR_GREATER

                ConvertedType.DATE => options.UseDateOnlyTypeForDates ? typeof(DateOnly) : typeof(DateTime),

else

                ConvertedType.DATE => typeof(DateTime),

endif

When debugging using the rider decompiler the first branch is grayed out and "ConvertedType.DATE => typeof(DateTime)," is run. My guess is that this package is bult with a target framework less than dotnet 6 and as preprocessor directives are interpreted during compile time and not during execution this means that even if the package is used with dotnet 8 we don't get dotnet 8 features. Temporary solution is to get the source and manually build the package with dotnet 6+ as target.

Failing test

No response

aloneguid commented 5 months ago

To help you solve your problem, I need to understand your goal and the context of your situation. Could you please describe in detail what you are trying to achieve and why? It would be very helpful if you could also share a failing test case that illustrates the issue you are facing. This way, I can reproduce the error and debug it more easily.

mateli commented 4 months ago

I figured out that it were mainly a DataTable/DataFrame problem as neither support DateOnly and will convert them to DateTime.

This is the solution for me: First I created a Parquet project in my solution to make sure that it is built with dotnet 8. At this point I don't know if this is needed.

Then I created a replacement for DataTable/DataFrame where has a list of lists as columns.

Code that adds a column by creating a list using reflection.

        public void Add(Type type, String columnName, Boolean nullable, DataRow? schema = null) {
            if (nullable && type.IsValueType) {
                type = typeof(Nullable<>).MakeGenericType(type);
            }
            Columns.Add(new FastColumn {
                ColumnName = columnName,
                IsNullable = nullable,
                Schema = schema,
                ColumnData = (IList)Activator.CreateInstance(listType.MakeGenericType(type))!,
                ColumnType = type
            });
        }

Code that reads schema information from a DbDataReader and creates appropriate columns.

public static FastColumns FromSchema(DbDataReader reader) {
            var res = new FastColumns();
            var columnsCount = reader.FieldCount;
            var nullableNames = reader.GetSchemaTable()!.AsEnumerable()
                .ToDictionary(r => r["ColumnName"], r => (bool)r["AllowDBNull"]);
            Dictionary<object, DataRow?> schemaNames = reader.GetSchemaTable()!.AsEnumerable().ToDictionary(r => r["ColumnName"], r => r);

            for (var c = 0; c < columnsCount; c++) {
                var type = reader.GetFieldType(c);
                var name = reader.GetName(c);
                Console.WriteLine($"{name} {(nullableNames[name] ? "nullable" : "non-nullable")} {reader.GetDataTypeName(c)}");
                var column = reader.GetDataTypeName(c) switch {
                    "date" => typeof(DateOnly),
                    _ => type
                };
                res.Add(column, name, nullableNames[name], schemaNames[name]);
            }

            return res;
        }

Read database data as objects and add them to a row

        public static async Task<FastColumns> LoadFrom(DbDataReader reader) {
            var res = FromSchema(reader);
            var columnsCount = reader.FieldCount;

            var items = new object?[columnsCount];

            while (await reader.ReadAsync()) {
                object?[] values = new object?[columnsCount];
                for (var i = 0; i < columnsCount; i++) {
                    FastColumn c = res.Columns[i];
                    var nullable = c.IsNullable;
                    values[i] = reader.GetDataTypeName(i) switch {
                        "date" when nullable => reader.IsDBNull(i) ? null : reader.GetFieldValue<DateOnly>(i),
                        "date" when !nullable => reader.GetFieldValue<DateOnly>(i),
                        _ when nullable => reader.IsDBNull(i) ? null : reader[i],
                        _ => reader[i]
                    };
                }
                res.Rows.Add(values);
            }
            return res;
        }

Add a row using the List.Add function that takes an object. This will fail if the object doesn't match the runtime type of the target list.

            public void Add(object?[] items) {
                if (items.Length != fc.Columns.Count) {
                    throw new Exception("FastColumns.Rows.Adds: Incorrect number of items ");
                }
                for (var i = 0; i < items.Length; i++) {
                    fc.Columns[i].ColumnData!.Add(items[i]);
                }
            }

This turned out to be much faster than feeding data to a DataTable or DataFrame. Finally getting Arrays that can be fed do Parquet were simple.

        public Array ToArray(int i) {
            MethodInfo? toArrayListMethod = Columns[i].ColumnData.GetType().GetMethod("ToArray");
            return (Array)toArrayListMethod.Invoke(Columns[i].ColumnData, null)!;
        }

This code were not only much faster than DataTable/DataFrame but also handle DateOnly correctly. I was rather surprised that such a simple solution were faster as I expected that using reflection and passing objects when adding to list would be slow. But I figured out that DataTable and DataFrame uses even more reflection and passing around of objects.