aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 140 forks source link

[BUG]: Deserializing nonnullable string field into nullable string property throws exception #512

Closed rasmushalland closed 4 weeks ago

rasmushalland commented 1 month ago

Library Version

4.23.5

OS

Windows 11

OS Architecture

64 bit

How to reproduce?

  1. Have a parquet file with a string column that is not nullable.
  2. Use ParquetSerializer.DeserializeAsync to deserialize.
  3. Boom. With the attached test I get the following exception:
System.IO.InvalidDataException
property 'S1' is declared as 'S1 (System.String?)' but source data has it as 'S1 (System.String)'
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetRowGroupReader rg, ParquetSchema schema, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken, Boolean resultsAlreadyAllocated) in C:\Projects\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 480
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetReader reader, Int32 rgi, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken, Boolean resultsAlreadyAllocated) in C:\Projects\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 448
   at Parquet.Serialization.ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken) in C:\Projects\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 285
   at Parquet.Test.Serialisation.ParquetSerializerTest.NotNullableString() in C:\Projects\parquet-dotnet\src\Parquet.Test\Serialisation\ParquetSerializerTest.cs:line 884
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

Failing test

sealed class NotNullableString_Row {
            public string? S1 { get; set; }
        }

        [Fact]
        public void NotNullableString() {
            var schema = new ParquetSchema(
                new DataField("S1", typeof(string), isNullable: false));

            var ms = new MemoryStream();
            {
                using ParquetWriter writer = ParquetWriter.CreateAsync(schema, ms).GetAwaiter().GetResult();
                using ParquetRowGroupWriter groupWriter = writer.CreateRowGroup();

                groupWriter.WriteColumnAsync(
                        new DataColumn(
                            schema.DataFields[0],
                            new[] { "x" }))
                    .GetAwaiter()
                    .GetResult();
            }
            {
                ms.Position = 0;
                // "System.IO.InvalidDataException : property 'S1' is declared as 'S1 (System.String?)' but source data has it as 'S1 (System.String)'"
                IList<NotNullableString_Row> de = ParquetSerializer.DeserializeAsync<NotNullableString_Row>(ms).GetAwaiter().GetResult();
                Assert.Equal(1, de.Count);
            }
        }
aloneguid commented 4 weeks ago

This is not a bug. In parquet data types must match, it's a stronly typed format.

rasmushalland commented 3 weeks ago

Fair point. Is this degree of strictness a parquet-the-format thing or a parquet.net-thing? I mean, many people would consider c# a strongly typed language, and c# allows assignment from not-nullable-string to nullable-string.

aloneguid commented 3 weeks ago

Fair point. Is this degree of strictness a parquet-the-format thing or a parquet.net-thing? I mean, many people would consider c# a strongly typed language, and c# allows assignment from not-nullable-string to nullable-string.

Its a parquet format thing, not c# specific.