aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 141 forks source link

Class deserializer should not require ParquetRequired attribute for required strings #434

Open rachied opened 7 months ago

rachied commented 7 months ago

Library Version

4.17

OS

macOS, Windows

OS Architecture

64 bit

How to reproduce?

I'm trying to use class deserialization, but it seems like there is a bug with regards to required strings. The schema contains a required string field, lets call it MY_STR_FIELD for ease. And no matter if I define my class with a string or a string? (or even System.String) it fails with the same error:

System.IO.InvalidDataException
property 'MY_STR_FIELD' is declared as 'MY_STR_FIELD (System.String?)' but source data has it as 'MY_STR_FIELD (System.String)'
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetRowGroupReader rg, ParquetSchema schema, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetReader reader, Int32 rgi, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken)

Which I find strange, because surely a string fits into a string? right? I tried adjusting the compare method in DataField.cs to take nullability into account and at least continue on with parsing, but that results in other errors being thrown when decoding.

The parquet file contains sensitive info so I'm not able to share it. I'm working on creating a dummy file that exhibits the same behaviour. For reference, the ParquetViewer Online is able to read this parquet and list the schema. There it became clear that MY_STR_FIELD is the only string field with Definition Level 0 in the schema.

Screenshot 2023-11-21 at 16 47 31

I'm assuming this is synonymous to being 'required'. But again, this doesn't mean that my class MUST define non-nullable properties. And even if I did, the same error is raised.

Failing test

No response

yolave commented 7 months ago

I'm facing exactly the same problem. I have a class with all the properties and types of this parquet file. I started first setting them as non-mandatory (nullable) and then the serialization failed with the same message:

System.IO.InvalidDataException: property 'ForecastTimestamp' is declared as 'ForecastTimestamp (System.String?)' but source data has it as 'ForecastTimestamp (System.String)

Afterwards, I set all the properties of my class as mandatory (non-null) but I got the same message again.

Any clues on this?

rachied commented 7 months ago

@yolave my current workaround is to use the Table and Rows objects instead. Iterate over the table, for each row you can call .ToString() to get the JSON representation, and then simply JSON deserialize that.

Here's a snippet:


        using var expectedReader = await ParquetReader.CreateAsync(expectedStream);
        var expectedTable = await expectedReader.ReadAsTableAsync();

        var expectedItems = expectedTable
            .Select(x => x.ToString())
            .Select(JsonConvert.DeserializeObject<T>)
            .ToDictionary(x => x.GetPrimaryKey());
yolave commented 7 months ago

Thanks for sharing this workaround @rachied . I finally solved my problem after changing the schema of my parquet file. The original schema had all the properties as mandatory, so I changed to optional and then the Parquet library could deserialize it.

Maybe there's a problem in the library when trying to serialize a column and it's declared as mandatory in the parquet, because even declaring all the properties as non-null in a class, the library rises this exception up

aloneguid commented 6 months ago

You need to mark class field with [ParquetRequired] attribute, as in .NET strings can have null values by default. For some reason this was not added to the documentation, but only to release notes.

rachied commented 6 months ago

Can we support such type of schema mismatch without the need of this attribute? Would you welcome PR that facilitates this?

We process parquet files from external parties using a common schema but the nullability can sometimes vary on those fields. The fact that .NET strings can always have null values should not matter when deserializing as I understand it. You'd only need an explicit schema when serialising to parquet format.

aloneguid commented 6 months ago

Fair enough, for deserialisation it shouldn't matter. Feel free to create a PR of course, will merge it asap.