aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 141 forks source link

[BUG]: Unable to deserialize class with Array #456

Closed smolesen closed 5 months ago

smolesen commented 5 months ago

Library Version

4.22.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

  1. Create a class with an array of int or double
  2. Serilize the class into a memorystream using: await ParquetSerializer.SerializeAsync(new List { new EmptyArrayClass { Name = "SomeName", Values = new [] { 0 } } }, stream);
  3. Update position in memorystream to 0
  4. Deserialize stream into object, this will throw a 'Value cannot ne null'

Failing test

public class EmptyArrayClass
   {
       public string Name { get; set; }
       public int[] Values { get; set; }
   }

   [Test]
   public async Task Deserialize_Empty_Array()
   {
       var stream = new MemoryStream();
       await ParquetSerializer.SerializeAsync(new List<EmptyArrayClass> { new EmptyArrayClass { Name = "SomeName", Values = new [] { 0 } } }, stream);
       stream.Position = 0;
       var result = ParquetSerializer.DeserializeAsync<EmptyArrayClass>(stream);
   }
aloneguid commented 5 months ago

I appreciate your thorough report, it makes the bug fixing process much easier. I'm working on it right now and I'll keep you updated on the progress.

aloneguid commented 5 months ago

@smolesen as a workaround you can replace int[] with List<int> as arrays are not supported by serializer yet.

smolesen commented 5 months ago

Yeah, I tried that, it just gives me a lot of problems in the rest of the code.... of course I could deserialize as List and copy it to another object with int[] BTW: It used to work ... in version: 3.8.6 /Søren

aloneguid commented 5 months ago

@smolesen in 3.8.6 it was extremely inefficient and only worked for root-level arrays :(

smolesen commented 5 months ago

Aha I see...

I have a lot of data written with 3.8.6, do you think I'll be able to deserialize it, if I use List or will I have to convert it first?

aloneguid commented 5 months ago

Yes, but depending on how it was written you might need to mark arrays as legacy (if they are), see https://aloneguid.github.io/parquet-dotnet/serialisation.html#legacy-repeatable-legacy-arrays

aloneguid commented 5 months ago

@smolesen the complexity of deserializing into array is we don't know beforehand the size of the array, so deserialize cannot pre-allocate correct size. V3 was only operating on simple properties and was deserializing arrays into list of lists first, and then converting to arrays. So it will be more efficient to use List<T> in the beginning.

The only workaround in deserializer i'm thinking of is allocating empty array, then resizing it every time a new element comes in, but that might quickly put a lot of pressure on GC.

aloneguid commented 5 months ago

so it's something like this for each element:

image
smolesen commented 5 months ago

Ok, think I found a workable way to use List<> instead, so I'll stop using array.

Thanks for looking into it though....

aloneguid commented 5 months ago

Latest version also supports arrays of primitives btw if you still need it.

smolesen commented 5 months ago

Great and thanks....