aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
541 stars 140 forks source link

Serialization of interface typed members #513

Open Pragmateek opened 1 month ago

Pragmateek commented 1 month ago

Issue description

Hello,

Is there a way to serialize types whose any member is interface typed? I get a NotImplementedException in MakeField.

I think I understand the logical issue: in a set of objects we could have different implementations hence Parquet columns/schemas. But in the case there is a single implementation is there a way to force the serialization?

Thanks in advance,

Mickael

using Parquet.Serialization;

//using MemoryStream stream = new();
//await ParquetSerializer.SerializeAsync([new LineOK ()], stream);

using MemoryStream stream = new();
await ParquetSerializer.SerializeAsync([new LineKO()], stream);

interface IPoint
{
    decimal X { get; set; }
    decimal Y { get; set; }
}

class Point : IPoint
{
    public decimal X { get; set; }
    public decimal Y { get; set; }
}

class LineOK
{
    public Point Start { get; } = new();
    public Point End { get; } = new();
}

class LineKO
{
    public IPoint Start { get; } = new Point();
    public IPoint End { get; } = new Point();
}
aloneguid commented 1 month ago

Hi. Thanks a lot for the contribution!

You can serialize interfaces, but they should be explicitly set in the IEnumerable parameter. And to deserialize, you need to specify the result class type. Here is an example:

  interface IInterface {
      int Id { get; set; }
  }

  class InterfaceImpl : IInterface {
      public int Id { get; set; }
  }

  [Fact]
  public async Task Interface_Serialize() {

      // explicit IInterface[] type, but elements can be any subtype
      var data = new IInterface[] {
          new InterfaceImpl { Id = 1 },
          new InterfaceImpl { Id = 2 },
      };

      using var ms = new MemoryStream();
      await ParquetSerializer.SerializeAsync(data, ms);

      ms.Position = 0;
      // explicit subtype
      IList<InterfaceImpl> data2 = await ParquetSerializer.DeserializeAllAsync<InterfaceImpl>(ms).ToArrayAsync();

      Assert.Equivalent(data, data2);

  }
Pragmateek commented 1 month ago

My pleasure.

Thanks for your answer but my issue is not with the type of the root objects but one of their properties.

To follow up with your model if Id is typed as an interface IId:

using Parquet.Serialization;
using System.Collections;

var data = new IInterface[] {
          new InterfaceImpl { Id = new IdImpl{ Id = 1 } },
          new InterfaceImpl { Id = new IdImpl { Id = 2 } },
      };

using var ms = new MemoryStream();
await ParquetSerializer.SerializeAsync(data, ms);

ms.Position = 0;
// explicit subtype
var data2 = await ParquetSerializer.DeserializeAsync<InterfaceImpl>(ms);

interface IId
{
    int Id { get; set; }
}

class IdImpl : IId
{
    public int Id { get; set; }
}

interface IInterface
{
    IId Id { get; set; }
}

class InterfaceImpl : IInterface
{
    public IId Id { get; set; }
}
aloneguid commented 1 month ago

I see! Sorry for misunderstanding. I've made a fix for this, which allows to serialize interface properties, so this works:

 interface IInterface {
     int Id { get; set; }
 }

 interface IRootInterface {
     IInterface Child { get; set; }
 }

 class InterfaceImpl : IInterface {
     public int Id { get; set; }
 }

 class RootInterfaceImpl : IRootInterface {
     public IInterface Child { get; set; } = new InterfaceImpl();
 }

var data = new IRootInterface[] {
    new RootInterfaceImpl { Child = new InterfaceImpl { Id = 1 } },
    new RootInterfaceImpl { Child = new InterfaceImpl { Id = 2 } },
};

using var ms = new MemoryStream();
await ParquetSerializer.SerializeAsync(data, ms);

Deserialization won't work though, as we don't know the implementation type of IInterface, but you can use the following workaround:

class ConcreteRootInterfaceImpl {
    public InterfaceImpl Child { get; set; } = new InterfaceImpl();
}

IList<ConcreteRootInterfaceImpl> data2 = await ParquetSerializer.DeserializeAllAsync<ConcreteRootInterfaceImpl>(ms).ToArrayAsync();

Please try -pre.3 version from nuget.

Pragmateek commented 1 month ago

Thanks for the fix, indeed serialization is now working with -pre.4 version.

It would be great if we could configure or hook into the deserialization process to tell the serializer which implementation to use when we know it. Because, even if sometimes we only need to serialize, in order to check that the serialization is OK we must do a roundtrip test with deserialization to ensure that the input and output objects are the same. Maybe something similar to DI container like ParquetSerializer.For<ISomething>().Use<Something>().

aloneguid commented 4 weeks ago

Agreed, I'll sort this out soon.

Pragmateek commented 3 weeks ago

Ideally the choice of the implementation to use would be dynamic by calling into a callback that could decide based on the other properties of the Parquet record. So it means that properties with concrete types should be deserialized first to give the callback the opportunity to leverage them to determine the correct implementation to use.