aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Support for serialization of records with no default constructor #510

Open danielearwicker opened 1 month ago

danielearwicker commented 1 month ago

C# 9 added record, which formalises a pattern where a type has a "primary constructor" whose parameter names and types exactly match the names and types of its public properties, which are init-only. This implies a corresponding serialisation/deserialisation pattern, where serialisation reads the public properties and deserialisation calls the primary constructor.
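As a minimal illustration of that pattern (the `Person` type and its members are hypothetical, not taken from the PR), a positional record and the round trip it implies might look like this:

```csharp
using System;

// A positional record: each primary-constructor parameter becomes a
// public init-only property with the same name and type.
public record Person(string Name, int Age);

public static class RecordPatternDemo
{
    // Serialisation reads the public properties; deserialisation feeds
    // the same values back through the primary constructor.
    public static Person RoundTrip(Person source)
    {
        string name = source.Name;
        int age = source.Age;
        return new Person(name, age);
    }

    public static void Main()
    {
        var original = new Person("Ada", 36);
        Person copy = RoundTrip(original);
        // Records compare by value, so the copy equals the original.
        Console.WriteLine(copy == original);
    }
}
```

There is no parameterless constructor and no public setter here, which is why a deserialiser that expects `new T()` followed by property writes cannot handle such a type as-is.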

This PR extends Parquet.NET to support serialising/deserialising records, as well as ordinary classes that have no default constructor and follow the same pattern as records in the naming of constructor parameters.

Altering the Parquet.NET serialisation code to support this pattern directly is next to impossible, because the deserialiser necessarily reads a whole column of a row-group at a time and updates the corresponding property on objects that have already been allocated, i.e. it needs objects with write-enabled properties.

But if each record type R had a corresponding placeholder type P that had the necessary properties, deserialisation could perform a first pass that constructs a set of P, and then a second pass that constructs each R from a P. The drawback of this is that it implies a lot of additional allocation for large row-groups.

But there is a solution that avoids this: we can use R itself as the placeholder type. The CLR provides a way to allocate an instance of a type without yet calling its constructor (call this a "pre-constructed" object). Its fields/properties will have default values, exactly as they do at the start of the constructor. So a pre-constructed R can be safely deserialised into.
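A small sketch of what a pre-constructed object looks like in practice. It assumes `RuntimeHelpers.GetUninitializedObject` as the allocation API (the PR may use a different one), and the `Point` record is hypothetical:

```csharp
using System;
using System.Runtime.CompilerServices;

public record Point(int X, int Y);

public static class PreConstructedDemo
{
    // Allocates a Point without running any constructor, then fills one
    // property through reflection, which is able to call init-only setters.
    public static Point AllocateAndFill(int x)
    {
        var p = (Point)RuntimeHelpers.GetUninitializedObject(typeof(Point));
        // Here p.X and p.Y are both 0: the memory is zero-initialised.
        typeof(Point).GetProperty(nameof(Point.X))!.SetValue(p, x);
        return p;
    }

    public static void Main()
    {
        Point p = AllocateAndFill(3);
        Console.WriteLine($"{p.X}, {p.Y}");
    }
}
```

The object is a genuine `Point` as far as the type system is concerned, so column values can be written into it one property at a time before any constructor has run.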

The second pass executes the constructor on every R instance in the row-group, passing the property values into the primary constructor. A ConstructorInfo can be called via reflection exactly like an instance method, running the constructor on a pre-constructed object, so no new object is allocated. This re-assigns to each property the value it already holds, which is redundant but unavoidable: the constructor still has to run, because the primary constructor of a record can contain additional user-defined code to initialise fields from the parameter values (see tests in this PR).
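The two passes can be sketched with plain reflection. This is an illustration under assumptions rather than the PR's code: the `Temperature` record is hypothetical, and `RuntimeHelpers.GetUninitializedObject` stands in for whichever allocation API the PR actually uses.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public record Temperature(double Celsius)
{
    // User-defined initialisation that only runs inside the constructor.
    public double Fahrenheit { get; } = Celsius * 9 / 5 + 32;
}

public static class TwoPassDemo
{
    public static Temperature Rehydrate(double celsius)
    {
        // Pass 1: allocate without constructing, then write the column value.
        var t = (Temperature)RuntimeHelpers.GetUninitializedObject(typeof(Temperature));
        typeof(Temperature).GetProperty(nameof(Temperature.Celsius))!.SetValue(t, celsius);
        // Fahrenheit is still 0 here because no constructor has run yet.

        // Pass 2: run the primary constructor on the existing instance.
        // Invoke(object, object[]) treats the constructor like an instance
        // method, so no second object is allocated.
        ConstructorInfo ctor = typeof(Temperature).GetConstructor(new[] { typeof(double) })!;
        ctor.Invoke(t, new object[] { t.Celsius });
        return t;
    }

    public static void Main()
    {
        Temperature t = Rehydrate(100.0);
        Console.WriteLine($"{t.Celsius} C = {t.Fahrenheit} F");
    }
}
```

After the second pass `Fahrenheit` holds the derived value, which is the reason the constructor has to run even though `Celsius` was already populated.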

To make this fast, code-generation can be used, as it is in existing Parquet.NET serialisation. The Expression-based approach has a limitation: it can't invoke a constructor on a pre-constructed object. But IL-generation (like reflection) has no such limitation. So a "post-constructor" operation can be generated for a type.
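A rough sketch of how such a post-constructor could be emitted with `DynamicMethod`. The names `PostConstructor.Build` and `Temperature` are hypothetical, the sketch assumes the type has exactly one public constructor whose parameter names match its property names, and it ignores nested types entirely:

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Runtime.CompilerServices;

public record Temperature(double Celsius)
{
    public double Fahrenheit { get; } = Celsius * 9 / 5 + 32;
}

public static class PostConstructor
{
    // Builds a delegate that re-runs T's constructor on an existing instance,
    // feeding each parameter from the property of the same name.
    public static Action<T> Build<T>() where T : class
    {
        // Assumption: the first public constructor is the primary one.
        ConstructorInfo ctor = typeof(T).GetConstructors()[0];

        var method = new DynamicMethod(
            "PostConstruct_" + typeof(T).Name, typeof(void),
            new[] { typeof(T) }, typeof(T).Module, skipVisibility: true);
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);                  // 'this' for the constructor
        foreach (ParameterInfo p in ctor.GetParameters())
        {
            MethodInfo getter = typeof(T).GetProperty(p.Name!)!.GetGetMethod()!;
            il.Emit(OpCodes.Ldarg_0);
            il.Emit(OpCodes.Callvirt, getter);     // push the current property value
        }
        il.Emit(OpCodes.Call, ctor);               // call, not newobj: no allocation
        il.Emit(OpCodes.Ret);

        return (Action<T>)method.CreateDelegate(typeof(Action<T>));
    }
}

public static class IlDemo
{
    public static void Main()
    {
        var t = (Temperature)RuntimeHelpers.GetUninitializedObject(typeof(Temperature));
        typeof(Temperature).GetProperty(nameof(Temperature.Celsius))!.SetValue(t, 100.0);

        Action<Temperature> post = PostConstructor.Build<Temperature>();
        post(t);
        Console.WriteLine(t.Fahrenheit);
    }
}
```

The key difference from an Expression tree is the `call` instruction targeting the constructor: `Expression.New` always compiles to `newobj`, which allocates a fresh object.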

If a type has a default (no-params) constructor, that constructor continues to be used and the type does not require post-construction.

Even so, a type may contain nested type references within it (e.g. a property that is a list of records), and this case must also be handled by generated code that visits records nested within the hierarchy.

If the type's full hierarchy does not contain any types requiring post-construction, the post-constructor operation is generated as a no-op. This should be the case for all existing client code of Parquet.NET.

Until now serialisation methods have constrained the type with T : new(). This restriction is removed in this PR, but a suitably relaxed check is performed at runtime.
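The exact runtime check is not shown here. One plausible shape for it, purely as an assumption (the name `CanDeserialise` is hypothetical), is to accept a type if it has a public parameterless constructor, or a public constructor whose every parameter matches a public property by name and type:

```csharp
using System;
using System.Linq;
using System.Reflection;

public record Person(string Name, int Age);

public class Opaque
{
    // The parameter does not correspond to any public property.
    public Opaque(int seed) { }
}

public static class ConstructionCheck
{
    public static bool CanDeserialise(Type type)
    {
        // Case 1: the old new() requirement is still sufficient.
        if (type.GetConstructor(Type.EmptyTypes) != null) return true;

        // Case 2: a record-style constructor, where each parameter has a
        // public property with the same name and type.
        return type.GetConstructors().Any(ctor =>
            ctor.GetParameters().All(p =>
            {
                var prop = type.GetProperty(p.Name ?? "");
                return prop != null && prop.PropertyType == p.ParameterType;
            }));
    }

    public static void Main()
    {
        Console.WriteLine(CanDeserialise(typeof(Person)));
        Console.WriteLine(CanDeserialise(typeof(Opaque)));
    }
}
```

A check like this moves the failure from compile time to the first call, so a clear exception message naming the offending type would matter more than it did under the generic constraint.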

Note that recursive types (e.g. tree of nodes) cannot be serialised, but the code-gen could be enhanced to allow this if required.

aloneguid commented 1 month ago

I think this is great but I need to think about it and maybe postpone till v5. Records are not a natural fit yet due to the limitations you have mentioned, but I'd love to support this.

danielearwicker commented 1 month ago

I understand, it's a fairly hefty bit of new code and a quirky way of constructing objects.

In the meantime I have some simple methods (also using code gen) in my own codebase that work with records, but they can only handle simple properties, i.e. they don't implement Dremel at all. It would be great to have a single serialize/deserialize system that fully supports hierarchical data and works with immutable records as well.