apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C#] Add SetNull functionality to classes #40188

Open phartnett opened 9 months ago

phartnett commented 9 months ago

Describe the enhancement requested

Currently, the builders only let us Set valid values. Sometimes we want to mark a given index as null instead and update the corresponding entry in the validity bitmap.

The accompanying code implements SetNull for only a small set of types, but it should get people started.
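
To make the request concrete, here is a minimal, self-contained sketch of what a SetNull operation has to do: clear the validity bit for the slot and keep the null count in sync. `SketchInt32Builder` and all of its members are hypothetical stand-ins written for illustration. They are not the Apache.Arrow builder API and not the code from the pull request.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a primitive array builder; NOT the Apache.Arrow API.
public sealed class SketchInt32Builder
{
    private readonly List<int> _values = new List<int>();
    // One validity bit per slot, least-significant bit first (1 = valid, 0 = null).
    private readonly List<byte> _validity = new List<byte>();

    public int Length => _values.Count;
    public int NullCount { get; private set; }

    public SketchInt32Builder Append(int value)
    {
        int index = _values.Count;
        _values.Add(value);
        if (index % 8 == 0) _validity.Add(0); // start a new bitmap byte every 8 slots
        SetBit(index, true);
        return this;
    }

    // Overwrites a slot with a valid value and marks it valid.
    public SketchInt32Builder Set(int index, int value)
    {
        CheckIndex(index);
        if (!IsValid(index)) NullCount--;
        _values[index] = value;
        SetBit(index, true);
        return this;
    }

    // Marks a slot as null: clears its validity bit and zeroes the stored value.
    public SketchInt32Builder SetNull(int index)
    {
        CheckIndex(index);
        if (IsValid(index)) NullCount++;
        _values[index] = 0;
        SetBit(index, false);
        return this;
    }

    public bool IsValid(int index)
    {
        CheckIndex(index);
        return (_validity[index / 8] & (1 << (index % 8))) != 0;
    }

    public int? Get(int index) => IsValid(index) ? _values[index] : (int?)null;

    private void SetBit(int index, bool valid)
    {
        int mask = 1 << (index % 8);
        byte current = _validity[index / 8];
        _validity[index / 8] = valid ? (byte)(current | mask) : (byte)(current & ~mask);
    }

    private void CheckIndex(int index)
    {
        if (index < 0 || index >= _values.Count)
            throw new ArgumentOutOfRangeException(nameof(index));
    }
}
```

In this sketch, calling `SetNull(1)` on a builder holding three appended values leaves `Length` unchanged, raises `NullCount` to 1, and makes `Get(1)` return null. A later `Set(1, 7)` restores the slot and brings `NullCount` back to 0.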

Component(s)

C#

phartnett commented 9 months ago

https://github.com/apache/arrow/pull/40189

CurtHagenlocher commented 9 months ago

I think you can achieve what you're looking for by using Reserve instead of Resize and growing the builder allocations in chunks. I have a sample program that shows this using a TPC-H table stored in SQL Server. You can see that at https://gist.github.com/CurtHagenlocher/306865d4b4202906470f4f18fd410c4e.

I don't think there's anything wrong with adding SetNull to the builders, but on the whole I don't find the builders very ergonomic. In my sample code, for instance, I have to resize each array directly because Reserve and Resize are defined on a typed array interface instead of a shared base interface. I imagine this was done to allow "fluent"-style construction, which is not something I find compelling. In any event, backwards compatibility is very important, so changing what's there isn't really an option.

An alternative change that might help in this scenario is to add constructors to the builders that take an initial size, and then to double the capacity once it is reached instead of growing it by one. More generally, the builders could allow encapsulation of a "grow builder" strategy that's a little more practical than the default.
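
As a rough illustration of that idea, the sketch below shows a buffer that takes an initial capacity in its constructor and delegates growth to a pluggable strategy, with doubling as the default. `SketchGrowableBuffer<T>`, `DoubleCapacity`, and `ReallocationCount` are hypothetical names made up for this example. Nothing here reflects how the Apache.Arrow builders are actually implemented.

```csharp
using System;

// Hypothetical illustration of an encapsulated growth strategy; NOT the Apache.Arrow API.
public sealed class SketchGrowableBuffer<T>
{
    // Strategy signature: (currentCapacity, requiredCapacity) -> newCapacity.
    private readonly Func<int, int, int> _grow;
    private T[] _items;

    public int Length { get; private set; }
    public int Capacity => _items.Length;
    public int ReallocationCount { get; private set; } // exposed only to compare strategies

    public SketchGrowableBuffer(int initialCapacity = 16, Func<int, int, int> grow = null)
    {
        if (initialCapacity < 0)
            throw new ArgumentOutOfRangeException(nameof(initialCapacity));
        _items = new T[initialCapacity];
        _grow = grow ?? DoubleCapacity;
    }

    // Default strategy: double the capacity, but never return less than what is required.
    public static int DoubleCapacity(int current, int required) =>
        Math.Max(required, current == 0 ? 4 : current * 2);

    public void Append(T value)
    {
        if (Length == _items.Length)
        {
            // Guard against a strategy that returns too small a capacity.
            int newCapacity = Math.Max(Length + 1, _grow(_items.Length, Length + 1));
            Array.Resize(ref _items, newCapacity);
            ReallocationCount++;
        }
        _items[Length++] = value;
    }

    public T this[int index] =>
        index >= 0 && index < Length
            ? _items[index]
            : throw new ArgumentOutOfRangeException(nameof(index));
}
```

With an initial capacity of 4, appending 100 items under the doubling strategy reallocates 5 times (4 → 8 → 16 → 32 → 64 → 128). Passing a grow-by-one strategy such as `(current, required) => required` reallocates 96 times for the same input, which is the cost the suggestion aims to avoid.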