dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Add support for Apache.Arrow.Types.Decimal128Type #7094

Open piyushdubey opened 3 months ago

piyushdubey commented 3 months ago

Fixes #7082

piyushdubey commented 3 months ago

@asmirnov82 Can you please help review this? Additionally, I'd love your comments on what the difference between Decimal128 and Decimal256 Arrow type handling would be in a DataFrame?

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.

Project coverage is 68.47%. Comparing base (8b483f4) to head (b00af0e). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7094      +/-   ##
==========================================
- Coverage   68.82%   68.47%   -0.35%
==========================================
  Files        1255     1262       +7
  Lines      250358   254272    +3914
  Branches    25550    26237     +687
==========================================
+ Hits       172310   174117    +1807
- Misses      71438    73473    +2035
- Partials     6610     6682      +72
```

| Flag | Coverage Δ | |
|---|---|---|
| Debug | `68.47% <0.00%> (-0.35%)` | :arrow_down: |
| production | `62.87% <0.00%> (-0.39%)` | :arrow_down: |
| test | `88.56% <ø> (+0.04%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.

| Files | Coverage Δ | |
|---|---|---|
| src/Microsoft.Data.Analysis/DataFrame.Arrow.cs | `86.18% <0.00%> (-5.43%)` | :arrow_down: |

... and 64 files with indirect coverage changes
asmirnov82 commented 3 months ago

> @asmirnov82 Can you please help review this? Additionally, I'd love your comments on what the difference between Decimal128 and Decimal256 Arrow type handling would be in a DataFrame?

As far as I understand, Data.Analysis.DataFrame was initially designed to support the Apache.Arrow format and to allow constructing columns from Arrow arrays without memory copying (at least for reading; on any attempt to edit data, the Arrow buffers are copied anyway).

But at some point new columns were added to the DataFrame that didn't follow this rule, for example StringDataFrameColumn, DateTimeDataFrameColumn and DecimalDataFrameColumn. The underlying data format of these columns differs from Arrow's.

For Arrow strings, DataFrame provides an alternative implementation, ArrowStringDataFrameColumn (but it's immutable). DateTimeDataFrameColumn converts Arrow data to the .NET DateTime struct (the same approach that you use in this PR). Since copying happens on any attempt to edit the data anyway, I don't see any drawback in this.
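For context, the zero-copy path described above is the `DataFrame.FromArrowRecordBatch` entry point in Microsoft.Data.Analysis. A minimal sketch (the record-batch builder lambda shapes follow the Apache.Arrow C# API as I understand it):

```csharp
using System;
using Apache.Arrow;
using Microsoft.Data.Analysis;

class ArrowToDataFrameDemo
{
    static void Main()
    {
        // Build a small Arrow record batch in memory.
        RecordBatch batch = new RecordBatch.Builder()
            .Append("Ints", nullable: false,
                col => col.Int32(arr => arr.AppendRange(new[] { 1, 2, 3 })))
            .Build();

        // Wrap it in a DataFrame. Reads reuse the Arrow buffers where possible;
        // mutating a column copies them first.
        DataFrame df = DataFrame.FromArrowRecordBatch(batch);
        Console.WriteLine(df.Info());
    }
}
```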

One possible issue we have to pay attention to is that Arrow Decimal128 uses 16 bytes to store data, and the user is able to configure precision and scale. As far as I know, .NET Decimal has no concept of precision. It uses 3 DWORDs (12 bytes) to store the actual data, and therefore has a maximum precision of 28 (not the 38 that Arrow Decimal128 allows). So converting an Arrow Decimal128 to a .NET decimal may lose information.
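To make the potential loss concrete: `System.Decimal` stores a 96-bit integer significand plus a scale, giving it at most 28-29 significant digits, while Arrow Decimal128 allows up to 38. A quick sketch:

```csharp
using System;

class DecimalPrecisionDemo
{
    static void Main()
    {
        // .NET decimal: 96-bit significand => at most 29 significant digits.
        Console.WriteLine(decimal.MaxValue); // 79228162514264337593543950335

        // A 38-digit value, valid for an Arrow Decimal128 with precision 38,
        // does not fit into System.Decimal:
        try
        {
            decimal.Parse("99999999999999999999999999999999999999");
        }
        catch (OverflowException)
        {
            Console.WriteLine("38-digit Arrow decimal overflows System.Decimal");
        }
    }
}
```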

As for the Decimal256 implementation, I see 2 possibilities: using SqlDecimal from the System.Data.SqlTypes namespace, or waiting for Decimal256 support in .NET. There is a proposal for supporting new IEEE-compliant decimal types: https://github.com/dotnet/runtime/issues/81376
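For reference, `SqlDecimal` supports a precision of up to 38, which matches Arrow Decimal128 (at the cost of a wider in-memory representation). A small sketch:

```csharp
using System;
using System.Data.SqlTypes;

class SqlDecimalDemo
{
    static void Main()
    {
        // SqlDecimal can hold the full 38 digits that Arrow Decimal128 allows.
        SqlDecimal wide = SqlDecimal.Parse("99999999999999999999999999999999999999");
        Console.WriteLine(wide.Precision);          // 38
        Console.WriteLine(SqlDecimal.MaxPrecision); // 38

        // Trade-off: SqlDecimal occupies more than 128 bits per value and is
        // not zero-copy compatible with Arrow buffers.
    }
}
```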

@luisquintanilla @JakeRadMSFT @ericstj could you please provide your thoughts?

luisquintanilla commented 3 months ago

> As for the Decimal256 implementation, I see 2 possibilities: using SqlDecimal from the System.Data.SqlTypes namespace, or waiting for Decimal256 support in .NET. There is a proposal for supporting new IEEE-compliant decimal types: dotnet/runtime#81376

I don't think that proposal is something that's being actively worked on at this time, so probably not something that would happen in the .NET 9 timeframe.

ericstj commented 3 months ago

> But at some point new columns were added to the DataFrame that didn't follow this rule, for example StringDataFrameColumn, DateTimeDataFrameColumn and DecimalDataFrameColumn. The underlying data format of these columns differs from Arrow's. So converting an Arrow Decimal128 to a .NET decimal may lose information.

That's worthwhile to call out in the docs. We should update the API docs here: https://github.com/dotnet/machinelearning/blob/19fb80580118e850493168a6c9928a0135d793e4/src/Microsoft.Data.Analysis/DataFrame.Arrow.cs#L151 https://github.com/dotnet/machinelearning/blob/19fb80580118e850493168a6c9928a0135d793e4/src/Microsoft.Data.Analysis/DataFrame.Arrow.cs#L171

FWIW it does look like Arrow has precedent for this conversion, though also offers SqlDecimal. Neither can be done with zero-copy, but the latter wouldn't lose precision (I think): https://github.com/apache/arrow/blob/2babda0ba22740c092166b5c5d5d7aa9b4797953/csharp/src/Apache.Arrow/Arrays/Decimal128Array.cs#L42-L63
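The linked Arrow code exposes both read paths side by side. A sketch of how that looks from the consumer side, assuming the Apache.Arrow C# APIs (`Decimal128Array.Builder`, `GetValue`, `GetSqlDecimal`) behave as in the linked source:

```csharp
using System;
using Apache.Arrow;
using Apache.Arrow.Types;

class ArrowDecimal128Demo
{
    static void Main()
    {
        // Build a Decimal128 array with precision 38, scale 2.
        var builder = new Decimal128Array.Builder(new Decimal128Type(38, 2));
        builder.Append(123.45m);
        Decimal128Array array = builder.Build();

        // Read back as System.Decimal (can overflow for values wider than
        // ~28-29 digits) ...
        decimal? asDecimal = array.GetValue(0);

        // ... or as SqlDecimal, which preserves the full 38-digit precision.
        var asSqlDecimal = array.GetSqlDecimal(0);

        Console.WriteLine(asDecimal);
        Console.WriteLine(asSqlDecimal);
    }
}
```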

Another option would be to expose some types unique to this assembly that are meant to represent this data without copying. Not saying we should, but that's a possibility. Those types would probably be better off living in the Arrow project, if added at all.

@tannergooding and @eerhardt might have more thoughts on this.

tannergooding commented 3 months ago

If there's a need for the IEEE 754 Decimal128 type purely as an interchange type, that is itself feasible to add in .NET 9.

It just isn't in scope to add the entirety of the approved API surface area, including operators.

CurtHagenlocher commented 3 months ago

> FWIW it does look like Arrow has precedent for this conversion, though also offers SqlDecimal. Neither can be done with zero-copy, but the latter wouldn't lose precision (I think): https://github.com/apache/arrow/blob/2babda0ba22740c092166b5c5d5d7aa9b4797953/csharp/src/Apache.Arrow/Arrays/Decimal128Array.cs#L42-L63

Yes, SqlDecimal wouldn't lose precision for Arrow's decimal128, but that's at the cost of occupying 160 bits instead of 128. The IEEE Decimal128 isn't wide enough to hold the 38 digits of decimal precision supported by Arrow, though it is (as noted) better than the CLR decimal. (I think there were some backwards-compatibility considerations in CLR decimal for OLEDB.)

There's an interesting tension between traditional database representations of decimal values and the requirements of general-purpose programming languages. In a database, you're always operating in the context of a particular column so you can factor out properties like precision and scale to optimize storage. In a programming language, individual values need to carry that context with them because most type systems aren't designed to support that kind of parameterization. So you either end up needing extra bytes (like SqlDecimal) or potentially losing precision (like IEEE Decimal128) when working with database decimals. And because most database decimal values don't actually need 38 digits of precision, the latter often ends up looking more attractive.

I like to think that if you squint a little, you can almost see the historical split between FORTRAN and COBOL (and their associated use cases) in this topic.

One nice thing about incorporating precision and scale into the type is that the values are automatically normalized. This is super important in a database representation because comparisons and simple computation (like sums) are going to be the majority of operations performed on these values.
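A toy sketch of why baking precision and scale into the type helps: once the scale is fixed by the column, every value is just an unscaled integer, so comparison and summation become plain integer operations with no per-value normalization (illustrative only; real engines use 128- or 256-bit integers):

```csharp
using System;

class ScaledDecimalDemo
{
    static void Main()
    {
        // Column type: decimal(precision, scale: 2) => store unscaled integers.
        long a = 12345; // represents 123.45
        long b = 12340; // represents 123.40

        Console.WriteLine(a > b);  // comparison is just integer comparison: True
        Console.WriteLine(a + b);  // 24685, i.e. 246.85 at the same scale
    }
}
```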

Anyway, sorry for the digression. Representation of numbers in a computer is a fun topic.