dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

[Proposal] DataFrame Arithmetic and Computation API #6905

Open asmirnov82 opened 11 months ago

asmirnov82 commented 11 months ago

Background and motivation

The current arithmetic and computation API of the DataFrame is inconsistent and quite slow in scenarios that involve columns of different types, because casting a column to a different type requires copying the entire column's data.

Moreover, some methods of the computation API may produce inaccurate results (due to arithmetic overflow).

The motivation for this change includes:

  1. reusing generic operators from the future TensorPrimitives API for fast computation over DataFrame columns of different types (and possibly even chaining such operations to avoid intensive memory usage for storing intermediate calculation results)
  2. accurate calculation of aggregation functions
  3. increased performance, using SIMD instructions for a significant part of DataFrame operations
  4. reduced DataFrame code duplication

Details of current implementation limitations

  1. Inconsistency of the API is related to the types of data returned by arithmetic operations over PrimitiveDataFrameColumn<T> instances and their concrete aliases (like DoubleDataFrameColumn or Int32DataFrameColumn)

For example

var left_column = new ByteDataFrameColumn("Left", new byte[] { 1, 2, 3 });
var right_column = new Int16DataFrameColumn("Right", new short[] { 1, 2, 3 });

var sum = left_column.Add(right_column);

results in a sum column containing int values.

The same code, but referring to the columns via their parent type,

PrimitiveDataFrameColumn<byte> left_column = new ByteDataFrameColumn("Left", new byte[] { 1, 2, 3 });
PrimitiveDataFrameColumn<short> right_column = new Int16DataFrameColumn("Right", new short[] { 1, 2, 3 });

var sum = left_column.Add(right_column);

results in a sum column containing double values.

  2. Inaccurate results of aggregation functions. Currently, aggregation functions are calculated using the same type as the input column; for a byte column that is byte. So

var column = new ByteDataFrameColumn("Column", new byte[] { 128, 129 });

var sum = column.Sum();
var mean = column.Mean();

results in sum equal to 1 and mean equal to 0.5 (128 + 129 = 257 does not fit in a byte and wraps around to 1), which is obviously incorrect.

  3. Performance and excessive copying issues are well known and mentioned in #5663
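The aggregation wrap-around above can be reproduced in plain C#, without the DataFrame library. This is a minimal sketch (class and variable names are illustrative) contrasting accumulation in the input type with accumulation in the widened type the proposal suggests for unsigned integers:

```csharp
using System;
using System.Linq;

class AggregationOverflowDemo
{
    static void Main()
    {
        byte[] values = { 128, 129 };

        // Accumulating in the input type wraps around: 128 + 129 = 257,
        // which does not fit in a byte and truncates to 1 (257 mod 256).
        byte narrowSum = 0;
        foreach (byte v in values)
            narrowSum += v;

        // Accumulating in a wider type (ulong for unsigned integer inputs,
        // as proposed) preserves the exact result.
        ulong wideSum = values.Aggregate(0UL, (acc, v) => acc + v);

        Console.WriteLine(narrowSum); // 1
        Console.WriteLine(wideSum);   // 257
    }
}
```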

## Proposal

The proposal is:

1. to use the common numeric type for the arithmetic output (the smallest numeric type that can accommodate any value of any input). If any input is a floating-point type, the common numeric type is the widest floating-point type among the inputs; otherwise the common numeric type is integral, and signed if any input is signed.

2. to return the widest possible type for accumulating aggregation functions (like sum and product). That is double for floating-point types (double, float, Half), long for signed integers (long, int, short, sbyte) and ulong for unsigned integers (ulong, uint, ushort, byte); other aggregation functions (like min, max and first) keep the input type.

This approach is used in [Apache Arrow Compute functions](https://arrow.apache.org/docs/cpp/compute.html)
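As a plain C# sketch of the promotion rule for integral and floating-point inputs (the `CommonType.Resolve` helper is hypothetical, not part of any existing API; mixing 64-bit signed and unsigned inputs falls back to long, matching Arrow's behavior):

```csharp
using System;

static class CommonType
{
    // Sketch of the proposed "common numeric type" rule: floating point wins
    // and takes the widest float width; otherwise the result is integral,
    // wide enough for every input, and signed if any input is signed.
    public static Type Resolve(Type left, Type right)
    {
        if (left == typeof(double) || right == typeof(double)) return typeof(double);
        if (left == typeof(float) || right == typeof(float)) return typeof(float);

        bool signed = IsSigned(left) || IsSigned(right);
        int bits = Math.Max(Bits(left), Bits(right));

        // An unsigned input as wide as the tentative result forces one more
        // widening step, so its full range still fits in the signed type.
        if (signed && (!IsSigned(left) && Bits(left) == bits
                    || !IsSigned(right) && Bits(right) == bits))
            bits *= 2;

        return (signed, bits) switch
        {
            (true, <= 8) => typeof(sbyte),
            (true, 16) => typeof(short),
            (true, 32) => typeof(int),
            (true, _) => typeof(long),   // 64-bit and the saturated 128-bit case
            (false, <= 8) => typeof(byte),
            (false, 16) => typeof(ushort),
            (false, 32) => typeof(uint),
            (false, _) => typeof(ulong),
        };
    }

    static bool IsSigned(Type t) =>
        t == typeof(sbyte) || t == typeof(short) || t == typeof(int) || t == typeof(long);

    static int Bits(Type t) =>
        t == typeof(byte) || t == typeof(sbyte) ? 8 :
        t == typeof(ushort) || t == typeof(short) ? 16 :
        t == typeof(uint) || t == typeof(int) ? 32 : 64;

    public static void Main()
    {
        Console.WriteLine(Resolve(typeof(byte), typeof(short)));  // System.Int16
        Console.WriteLine(Resolve(typeof(byte), typeof(sbyte)));  // System.Int16
        Console.WriteLine(Resolve(typeof(float), typeof(long)));  // System.Single
    }
}
```

Under this rule the byte + short example above would produce a short column (byte's full range fits in short), instead of int or double as the current API returns.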

## Implementation

Implementation can reuse the low level parts of the future generic TensorPrimitives API, like `IUnaryOperator<T>`, `IBinaryOperator<T>` and their concrete implementations.

Operators allow invoking highly efficient vectorized calculations over arguments of the same type. The new implementation should extend this functionality by providing efficient vectorized converters and conversion rules for cases where the arguments have different data types.
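The shape of such an executor could look roughly like the following sketch. The `IBinaryOperator<T>` interface here is a stand-in for the future generic TensorPrimitives interface mentioned above, and `OperatorExecutor` is a hypothetical name; a real implementation would convert and compute in SIMD-sized chunks rather than element by element (requires .NET 7+ generic math):

```csharp
using System;
using System.Numerics;

// Stand-in for the future generic TensorPrimitives binary operator.
interface IBinaryOperator<T> where T : struct
{
    T Invoke(T x, T y);
}

struct AddOperator<T> : IBinaryOperator<T>
    where T : struct, IAdditionOperators<T, T, T>
{
    public T Invoke(T x, T y) => x + y;
}

static class OperatorExecutor
{
    // Widen both inputs to the common type TResult, then apply the operator.
    public static TResult[] Execute<TLeft, TRight, TResult, TOp>(
        TLeft[] left, TRight[] right, TOp op)
        where TLeft : struct, INumber<TLeft>
        where TRight : struct, INumber<TRight>
        where TResult : struct, INumber<TResult>
        where TOp : struct, IBinaryOperator<TResult>
    {
        var result = new TResult[left.Length];
        for (int i = 0; i < left.Length; i++)
        {
            result[i] = op.Invoke(
                TResult.CreateChecked(left[i]),
                TResult.CreateChecked(right[i]));
        }
        return result;
    }
}
```

For instance, adding a byte array to a short array with int as the common type would be `OperatorExecutor.Execute<byte, short, int, AddOperator<int>>(left, right, default)`.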

## Example

I created a [POC](https://github.com/asmirnov82/compute-functions) of a possible implementation.
Apache Arrow doesn't have a .NET implementation of Compute Functions, so I took it on as an exercise for the POC. DataFrame follows the same paradigm, so all the ideas can be applied to it as well.

I extended several operators from the TensorPrimitives API to have generic versions (this illustrates what the future generic Tensor API could look like and how it could be used). I also created several OperatorExecutor and Function classes that provide the argument type conversion rules for several cases (binary arithmetic calculations and two kinds of aggregation functions), as well as a set of wideners and converters that leverage SIMD vectorized calculation as an example.

davesearle commented 11 months ago

This is great. Could the DataFrame also support streaming operations (similar to how Apache Arrow Acero is architected)? I've recently been looking to implement support for Acero in the C# Arrow client: https://github.com/apache/arrow/pull/37544

asmirnov82 commented 11 months ago

@davesearle DataFrame was designed to handle situations where all the data is in memory, so streaming was not the primary goal. However, DataFrame allows conversion to a collection of Arrow RecordBatches (in most cases without memory copying), so it can potentially be used as an input for an Acero record_batch_source.