dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.91k stars 1.86k forks source link

DataFrame doesn't decode boolean arrays correctly from Arrow #7115

Open johnnyg opened 2 months ago

johnnyg commented 2 months ago

System Information (please complete the following information):

Describe the bug Creating a dataframe from an arrow record batch where a column is a boolean array produces incorrect results (and occasionally even throws exceptions).

To Reproduce Run:

public void DataFrameDecodeBooleans()
{
    BooleanArray boolArray = new BooleanArray.Builder().AppendNull().Append(false).Append(true).Build();
    Field boolField = new Field("col", BooleanType.Default, true);
    Schema schema = new Schema(new[] { boolField }, null);
    RecordBatch batch = new RecordBatch(schema, new IArrowArray[] { boolArray }, boolArray.Length);
    DataFrame df = DataFrame.FromArrowRecordBatch(batch);
    DataFrameColumn boolCol = df["col"];
    Assert.Equal(boolArray.Length, boolCol.Length);
    for (int row = 0; row < boolArray.Length; row++)
    {
        Assert.Equal(boolArray.GetValue(row), boolCol[row]);
    }
}

Expected behavior Above test passes

asmirnov82 commented 2 months ago

@ericstj @luisquintanilla @michaelgsharp I investigated the defect and it appeared that it is related to the same issue as was discussed in #7094. Since version 1.0 Apache Arrow uses bitmap for storing boolean values (https://issues.apache.org/jira/browse/ARROW-8788). In the DataFrame BooleanDataFrameColumn still inherits from PrimitiveDataFrameColumn and uses 1 byte to store each boolean.
During conversion DataFrame just copies memory. By default Apache Arrow allocates memory by 64 byte blocks, so for RecordBatches with less than 64 rows there is enough memory in the internal buffer and only incorrect values are fetched during access to converted column. But for RecordBatches with more than 64 rows - ArgumentOutOfRangeException is thrown on attempt to convert internal memroy buffer to the ReadOnlySpan with higher length.

So I see 2 possible solutions:

1) Introduce different implementation for BooleanDataFrameColumn, that uses 1 bit for each boolean (this will require to change API and make PrimitiveDataFrameColumn class abstract for preventing users to create instances of PrimitiveDataFrameColumn) 2) Use extra memory allocation and values convertion while DataFrame <-> Apache Arrow interop operations

I may take this task into development. Please suggest what approach is preferred?

ericstj commented 2 months ago

It sounds like option 1 is preferable. Is the only side-effect a breaking change that makes PrimitiveDataFrameColumn abstract? Do we know if anyone would actually be trying to create instances of it? cc @eerhardt

eerhardt commented 2 months ago

Do we know if anyone would actually be trying to create instances of it?

Looks like here is one. https://github.com/microsoft/sql-server-language-extensions/blob/e51909ed3748e5bccae9dcee31f8e4b4a903c7ff/language-extensions/dotnet-core-CSharp/src/managed/CSharpInputDataSet.cs#L152

We haven't officially "1.0"d this library yet right? Since BooleanDataFrameColumn is basically completely broken in this mode, what about just changing BooleanDataFrameColumn to no longer inherit from PrimitiveDataFrameColumn<bool> and manage its own storage, similar to how strings are handled?

Or introducing a new PrimitiveDataFrameColumnBase<T> class that both BooleanDataFrameColumn and PrimitiveDataFrameColumn<T> inherit from? The base type would have the operations defined on it, so people could still use it in a generic fashion, but it wouldn't force the storage to be in terms of T.

We had to do something similar in Arrow where

asmirnov82 commented 2 months ago

Is the only side-effect a breaking change that makes PrimitiveDataFrameColumn abstract?

Another side effect is performance. New implementation working with individual bits will be slower, what I think is fine taking into account 8 times benefit in memory usage.

Except cases, when Boolean column is used in API, for example in filtering and cloning, like in DataFrame public DataFrame Filter(PrimitiveDataFrameColumn<bool> filter) and in DataFrameColumn public DataFrameColumn Clone(DataFrameColumn mapIndices = null, bool invertMapIndices = false, long numberOfNullsToAppend = 0)

I think the next step can be rethinking of this API as it's already quite slow (due to BooleanDataFrameColumn uses bit operations when accessing validitybitmap) (see #6164) and memory consuming. Also current implementation is not straightforward and incorrectly works with null values (#6820).

For example, new API can be implemented as extension over IDataView: public static IDataView Filter<TSource>(this IDataView view, string columnName, Func<TSource, bool> func)

So, instead of

var boolFilter = df["timestamp"].ElementwiseGreaterThanOrEqual(unixStartTime);
var hourlydata = df.Filter(boolFilter);
var boolFilter2 = hourlydata["timestamp"].ElementwiseLessThan(unixEndTime);
hourlydata = hourlydata.Filter(boolFilter2);

more simple code can be used:

var hourlydata = df.Filter("timestamp", x => x >= unixStartTime && x < unixEndTime).ToDataFrame();
JakeRadMSFT commented 2 months ago

Isn't the way we handle nulls basically doing the 1bit for each Boolean thing?

Would we be able to use the new BooleanDataFrameColumn to be our implementation for our nulls?

asmirnov82 commented 2 months ago

Jake, you are right, currently we store null information in validity buffer using 1 bit per value - that is the reason why using BooleanDataFrameColumn takes more time, that just using the list of boolean values (as we also have to deal with this validity buffer) - and, to my mind, should be avoided in filtering API. New BooleanDataFrame column will have 2 bitmaps - first for validity and second for actual values, similar way as Apache Arrow BooleanArray.

Agree, that extracting some code from PrimitiveColumnContainer<T> to works with NullBitMapBuffers into separate Bitmap class instead of using BitUtility directly may help to reuse some logic and avoid unnecessary code duplication