dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Decode and roundtrip Timestamp fields from Arrow into DataFrame. #7260

Open vthemelis opened 1 month ago

vthemelis commented 1 month ago

Is your feature request related to a problem? Please describe.

At the moment, it does not seem possible to read a RecordBatch that contains a Timestamp column into a DataFrame:

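using System;
using System.Linq;
using Apache.Arrow;
using Microsoft.Data.Analysis;

// Build a RecordBatch with a single non-nullable Timestamp column of 10 identical values.
var timestamps = new TimestampArray.Builder()
    .AppendRange(Enumerable.Repeat(DateTimeOffset.Now, 10))
    .Build();
var batch = new RecordBatch.Builder()
    .Append("TimestampColumn", false, timestamps)
    .Build();

// Throws NotImplementedException (see the stack trace that follows).
DataFrame.FromArrowRecordBatch(batch);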

gives:

Unhandled exception. System.NotImplementedException: timestamp
   at Microsoft.Data.Analysis.DataFrame.AppendDataFrameColumnFromArrowArray(Field field, IArrowArray arrowArray, DataFrame ret, String fieldNamePrefix)
   at Microsoft.Data.Analysis.DataFrame.FromArrowRecordBatch(RecordBatch recordBatch)
   at Program.<Main>$(String[] args)

Describe the solution you'd like

The code above should succeed and read the RecordBatch's Timestamp column into a DateTimeOffset type.

Tried with:

dotnet add package Microsoft.Data.Analysis --version 0.21.1

With the current pre-release version

dotnet add package Microsoft.Data.Analysis --version 0.22.0-preview.24378.1

the read above succeeds, but the column's data type does not round-trip back to Arrow:

using System;
using System.Linq;
using Apache.Arrow;
using Microsoft.Data.Analysis;

var batch = new RecordBatch.Builder()
    .Append("TimestampColumn", false, new TimestampArray.Builder().AppendRange(Enumerable.Repeat(DateTimeOffset.Now, 10)).Build())
    .Build();

// Arrow -> DataFrame -> Arrow round trip.
var df = DataFrame.FromArrowRecordBatch(batch);
var newBatches = df.ToArrowRecordBatches();

// Compare the column's Arrow data type before and after the round trip.
Console.WriteLine($"Original datatype: {batch.Schema.GetFieldByIndex(0).DataType}");
Console.WriteLine($"Final datatype: {newBatches.First().Schema.GetFieldByIndex(0).DataType}");

> dotnet run
Original datatype: Apache.Arrow.Types.TimestampType
Final datatype: Apache.Arrow.Types.Date64Type

vthemelis commented 1 month ago

Related to https://github.com/dotnet/machinelearning/pull/6871