dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.04k stars 1.88k forks source link

DataFrame and BaseColumn should have easier APIs to append new rows #5686

Open eerhardt opened 5 years ago

eerhardt commented 5 years ago

We should have some APIs that allow a developer to easily append new values to DataFrame and to BaseColumn.

BaseColumn today has Resize(Length + 1) and then SetValue(Length, value), but this isn't intuitive to use.

DataFrame should have an InsertRow or AppendRow method that takes a collection of objects to insert.

cc @pgovind

pgovind commented 5 years ago

2730 implements AppendRow on the DataFrame.

I'm not super convinced that we should have an Append on BaseColumn. This is mostly because DataFrameTable holds a list of BaseColumn and does book keeping of RowCount. DataFrameTable also enforces that each BaseColumn has the same RowCount. If BaseColumn could append data to itself, nothing will break at the moment but it may not behave the way folks expect. Consider the following for ex:

BaseColumn intColumn = df["IntColumn"];
intColumn.Append(5);
df.AppendRow({100, 2.0}); // appends 100 to intColumn and 2.0 to floatColumn

When someone asks for the row containing [100, 2.0] by rowIndex, they might get [5, 2.0] instead now. An easy way to fix this would be to append nulls to the remaining columns whenever a BaseColumn.Append is called. That's exactly what DataFrame.AppendRow does, but the intent/behavior is more obvious with AppendRow. So, do we still want an Append on BaseColumn?

Note: What I described above can happen already. PrimitiveColumn and StringColumn have an Append on them. They exist so a user can initially populate a column with data or because they are being used without a DataFrame. Starting from a DataFrame though, a user would have to explicitly cast a BaseColumn to one of those to call Append. At that point, maybe we can assume that users won't be surprised by this behavior? Thoughts?

Another Note: Pandas avoid this by not appending in place. All appends to a column/Series return a new column/Series.

eerhardt commented 5 years ago

Note: What I described above can happen already. PrimitiveColumn and StringColumn have an Append on them. ... a user would have to explicitly cast a BaseColumn to one of those to call Append. At that point, maybe we can assume that users won't be surprised by this behavior?

I don't understand why casting from a BaseColumn down to PrimitiveColumn would change the expectation on behavior.

Another Note: Pandas avoid this by not appending in place. All appends to a column/Series return a new column/Series.

I think that behavior may make sense. As long as there is an AppendRange type of method so I can append many values without creating intermediate column/Series objects.