dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.91k stars 1.86k forks source link

Inconsistent dataframe columns API for Clone functionality #7092

Closed asmirnov82 closed 2 months ago

asmirnov82 commented 3 months ago

1) API of Clone method for String dataframe columns is inconsistent with other columns

It's possible to have code like this:

PrimitiveDataFrameColumn<int> column = new PrimitiveDataFrameColumn<int>("Column name");
PrimitiveDataFrameColumn<int> cloneColumn = column.Clone();

However

StringDataFrameColumn column = StringDataFrameColumn("Column name");
StringDataFrameColumn cloneColumn = column.Clone();

or

ArrowStringDataFrameColumn column = ArrowDataFrameColumn("Column name");
ArrowStringDataFrameColumn cloneColumn = column.Clone();

fails to compile as Clone method returns base DataFrameColumn class

2) User is able to create inheritor of DataFrameColumn without overriding Clone() method. Adding such columns to DataFrame brokes it functionality at runtime, as it will crashes with NotImplemented exception. Code should force user to implement all required essential methods

3) StringDataFrameColumn, ArrowStringDataFrameColumn and VBufferDataFrameColumn provides their own private Clone and CloneImplementation methods, so it's really hard to understand what method would be actually called, it very error prone. For example, PrimitiveDataFrameColumn provides:

public PrimitiveDataFrameColumn<T> Clone(PrimitiveDataFrameColumn<long> mapIndices = null, bool invertMapIndices = false)

and also it inherits from the base class

public virtual DataFrameColumn Clone(DataFrameColumn mapIndices = null, bool invertMapIndices = false, long numberOfNullsToAppend = 0)

so the call to Clone() without providing any arguments looks ambiguous

The fact, that in addition it has

public PrimitiveDataFrameColumn<T> Clone(IEnumerable<long> mapIndices)

taking into account, that PrimitiveDataFrameColumn implements IEnumerable and inherits from DataFrameColumn makes the situation even worse

4) InvertMapIndices argument doesn't have any sence in case if mapIndices is null or not provided. So ability to call method Clone(invertMapIndices: true) is logicaly incorrect