dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

DataFrame.OrderBy methods incorrect behavior with null values #7102

Closed (asmirnov82 closed this 2 weeks ago)

asmirnov82 commented 3 months ago

The DataFrame OrderBy methods should always place null values at the bottom of the list (after non-null values), regardless of the sort direction (ascending or descending). This is how Python does it and how the DataFrameColumn.Sort method works.

To Reproduce:

var col1 = new Int32DataFrameColumn("Index", new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 });
var col2 = new StringDataFrameColumn("Country", new[] { "USA", "France", "UK", "Brazil", "Russia", "India", null, "China", null });
var col3 = new StringDataFrameColumn("Capital", new[] { "Washington", "Paris", "London", "Brasilia", "Moscow", "New Dehli", null, "Beijing", null});

var df = new DataFrame(col1, col2, col3);
Console.WriteLine(df.OrderByDescending("Capital"));

Actual behaviour:

Index  Country  Capital
9      null     null
7      null     null
1      USA      Washington
2      France   Paris
6      India    New Dehli
5      Russia   Moscow
3      UK       London
4      Brazil   Brasilia
8      China    Beijing

Expected behaviour:

Index  Country  Capital
1      USA      Washington
2      France   Paris
6      India    New Dehli
5      Russia   Moscow
3      UK       London
4      Brazil   Brasilia
8      China    Beijing
9      null     null
7      null     null
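
The "nulls last in both directions" rule can be illustrated without the DataFrame API at all. The following is a minimal sketch using plain LINQ, not the library's implementation: `NullsLastSketch` and `SortedIndices` are hypothetical names introduced only for this example, and ordinal string comparison is an assumption. It sorts on two keys: first on whether the value is null, then on the value itself in the requested direction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullsLastSketch
{
    // Hypothetical helper (not part of Microsoft.Data.Analysis): returns row
    // indices ordered by the given values, with null entries always last.
    public static int[] SortedIndices(IReadOnlyList<string> values, bool ascending)
    {
        var indices = Enumerable.Range(0, values.Count);

        // First key: false (non-null) sorts before true (null), so nulls go
        // to the end regardless of the direction used for the second key.
        var nullsLast = indices.OrderBy(i => values[i] == null);

        // Second key: the value itself, ascending or descending.
        // Ordinal comparison is an assumption made for this sketch.
        var ordered = ascending
            ? nullsLast.ThenBy(i => values[i], StringComparer.Ordinal)
            : nullsLast.ThenByDescending(i => values[i], StringComparer.Ordinal);

        return ordered.ToArray();
    }

    public static void Main()
    {
        var capitals = new[] { "Washington", "Paris", "London", "Brasilia", "Moscow", "New Dehli", null, "Beijing", null };

        // Print the 1-based Index and Capital in descending order.
        foreach (int i in SortedIndices(capitals, ascending: false))
            Console.WriteLine($"{i + 1} {capitals[i] ?? "null"}");
    }
}
```

Because LINQ's ordering is stable, the null rows keep their original relative order in this sketch (7 before 9), which may differ from the order of the null rows shown in the expected output above. The point is only that the null check is a separate sort key that never flips with the direction.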

Notes:

'Console.WriteLine(new DataFrame([col3.Sort(ascending: false)]));' works correctly and places the null values last:

Capital
Washington
Paris
New Dehli
Moscow
London
Brasilia
Beijing
null
null

This issue was already mentioned in https://github.com/dotnet/machinelearning/pull/5776/files#r624316355