fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

DropSparseRows incorrectly returns 0 #513

Closed: ddemland closed this issue 3 years ago

ddemland commented 3 years ago

I am using version 2.2.0 of the Deedle NuGet package in VS 2019, and the following code returns 0 records for the DropSparseRows call:

    // Load the data into a data frame
    var dataPath = Path.Combine(dataDirPath, "train.csv");
    Console.WriteLine("Loading {0}\n", dataPath);
    var houseDF = Frame.ReadCsv(
        dataPath,
        hasHeaders: true,
        inferTypes: true
    );
    var dd = houseDF["GarageArea"].ValuesAll.ToArray();
    var dd2 = houseDF.DropSparseRows()["GarageArea"].ValuesAll.ToArray();

However, when I revert the package to version 1.2.5, I get 1460 records for the dd variable and 1114 for the dd2 variable, which seems to be correct. The attached file is the one I am using in the code.

train.txt

zyzhu commented 3 years ago

I took a quick look. Column PoolQC in your file is NA for all rows.

Deedle uses the same CSV type inference as FSharp.Data. I assume that implementation has changed so that this column is now inferred as entirely missing. Since DropSparseRows removes every row that contains a missing value, all rows are dropped, so the current output is correct. Consider dropping this column if it has no values, or filling the missing values with something, before you drop sparse rows.
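For anyone hitting the same behaviour, here is a minimal sketch of the "drop the empty column first" workaround. It assumes a local file named train.csv (a hypothetical path standing in for the attachment) and that the all-NA columns show up with a ValueCount of 0 after loading. It has not been checked against every Deedle release, so treat it as an illustration rather than the definitive fix.

```csharp
using System;
using System.Linq;
using Deedle;

class Program
{
    static void Main()
    {
        // Hypothetical path; point this at your own copy of the data.
        var houseDF = Frame.ReadCsv("train.csv", hasHeaders: true, inferTypes: true);

        // Find columns that contain no values at all (for example PoolQC in this file).
        // Collect the keys into a list first so the frame is not modified while enumerating.
        var emptyColumns = houseDF.ColumnKeys
            .Where(key => houseDF.Columns[key].ValueCount == 0)
            .ToList();

        // DropColumn modifies the frame in place.
        foreach (var key in emptyColumns)
        {
            Console.WriteLine("Dropping empty column {0}", key);
            houseDF.DropColumn(key);
        }

        // With the all-missing columns removed, only rows that are
        // missing a value in some remaining column are dropped.
        var dense = houseDF.DropSparseRows();
        Console.WriteLine("Rows before: {0}, after: {1}", houseDF.RowCount, dense.RowCount);
    }
}
```

If only one column is known to be empty, calling houseDF.DropColumn("PoolQC") directly before DropSparseRows should be enough.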

ddemland commented 3 years ago

Thank you, that worked. I do not need the column, so I dropped it, and now everything is working. You can close this issue.