dotnet / corefxlab

This repo is for experimentation and exploring new ideas that may or may not make it into the main corefx repo.
MIT License

Prevent DataFrame.Sample() method from returning duplicated rows #2939

Closed: RamonWill closed this 4 years ago

RamonWill commented 4 years ago

Issue

The Sample method in DataFrame (code here) does not check whether an index has already been generated by rand, so the same row can be picked more than once. Most of the time I get duplicate rows because of it.

Solution

I have amended the Sample method to implement the Fisher-Yates shuffle so that the rows in the returned sample are unique and still random. I have also added a new string resource so that an exception is thrown if the requested sample size is greater than the number of rows. Tests for row uniqueness and for the exception being thrown are included as well.
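For readers unfamiliar with the technique, here is a minimal sketch of how a Fisher-Yates shuffle can produce unique row indices. It is not the C# code from this PR: it is written in Python for brevity, the name `sample_row_indices` and its parameters are hypothetical, and it stops the shuffle after `sample_size` swaps (a partial shuffle), which may differ from what the PR actually does.

```python
import random


def sample_row_indices(row_count, sample_size, rng=None):
    """Return sample_size distinct indices drawn uniformly from range(row_count).

    Hypothetical helper for illustration only; not the DataFrame API.
    """
    if sample_size < 0 or sample_size > row_count:
        # Mirrors the idea of rejecting a sample larger than the number of rows.
        raise ValueError("sample size must be between 0 and the number of rows")

    rng = rng or random.Random()
    indices = list(range(row_count))

    # Partial Fisher-Yates: settle only the first sample_size positions.
    for i in range(sample_size):
        # Pick a position from the not-yet-settled part, i <= j < row_count.
        j = rng.randrange(i, row_count)
        indices[i], indices[j] = indices[j], indices[i]

    # Each index appears exactly once in the list, so this prefix has no duplicates.
    return indices[:sample_size]
```

Because every swap only rearranges a list in which each index appears once, the result cannot contain a repeated row, unlike drawing each index independently from the random number generator.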

Kind Regards, Ramon

Fixes: #2806