tikigonzo commented 3 months ago

I am writing a new subset method that takes in an indicator list, and a new training/testing split method that takes in a SplitRatio for the split. The goal is to replicate the R implementation for kNN; these methods could also be used for other ML methods.

https://www.geeksforgeeks.org/k-nn-classifier-in-r-programming/

HadenSmith commented 3 months ago

Splitting data in this manner can be done by sampling integers (array indexes) without replacement.

Here is an example from the RMC-TotalRisk software:

Public Module ExtensionMethods

''' <summary>
''' Returns an array of non-repeating random integers between a min and max value. 
''' </summary>
''' <param name="random">A random number generator.</param>
''' <param name="minValue">The minimum value to sample between.</param>
''' <param name="maxValue">The maximum value to sample between.</param>
''' <param name="length">The number of samples to return.</param>
<Extension()>
Public Function NextNRIntegers(ByVal random As Random, ByVal minValue As Integer, ByVal maxValue As Integer, ByVal length As Integer) As Integer()
    ' Create full list of possible integers
    Dim integers = New List(Of Integer)()
    For i As Integer = minValue To maxValue - 1
        integers.Add(i)
    Next

    ' Sample integers without replacement
    Dim values = New Integer(length - 1) {}
    For i As Integer = 0 To length - 1
        Dim r As Integer = random.Next(0, integers.Count)
        values(i) = integers(r)
        integers.RemoveAt(r)
    Next

    Return values
End Function

End Module

HadenSmith commented 3 months ago

This is in Visual Basic, but you can convert it to C# and add it to Extensions in Numerics.

Here is the R example:

# Splitting data into train and test data
split <- sample.split(iris, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")

Let's say we have an array of 100 observations. If the split ratio is 0.7, then:

int N = 100;
int subSampleN = (int)Math.Floor(0.7 * N);

var rand = new Random(12345);
var indices = rand.NextNRIntegers(0, N, subSampleN);

Now you have an array of indices you can use to divide the data into training or testing subsamples.
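
One way to use those indices is sketched below: rows whose index was sampled go to the training set and all other rows go to the testing set, so the values of each observation stay together. `SplitByIndices` and the `double[][]` row layout are assumptions made for illustration; they are not part of Numerics.

```csharp
using System;
using System.Collections.Generic;

public static class SplitSketch
{
    // Hypothetical helper (not part of Numerics): rows whose index appears in
    // trainIndices become the training set; every other row becomes the testing set.
    public static (double[][] Train, double[][] Test) SplitByIndices(double[][] rows, int[] trainIndices)
    {
        var isTrain = new HashSet<int>(trainIndices);
        var train = new List<double[]>();
        var test = new List<double[]>();

        for (int i = 0; i < rows.Length; i++)
        {
            // Whole rows are moved, so the values of one observation are never separated.
            if (isTrain.Contains(i)) train.Add(rows[i]);
            else test.Add(rows[i]);
        }

        // Both outputs keep the original row order, not the order of trainIndices.
        return (train.ToArray(), test.ToArray());
    }
}
```

With the `indices` array from above, a call such as `SplitSketch.SplitByIndices(rows, indices)` would return the two subsamples.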

tikigonzo commented 3 months ago

Thank you for the suggestion. I have been stuck trying to implement the R function sample.split and supplement it with a couple of new Subset methods. My problem was that I could not figure out how to randomize the overall indices without messing with the columns' data (i.e., the Iris dataset, which has 4 feature columns and the species name). Since my data has the same number of columns, I can implement this split and then iterate through the data and divide it that way.

Currently, to keep the data as organized as possible, I created a dictionary and then split it. I'm trying to debug Test_kNN.cs so I don't have to change the Matrix/Vector input parameters in kNN.cs.

Here is my code:

/// <summary>
/// Splits the data into a training and testing set 70/30 respectively.
/// </summary>
/// <param name="data">Original data or data table we want to split.</param>
/// <returns>The training set in a dictionary.</returns>
public static Dictionary<string, List<double>> SplitDataTrain(Dictionary<string, List<double>> data)
{
    var random = new Random();
    var train = new Dictionary<string, List<double>>();
    var test = new Dictionary<string, List<double>>();

    foreach (var dataClass in data)
    {
        train[dataClass.Key] = dataClass.Value.ToList();
        test[dataClass.Key] = new List<double>();

        for (int i = 0; i < Math.Ceiling(0.3 * dataClass.Value.Count); i++)
        {
            if (train[dataClass.Key].Count == 0) break;
            int idx = random.Next(train[dataClass.Key].Count);
            test[dataClass.Key].Add(train[dataClass.Key][idx]);
            train[dataClass.Key].RemoveAt(idx);
        }
    }
    return train;
}

HadenSmith commented 3 months ago

You’re on the right track. But I think you will want to sample without replacement to truly make it work as desired.

Random.Next will sample with replacement, so you can get repeat indices and double count data in the training set, etc.

That code I sent will sample without repeats.
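
For reference, the same idea can be sketched in C# with a partial Fisher-Yates shuffle, which avoids the repeated `RemoveAt` calls. This illustrates the technique only and is not the code that was added to Numerics; `SampleWithoutReplacement` is a hypothetical name.

```csharp
using System;

public static class SamplingSketch
{
    // Hypothetical helper: returns `length` distinct integers in [minValue, maxValue).
    public static int[] SampleWithoutReplacement(Random random, int minValue, int maxValue, int length)
    {
        int n = maxValue - minValue;
        if (length < 0 || length > n)
            throw new ArgumentOutOfRangeException(nameof(length));

        // Pool of all candidate integers.
        var pool = new int[n];
        for (int i = 0; i < n; i++) pool[i] = minValue + i;

        // Partial Fisher-Yates shuffle: each iteration fixes one sampled value.
        for (int i = 0; i < length; i++)
        {
            int j = random.Next(i, n); // choose from the not-yet-sampled tail
            (pool[i], pool[j]) = (pool[j], pool[i]);
        }

        // The first `length` entries are the sample; none can repeat.
        var result = new int[length];
        Array.Copy(pool, result, length);
        return result;
    }
}
```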

You can add another method for “leave one out” cross validation down the road as well.
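
A leave-one-out split could be sketched as follows: each observation is held out once as the test case while the remaining observations form the training set. `LeaveOneOut` is a hypothetical name used for illustration, not an existing Numerics method.

```csharp
using System;
using System.Collections.Generic;

public static class CrossValidationSketch
{
    // Hypothetical helper: for n observations, yields n folds. Each fold holds
    // out one index for testing and returns the other n - 1 indices for training.
    public static IEnumerable<(int[] TrainIndices, int TestIndex)> LeaveOneOut(int n)
    {
        for (int held = 0; held < n; held++)
        {
            var train = new int[n - 1];
            int k = 0;
            for (int i = 0; i < n; i++)
            {
                if (i != held) train[k++] = i;
            }
            yield return (train, held);
        }
    }
}
```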

Good work, keep at it!

HadenSmith commented 2 months ago

I added new "RandomSubset" methods for 1D and 2D arrays as well as Vector and Matrix classes. The behavior is similar to the methods used in R.

USArmy-Corps-of-Engineers-RMC / Numerics

Extension Methods - Need new helper method for ML #52
