USArmy-Corps-of-Engineers-RMC / Numerics

Numerics is a free and open-source library for .NET developed by USACE-RMC, providing a comprehensive set of methods and algorithms for numerical computations and statistical analysis.

kNearestNeighbors-Need guidance #66

Closed tikigonzo closed 2 months ago

tikigonzo commented 3 months ago

I am working on the Train/Test split via Extension Methods and Subsets, using the NextNRIntegers method you sent me. Here is my code for both methods:

public static int[] NextNRIntegers(Random random, int min, int max, int length)
{
    // Pool of candidate integers in [min, max)
    var integers = new List<int>();
    for (int i = min; i < max; i++)
    {
        integers.Add(i);
    }

    // Draw without replacement: each picked value is removed from the pool
    var vals = new int[length];
    for (int i = 0; i < length; i++)
    {
        var r = random.Next(0, integers.Count);
        vals[i] = integers[r];
        integers.RemoveAt(r);
    }
    return vals;
}

/// Random number generator for the indices using NextNRIntegers()
public static double[][] TrainTestSplit(int[] rng, int dataSize, double[][] data, bool testing = false)
{
    // iterate through indices and then split
    //int dataSize = data[0].Length; //amount of columns (150)
    int subSampleTraining = (int)Math.Ceiling(0.7 * dataSize); // 70% training split (105)
    int subSampleTesting = dataSize - subSampleTraining; // 45

    // Calling rng for indices with seed
    //var rand = new Random(12345);
    //var rng = NextNRIntegers(rand, 0, dataSize, dataSize);
    var indicesTraining = new int[subSampleTraining]; //70% of the rng indices to training
    for (int i = 0; i < subSampleTraining; i++)
    {
        indicesTraining[i] = rng[i];
    }

    var indicesTesting = new int[subSampleTesting];
    for (int i = 0; i < subSampleTesting; i++)
    {
        indicesTesting[i] = rng[i + subSampleTraining];
    }
    //var indicesTesting = rng.Except(indicesTraining).ToArray(); // allocates the other indices for testing

    var trainingData = new double[subSampleTraining][];
    var testingData = new double[subSampleTesting][];

    for (int i = 0; i < subSampleTraining; i++)
    {
        trainingData[i] = new double[5]; //5 features in the dataset
        for (int j = 0; j < 5; j++)
        {
            trainingData[i][j] = data[j][indicesTraining[i]];
            //trainingData[i] = data[indicesTraining[i]];
        }
    }

    for (int i = 0; i < subSampleTesting; i++)
    {
        testingData[i] = new double[5];
        for (int j = 0; j < 5; j++)
        {
            testingData[i][j] = data[j][indicesTesting[i]];
            //testingData[i][j] = data[indicesTesting[i]];
        }
    }

    if (testing)
    {
        return testingData;
    }
    else
    {
        return trainingData;
    }
}
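For comparison, here is a rough sketch of the same split that returns both subsets from one call and reads the column count from the data instead of hard-coding 5. The class name `SplitSketch`, the method name `SplitColumns`, and the `trainFraction` parameter are made up for illustration and are not part of Numerics. It assumes the same layout as the code above: `data[feature][sample]`, with `indices` being a shuffled permutation of the sample indices.

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper, not part of Numerics.
    // data is feature-major: data[feature][sample].
    // indices is assumed to be a shuffled permutation of 0..sampleCount-1.
    public static (double[][] Train, double[][] Test) SplitColumns(
        int[] indices, double[][] data, double trainFraction = 0.7)
    {
        int featureCount = data.Length;
        int sampleCount = indices.Length;
        int trainCount = (int)Math.Ceiling(trainFraction * sampleCount);

        var train = new double[trainCount][];
        var test = new double[sampleCount - trainCount][];

        for (int i = 0; i < sampleCount; i++)
        {
            // Copy one sample (a column of data) into a row.
            var row = new double[featureCount];
            for (int j = 0; j < featureCount; j++)
            {
                row[j] = data[j][indices[i]];
            }

            // The first trainCount shuffled samples go to training, the rest to testing.
            if (i < trainCount)
            {
                train[i] = row;
            }
            else
            {
                test[i - trainCount] = row;
            }
        }
        return (train, test);
    }
}
```

Returning a tuple means the training and testing rows come from the same pass over one permutation, so the two subsets cannot drift apart if the method is called with different arguments.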

tikigonzo commented 3 months ago

Now, when testing kNN with the Iris dataset I am erroring out with:

Test method MachineLearning.Test_kNN.Test_kNN_RegressionIrisDataset threw exception: System.ArgumentException: The y vector must be the same length as the x matrix.

My dataset is formatted like the housing data used in the earlier test: 5 arrays, with 4 being the features and 1 being the target. I have a feeling that the kNN algorithm expects rows to be the data points and columns to be the features, but I am not sure given the previous dataset. If that is the case, should I rewrite the split algorithm so its output can be passed to kNN without changing the inputs or the argument checks?
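To make the shape question concrete, here is a minimal sketch of separating the target from the features when each row is one data point. It assumes, without confirmation from the library, that the length check in the exception compares the number of rows in x with the length of y, and that the target sits in the last column. `ShapeSketch` and `SeparateTarget` are hypothetical names, not Numerics API.

```csharp
using System;

public static class ShapeSketch
{
    // Hypothetical helper, not part of Numerics.
    // rows is sample-major: rows[sample][column], with the target assumed
    // to be in the last column of every row.
    public static (double[][] X, double[] Y) SeparateTarget(double[][] rows)
    {
        var x = new double[rows.Length][];
        var y = new double[rows.Length];
        for (int i = 0; i < rows.Length; i++)
        {
            int featureCount = rows[i].Length - 1;
            x[i] = new double[featureCount];
            // Copy the leading feature columns, then take the last column as the target.
            Array.Copy(rows[i], x[i], featureCount);
            y[i] = rows[i][featureCount];
        }
        return (x, y);
    }
}
```

With this layout, `X.Length` and `Y.Length` are both the number of data points by construction, so comparing them before calling the kNN code is a quick way to see which orientation the exception is complaining about.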

HadenSmith commented 2 months ago

I updated the test cases to be apples-to-apples with R