ISEL-HGU / SPI_3.0

0 stars 0 forks source link

LCE/LCS Limits in the current implementation of the calculation of the number of incorrect vectors and the multiplier for the heuristic weight #41

Open young170 opened 2 weeks ago

young170 commented 2 weeks ago

Original code of calculateTotalIncorrectVectors():

private int calculateTotalIncorrectVectors(int[] sigma) {
        int incorrectVectorCount = 0;

        for (int i = 1; i < sigma.length - 1; i++) {
            incorrectVectorCount += sigma[i];
        }

        return incorrectVectorCount;
    }

i starts at 1 and ends at .length - 2, removing the end indices from the process.

Later on, when the LCS similarity score is calculated in scoreSimilarity(), a difference in one index(and no difference in length) gives the vector a score of 0.0.

public float scoreSimilarity(int[] poolVector, int[] givenBugVector) {

        float score = 1;
        int[] subSequence = backtrack(longestCommonSubsequenceOfIntegerArray(poolVector, givenBugVector));

        // Calculate the count of incorrect vectors in the pool for each element in the DP table
        int[] sigma = calculateSigma(poolVector, subSequence);

        // Calculate the total count of incorrect vectors in the pool
        int incorrectVectorCount = calculateTotalIncorrectVectors(sigma);

        // Heuristic to give weight according to the incorrect vectors
        if (poolVector.length != subSequence.length)
            score = 1 - (float) incorrectVectorCount / (poolVector.length - subSequence.length);

        // Adjust the score based on the length of the DP table
        score = score * ((float) subSequence.length / givenBugVector.length);

        return score;
    }

A single difference makes incorrectVectorCount = 1, which makes the equation: 1 - 1 = 0.

 score = 1 - (float) incorrectVectorCount / (poolVector.length - subSequence.length);
young170 commented 2 weeks ago

Proposed update:

For the case of calculateTotalIncorrectVectors(), removing any indices can be considered as an inaccurate representation of the differences between vectors. Including these indices will ensure the score reflects the entire sequence.

private int calculateTotalIncorrectVectors(int[] sigma) {
    int incorrectVectorCount = 0;

    for (int i = 0; i < sigma.length; i++) {
        incorrectVectorCount += sigma[i];
    }

    return incorrectVectorCount;
}

For the heuristic weight in scoreSimilarity(), using the ratio of incorrect vectors to the total number of vectors makes more sense. Instead of the original, which was the ratio of incorrect vectors to the difference in length of the vectors. Also, instead of using the variable score twice, created a separate weight variable.

public float scoreSimilarity(int[] poolVector, int[] givenBugVector) {

        float score = 1;
        float weight = 1;
        int[] subSequence = backtrack(longestCommonSubsequenceOfIntegerArray(poolVector, givenBugVector));

        // Calculate the count of incorrect vectors in the pool for each element in the DP table
        int[] sigma = calculateSigma(poolVector, subSequence);

        // Calculate the total count of incorrect vectors in the pool
        int incorrectVectorCount = calculateTotalIncorrectVectors(sigma);

        // Heuristic to give weight according to the incorrect vectors
        weight = 1 - ((float) incorrectVectorCount / poolVector.length);

        // Adjust the score based on the length of the DP table
        score = weight * ((float) subSequence.length / givenBugVector.length);

        return score;
    }

This update fulfills the goal of lowering the score of cases where there exists mulitple incorrect vectors, and having a lesser effect on vectors with a few incorrect vectors. Also, when the vectors are identical, weight is 1 which will ultimately give the vector a score of 1.0.