artiso-solutions / CoVoX

MIT License

Normalize input/target #55

Closed tommasobertoni closed 3 years ago

tommasobertoni commented 3 years ago

Because of differences in letter case and punctuation, the match score between an input and its target can vary considerably:

var target = "Turn on the Light.";

Test("Turn on the Light.");
Test("Turn on the Light");
Test("Turn on the light.");
Test("Turn on the light");
Test("turn on the light");

void Test(string input)
{
    var score = new CosineSimilarityInterpreter().CalculateMatchScore(target, input);
    Console.WriteLine($"Compare ( '{target}' '{input}' )\t= {score}");
}

Results:

Compare ( 'Turn on the Light.' 'Turn on the Light.' )   = 1

// missing final period
Compare ( 'Turn on the Light.' 'Turn on the Light' )    = 0.9733285267845753 // drops ~3%

// lowercase word
Compare ( 'Turn on the Light.' 'Turn on the light.' )   = 0.8947368421052629 // drops ~11%

// lower case and no period
Compare ( 'Turn on the Light.' 'Turn on the light' )    = 0.8651809126974003 // drops ~14%

// all lowercase without punctuation
Compare ( 'Turn on the Light.' 'turn on the light' )    = 0.8111071056538127 // drops ~19%

We should normalize both input and target before scoring, by comparing their lowercase, punctuation-free values.
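The normalization proposed above could look roughly like the following. This is only a minimal sketch: `TextNormalizer` and `Normalize` are hypothetical names, not part of the CoVoX codebase, and it assumes that "punctuation" means whatever `char.IsPunctuation` reports and that invariant-culture lowercasing is acceptable.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not the actual CoVoX implementation.
public static class TextNormalizer
{
    // Returns the text lowercased (invariant culture) with all
    // punctuation characters removed. Whitespace is left untouched.
    public static string Normalize(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        var kept = text
            .Where(c => !char.IsPunctuation(c))      // drop '.', ',', '!' and similar
            .Select(c => char.ToLowerInvariant(c))   // fold case
            .ToArray();

        return new string(kept);
    }
}
```

With this sketch, `TextNormalizer.Normalize("Turn on the Light.")` and `TextNormalizer.Normalize("turn on the light")` produce the same string, so both target and input would be passed through it before calling `CalculateMatchScore`.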

tommasobertoni commented 3 years ago

@kczornik can you re-run the test above with the new implementation? Do all cases now match at 100%?

kczornik commented 3 years ago

I can confirm: all cases return 100%.

Results:

Compare ( 'Turn on the Light.' 'Turn on the Light.' )   = 1

// missing final period
Compare ( 'Turn on the Light.' 'Turn on the Light' )    = 1

// lowercase word
Compare ( 'Turn on the Light.' 'Turn on the light.' )   = 1

// lower case and no period
Compare ( 'Turn on the Light.' 'Turn on the light' )    = 1

// all lowercase without punctuation
Compare ( 'Turn on the Light.' 'turn on the light' )    = 1

tommasobertoni commented 3 years ago

@kczornik let's also trim the values, so that leading and trailing whitespace is excluded, e.g. "Turn on the Light.  ", "  Turn on the Light.", "  Turn on the Light.  ".
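As a rough illustration of the trimming step, the sketch below relies on the standard `string.Trim()` method. `WhitespaceTrimmer` and `TrimEnds` are hypothetical names used only for this example; where the call would sit in the actual implementation is an assumption.

```csharp
using System;

// Hypothetical helper for illustration; not the actual CoVoX implementation.
public static class WhitespaceTrimmer
{
    // Removes leading and trailing whitespace; inner spaces are preserved.
    public static string TrimEnds(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        return text.Trim();
    }
}
```

Applied to the three examples above, `TrimEnds` returns "Turn on the Light." in each case, so the padded variants would score the same as the unpadded target.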