IanMayo closed 11 months ago
Are you still intending to look for more patterns for when this is failing? Or should I go ahead and change the logic for how we check?
And, if I do change the logic, do you mean if any 30 successive characters anywhere in the string match then we count it as a match? I suspect that's actually significantly more computationally intensive to do, as we'd have to loop over every possible 30-char long substring and check each one until we find one that matches. Is that worth it?
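To put a shape on the cost question, here is a minimal sketch of the "any 30 successive characters" check. The function name `any_window_matches` and the `window` parameter are made up for illustration and are not taken from the file-checker. In the worst case it does one substring search per start position in the source, which is the loop described above, but it stops at the first hit.

```python
def any_window_matches(source: str, target: str, window: int = 30) -> bool:
    """Return True if any `window`-character run of `source` occurs in `target`.

    Hypothetical helper: the name and default window size are assumptions.
    """
    # A source shorter than the window can only be compared as a whole.
    if len(source) <= window:
        return source in target
    # Slide a fixed-size window over the source; any() stops at the first match.
    return any(
        source[i:i + window] in target
        for i in range(len(source) - window + 1)
    )
```

For paragraph-sized strings this is probably cheap enough in practice, though that would need measuring against the real files.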
For the 30 chars, I thought we could simplify the test - by taking a random 30 char long block of text from the string, and if they are also present in the target then it's valid. Matching a longer string seems to invite more chances for false negatives (through different whitespace or the presence of special characters).
I've just pulled the value of 30 out of the air, since it's longer than "Characteristics" or some long word that could occur multiple times.
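The random-block idea could be sketched roughly as below. `random_block_matches`, `block_len` and `rng` are hypothetical names, not anything from the existing checker. Note that a single random sample can still land on a stretch that differs (for example one containing a special character), so this reduces false negatives rather than ruling them out.

```python
import random


def random_block_matches(source, target, block_len=30, rng=None):
    """Check one randomly chosen `block_len`-character block of `source`.

    Hypothetical helper: names and the default of 30 are assumptions.
    Pass a seeded random.Random as `rng` for repeatable results.
    """
    rng = rng or random
    # Short sources are compared whole, since no 30-char block exists.
    if len(source) <= block_len:
        return source in target
    # Pick a start offset so the block fits entirely inside the source.
    start = rng.randrange(len(source) - block_len + 1)
    return source[start:start + block_len] in target
```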
I won't get a chance to dig into why I'm getting false negatives for a few days, but if you're ok with reducing the length of the text being compared, then that in itself may reduce the false negatives.
The file-checker is failing files, but the target content is actually present.
The tester reports:
This paragraph of text is present in the published dita. The text looks identical, including
chars, and erroneous double spaces between words.
To avoid the above false error, I think we should trim the block of text. I guess if the target contains, say, 30 successive matching chars from the source then it is valid.
I'll come back tomorrow and see if I can spot a pattern.
Aah, there is something. In the source html for another file there is a ° marker, but in the dita it is an ASCII degree symbol. I guess we should strip these out of both strings before comparing.
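A rough sketch of stripping these out before comparing, assuming the html marker is a character entity such as `&deg;` (that is a guess, the exact markup isn't shown here). `normalise` is a hypothetical name. It also collapses runs of whitespace, since double spaces were mentioned as another difference.

```python
import html
import re


def normalise(text: str) -> str:
    """Reduce a string to a form that is safer to compare.

    Hypothetical helper; assumes the html side uses entities like "&deg;".
    """
    # Decode entities: "&deg;" becomes the degree character,
    # "&nbsp;" becomes a non-breaking space.
    text = html.unescape(text)
    # Drop degree symbols entirely, so both sides agree either way.
    text = text.replace("\u00b0", "")
    # Collapse any run of whitespace (including non-breaking spaces) to one space.
    return re.sub(r"\s+", " ", text).strip()
```

Both the source and the target would be passed through `normalise` before the substring test, e.g. `normalise(source_block) in normalise(target_text)`.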