File checker - reduce size of text comparison

IanMayo commented 11 months ago

The file-checker is failing files, but the target content is actually present.

The tester reports:

Couldn't find source text from xxx.html in target document yyy.dita
Source text:
2.       A mid-life upgrade of these installations was announced in 2011.  ANCHORS was upgraded with a new heat 
pump and ACME waste cleanser.  BRAVO is to be upgraded with NOBLE IOT data system and DRAGO halon drench, 
as well as a new Drago Mills EMS system

This paragraph of text is present in the published dita. The text looks identical, including the &nbsp; chars and the erroneous double spaces between words.

To avoid the above false error, I think we should trim the block of text. I guess if the target contains, say, 30 successive matching chars from the source then it is valid.

I'll come back tomorrow and see if I can spot a pattern.

Aah, there is something. In the source html for another file there is a &deg; marker, but in the dita it is a literal degree symbol (°). I guess we should strip these out of both strings before comparing.
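
As a rough sketch of that idea (not the actual file-checker code): rather than stripping the characters, both strings could be normalised the same way before comparing, so an HTML entity and its literal character end up identical. The function name `normalise` is made up for illustration; it only uses the standard library.

```python
import html
import re


def normalise(text):
    """Hypothetical helper: make source and target text comparable."""
    # Decode HTML entities (e.g. &deg; or &nbsp;) into their literal characters
    text = html.unescape(text)
    # Collapse every run of whitespace, including non-breaking spaces,
    # into a single ordinary space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


# Both sides go through the same function before the containment check
source = "2.&nbsp;&nbsp; heated to 30&deg;  by the new heat\npump"
target = "<p>2. heated to 30° by the new heat pump</p>"
print(normalise(source) in normalise(target))
```

This would also take care of the double spaces and line breaks, assuming they are the only other differences.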

robintw commented 11 months ago

Are you still intending to look for more patterns for when this is failing? Or should I go ahead and change the logic for how we check?

And, if I do change the logic, do you mean that we count it as a match if any 30 successive characters anywhere in the string match? I suspect that's actually significantly more computationally intensive to do, as we'd have to loop over every possible 30-char substring and check each one until we find one that matches. Is that worth it?
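
For reference, the exhaustive version described here might look something like the sketch below. The function name and the `size` parameter are illustrative only, not existing code. It does one substring search per start position, which is where the extra cost comes from on long paragraphs.

```python
def any_window_matches(source, target, size=30):
    """Hypothetical check: True if any `size`-char slice of source is in target."""
    # Short sources cannot be sliced into windows, so compare them whole
    if len(source) <= size:
        return source in target
    # Try every start position; any() stops at the first window that matches
    return any(
        source[i:i + size] in target
        for i in range(len(source) - size + 1)
    )
```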

IanMayo commented 11 months ago

For the 30 chars, I thought we could simplify the test by taking a random 30-char block of text from the source string; if it is also present in the target then it's valid. Matching a longer string seems to invite more chances for false negatives (through different whitespace or the presence of special characters).

I've just pulled the value of 30 out of the air, since it's longer than "Characteristics" or some other long word that could occur multiple times.
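
Something like this is what I have in mind, as a minimal sketch only. The name `random_block_matches` and its parameters are invented for illustration, and 30 is the same arbitrary value as above.

```python
import random


def random_block_matches(source, target, size=30, rng=random):
    """Hypothetical check: is one randomly chosen block of source in target?"""
    # If the source is no longer than the block size, compare it whole
    if len(source) <= size:
        return source in target
    # Pick a random start so the block fits entirely inside the source
    start = rng.randrange(len(source) - size + 1)
    return source[start:start + size] in target
```

One thing to bear in mind: if the random block happens to land on a stretch containing a special character or odd whitespace, the check can still fail, so the result may differ between runs unless `rng` is seeded.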

I won't get a chance to look for a deeper understanding of why I'm getting false negatives for a few days, but if you're ok with reducing the length of the text being compared, then that itself may reduce the false negatives.

DeepBlueCLtd / LegacyMan

File checker - reduce size of text comparison #546