BeRo1985 / flre

FLRE - Fast Light Regular Expressions - A fast light regular expression library
GNU Lesser General Public License v2.1

searchMatch matching " Bis " to "\s*Titel\s*" #63

Open benibela opened 3 years ago

benibela commented 3 years ago

When I load the text #10#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#66#105#115#10#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32 ("Bis" with whitespace) from an HTML file and match it to \s*Titel\s*, I get a match:

$ wine ~/hg/programs/internet/xidel/xidel.exe \
  z:\\home\\benito\\hg\\programs\\internet\\VideLibri\\_meta\\tests\\\\aDISWeb\\list_orders_munich.html \
 -e 'let $x := (//text()[contains(.,"Bis")]) 
      return matches($x => string(), "\s*Titel\s*", "im")'
true

However, when I just match it without the HTML file, I do not get a match:

$ wine ~/hg/programs/internet/xidel/xidel.exe \
   z:\\home\\benito\\hg\\programs\\internet\\VideLibri\\_meta\\tests\\\\aDISWeb\\list_orders_munich.html \
  -e 'x:cps((10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 66, 105, 115, 10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32)) => string-join("") => matches(  "\s*Titel\s*", "im")'
false
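
As a cross-check independent of xidel and FLRE, the same input can be rebuilt from its code points and tested against a different regex engine. This is only a sketch: Python's `re` stands in as a reference engine, and the 19 spaces per run is how I count the code-point list above, so treat that number as an assumption.

```python
import re

# Code points from the x:cps(...) call: newline, spaces, "Bis", newline, spaces.
# 19 spaces per run is an assumption based on counting the list by hand.
code_points = [10] + [32] * 19 + [66, 105, 115] + [10] + [32] * 19
text = "".join(chr(cp) for cp in code_points)

# Python's re is a stand-in reference engine here, not FLRE.
pattern = re.compile(r"\s*Titel\s*", re.IGNORECASE | re.MULTILINE)
match = pattern.search(text)

print(len(text.encode("utf-8")))  # byte length of the input
print(match)                      # None means "no match"
```

With those counts the input is 43 bytes long, which agrees with the UntilExcludingPosition = 43 reported further down, and the reference engine finds no match.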

(flags case-insensitive and single line, so the regex becomes really big according to DumpRegularExpression ((?:[\x09-\x0d ]|\xc2\xa0|\xe1(?:\x9a\x80|\xa0\x8e)|\xe2(?:\x80[\x80-\x8b\xa8-\xa9\xaf]|\x81\x9f)|\xe3\x80\x80|\xef(?:\xbb\xbf|\xbf\xbe))*[Bb][Ii][Bb][Ll][Ii][Oo][Tt][Hh][Ee][Kk]|[Aa][Uu][Ss][Gg][Aa][Bb][Ee][Oo][Rr][Tt]|[Zz][Ww][Ee][Ii][Gg][Ss][Tt][Ee][Ll][Ll][Ee](?:[\x09-\x0d ]|\xc2\xa0|\xe1(?:\x9a\x80|\xa0\x8e)|\xe2(?:\x80[\x80-\x8b\xa8-\xa9\xaf]|\x81\x9f)|\xe3\x80\x80|\xef(?:\xbb\xbf|\xbf\xbe))*))
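
The long alternations in that dump look like the UTF-8 byte sequences of Unicode whitespace characters, which would explain why `\s` expands so much. The sketch below decodes a few of the byte sequences taken from the dump; that they come from expanding `\s` is my assumption, not something the dump states.

```python
# Byte sequences copied from the dumped regex above.
sequences = [
    b"\xc2\xa0",
    b"\xe1\x9a\x80",
    b"\xe1\xa0\x8e",
    b"\xe2\x80\x80",
    b"\xe2\x80\xa8",
    b"\xe2\x80\xaf",
    b"\xe2\x81\x9f",
    b"\xe3\x80\x80",
    b"\xef\xbb\xbf",
    b"\xef\xbf\xbe",
]

# Decode each sequence as UTF-8 and collect its single code point.
decoded = {seq: "U+%04X" % ord(seq.decode("utf-8")) for seq in sequences}

for seq, code_point in decoded.items():
    print(seq, "->", code_point)
```

Each sequence decodes to exactly one code point (for example `\xc2\xa0` is U+00A0 and `\xe3\x80\x80` is U+3000), so the DFA is working on bytes rather than on code points.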

Anyways, the strings are the same in either case, so it does not have anything to do with the HTML file...

If I look at TFLRE.SearchMatch, it takes the branch fifDFAReady in InternalFlags, case .. DFAMatch:, UnanchoredStart with StartPosition = 0, UntilExcludingPosition = 43, MatchEnd = 92. Then it takes the next exit, with MatchBegin = 64, MatchEnd = 92.

That should be an impossible match, because the string is only 43 bytes long, shouldn't it?
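
The invariant being violated can be written down as a small check. This is a sketch only: `match_in_bounds` is a hypothetical helper, not part of FLRE, and whether FLRE's MatchEnd is inclusive or exclusive is left open here, since 92 is out of range either way.

```python
def match_in_bounds(match_begin, match_end, start, until_excluding):
    """Return True if the match offsets lie within the searched range.

    Hypothetical helper for illustration; not an FLRE function.
    """
    return start <= match_begin <= match_end <= until_excluding


# Values reported from the debugger session above.
reported_ok = match_in_bounds(64, 92, start=0, until_excluding=43)
print(reported_ok)
```

For the reported values the check fails, since both MatchBegin = 64 and MatchEnd = 92 lie past UntilExcludingPosition = 43.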

It calls TFLREDFA.SearchMatchFast, and it happens with and without #58.

But when I remove all the assembly, so that it uses the Pascal TFLREDFA.SearchMatchFast, it works without any problem (naturally, it only affects the 32-bit build).