XQuery test suite false negative

benibela commented 8 years ago

The XQTS knows the following regexs that should match, but do not:

f := TFLRE.Create('(124|864|377|3)', []);
f.ExtractAll('abracadabra-abracadabra-3',matches) ;

f := TFLRE.Create('^$', [rfMULTILINE]);
f.ExtractAll(LineEnding+'abcd'+LineEnding+'defg',matches) ;

f := TFLRE.Create('^$', [rfMULTILINE]);
f.ExtractAll(LineEnding+'abcd'+LineEnding+LineEnding+'defg',matches) ;

benibela commented 8 years ago

The Kelvin sign K should match k or K in case insensitive mode

And there should be the following Unicode classes: \P{C} \P{Cc} \P{L} \p{Lu} \P{Lu} \P{M} \P{Me} \P{N} \P{No} \P{P} \P{Pf} \P{S} \P{Sk} \P{Z} \P{Zs}

benibela commented 8 years ago

And some more that should have matched

<test id="p86" regex="^" input="abc" result="y"/>
<test id="p87" regex="$" input="abc" result="y"/>
<test id="p205" regex="a*" input="" result="y"/>
<test id="p220" regex="(abc|)ef" input="abcdef" result="y"/>
<test id="p282" regex="([a-c]*)\1" input="abcabc" result="y"/>
<test id="p294" regex="(a)|\1" input="x" result="y"/>
<test id="p295" regex="(?:(b)?a)\1" input="a" result="y"/>
<test id="p341" regex="^" flags="i" input="ABC" result="y"/>
<test id="p342" regex="$" flags="i" input="ABC" result="y"/>
<test id="p394" regex="(a+|b){0,1}?" flags="i" input="AB" result="y"/>
<test id="p398" regex="a*" flags="i" input="" result="y"/>
<test id="p409" regex="(abc|)ef" flags="i" input="ABCDEF" result="y"/>
<test id="p489" regex="^(){3,5}" input="abc" result="y"/>
<test id="p662" regex="$" input="a\nb\n" result="y"/>
<test id="p665" regex="$" input="b\na\n" result="y"/>
<test id="p668" regex="$" input="b\na" result="y"/>
<test id="p671" regex="$" flags="m" input="a&#xA;b&#xA;" result="y"/>
<test id="p674" regex="$" flags="m" input="b&#xA;a&#xA;" result="y"/>
<test id="p677" regex="$" flags="m" input="b&#xA;a" result="y"/>
<test id="p908" regex="(\w)?(abc)\1b" input="abcab" result="y"/>
<test id="p955" regex="(.*)\d+\1" input="abc12bc" result="y"/>
<test id="p995" regex="^.{3,4}(.+)\1$" input="foobarbar" result="y"/>
<test id="p996" regex="^(?:f|o|b){3,4}(.+)\1$" input="foobarbar" result="y"/>
<test id="p997" regex="^.{3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p998" regex="^(?:f|o|b){3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p999" regex="^.{3,4}(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1000" regex="^(?:f|o|b){3,4}(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1001" regex="^.{3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1002" regex="^(?:f|o|b){3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1003" regex="^.{2,3}?(.+)\1$" input="foobarbar" result="y"/>
<test id="p1004" regex="^(?:f|o|b){2,3}?(.+)\1$" input="foobarbar" result="y"/>
<test id="p1005" regex="^.{2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p1006" regex="^(?:f|o|b){2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p1007" regex="^.{2,3}?(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1008" regex="^(?:f|o|b){2,3}?(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1009" regex="^.{2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1010" regex="^(?:f|o|b){2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1016" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/>
<test id="p1035" regex="X(A|B||C|D)Y" input="XXXYYY" result="y"/>
<test id="p1042" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1043" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1044" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1045" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1046" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1047" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/>
<test id="p1048" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1049" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1050" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1051" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1052" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1053" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/>
<test id="p1228" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/>
<test id="p1236" regex="(WORDS|WORLD|WORD)+S" input="WORDS" result="y"/>
<test id="p1601" regex="^\p{L}" input="㐀" result="y"/>
<test id="p1684" regex="^(foo|)bar$" input="bar" result="y"/>
<test id="p1685" regex="^(foo||baz)bar$" input="bar" result="y"/>
<test id="p1689" regex="^(?:foo|)bar$" input="bar" result="y"/>
<test id="p1690" regex="^(?:foo||baz)bar$" input="bar" result="y"/>
<test id="p1705" regex="^(.)(?:(.)+)*[BX]" input="ABCDE" result="y"/>

BeRo1985 commented 8 years ago

The ^$ stuff should work now. Please retest it.

And back-references are only partially supported, because full support for irregular expression back-references does need a backtracking algorithm, and FLRE avoids backtracking as much as possible. More details see this followed article series (whereupon FLRE was also modeled): https://swtch.com/~rsc/regexp/regexp1.html https://swtch.com/~rsc/regexp/regexp2.html and https://swtch.com/~rsc/regexp/regexp3.html

And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.

benibela commented 8 years ago

The ^$ stuff should work now. Please retest it.

Now it complains that it matches too much

And back-references are only partially supported, because full support for irregular expression

It is not really that irregular.

Seems mostly to be that if the capture group is optional and does not match anything in the input, the backreference should match the empty string. For that you could even remove both of them during the expansion

And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.

here

github: "Unfortunately, we don’t support that file type. " ...

here as png flrematches

benibela commented 8 years ago

And the list of missing Unicode classes: \p{C} \p{Cc} \p{Cf} \p{Cn} \p{Co} \p{IsAlphabeticPresentationForms} \p{IsArabic} \p{IsArmenian} \p{IsArrows} \p{IsBasicLatin} \p{IsBengali} \p{IsBlockElements} \p{IsBopomofo} \p{IsBopomofoExtended} \p{IsBoxDrawing} \p{IsBraillePatterns} \p{IsByzantineMusicalSymbols} \p{IsCherokee} \p{IsCJKCompatibility} \p{IsCJKCompatibilityForms} \p{IsCJKCompatibilityIdeographs} \p{IsCJKCompatibilityIdeographsSupplement} \p{IsCJKRadicalsSupplement} \p{IsCJKSymbolsandPunctuation} \p{IsCJKUnifiedIdeographs} \p{IsCJKUnifiedIdeographsExtensionA} \p{IsCJKUnifiedIdeographsExtensionB} \p{IsCombiningDiacriticalMarks} \p{IsCombiningHalfMarks} \p{IsCombiningMarksforSymbols} \p{IsControlPictures} \p{IsCurrencySymbols} \p{IsCyrillic} \p{IsDeseret} \p{IsDevanagari} \p{IsDingbats} \p{IsEnclosedAlphanumerics} \p{IsEnclosedCJKLettersandMonths} \p{IsEthiopic} \p{IsGeneralPunctuation} \p{IsGeometricShapes} \p{IsGeorgian} \p{IsGothic} \p{IsGreek} \p{IsGreekExtended} \p{IsGujarati} \p{IsGurmukhi} \p{IsHalfwidthandFullwidthForms} \p{IsHangulCompatibilityJamo} \p{IsHangulJamo} \p{IsHangulSyllables} \p{IsHebrew} \p{IsHighSurrogates} \p{IsHiragana} \p{IsIdeographicDescriptionCharacters} \p{IsIPAExtensions} \p{IsKanbun} \p{IsKangxiRadicals} \p{IsKannada} \p{IsKatakana} \p{IsKhmer} \p{IsLao} \p{IsLatinExtendedAdditional} \p{IsLetterlikeSymbols} \p{IsLowSurrogates} \p{IsMalayalam} \p{IsMathematicalAlphanumericSymbols} \p{IsMathematicalOperators} \p{IsMiscellaneousSymbols} \p{IsMiscellaneousTechnical} \p{IsMongolian} \p{IsMusicalSymbols} \p{IsMyanmar} \p{IsNumberForms} \p{IsOgham} \p{IsOldItalic} \p{IsOpticalCharacterRecognition} \p{IsOriya} \p{IsPrivateUse} \p{IsRunic} \p{IsSinhala} \p{IsSmallFormVariants} \p{IsSpacingModifierLetters} \p{IsSpecials} \p{IsSuperscriptsandSubscripts} \p{IsSyriac} \p{IsTags} \p{IsTamil} \p{IsTelugu} \p{IsThaana} \p{IsThai} \p{IsTibetan} \p{IsUnifiedCanadianAboriginalSyllabics} \p{IsYiRadicals} \p{IsYiSyllables} \p{L} \p{Ll} \p{Lm} \p{Lo} \p{Lt} \p{Lu} \p{M} \p{Mc} \p{Me} \p{Mn} \p{N} \p{Nd} \p{Nl} \p{No} \p{P} \p{Pc} \p{Pd} \p{Pe} \p{Pf} \p{Pi} \p{Po} \p{Ps} \p{S} \p{Sc} \p{Sk} \p{Sm} \p{So} \p{Z} \p{Zl} \p{Zp} \p{Zs}

BeRo1985 commented 8 years ago

The Unicode classes were not missing, instead the code for the \p{} parsing had just a bug, which I've fixed now.

benibela commented 8 years ago

Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}

And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match

benibela commented 8 years ago

Is there an option to make the escapes independent of the case sensitivity? So \p{Lu} only matches upper case letters, even if matched case insensitive.

BeRo1985 commented 8 years ago

Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}

And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match

The InBasicLatin parsing bug is fixed now. And I'll add IsCombiningMarksforSymbols later.

But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server). The whole FLRE unicode support stuff is automatically converted from these offical Unicode data files, see src\Unicode\FLREConvertUnicode.dpr for the OfficalUnicodeResourceData->FLREData-converter-code.

Is there an option to make the escapes independent of the case sensitivity? So \p{Lu} only matches upper case letters, even if matched case insensitive.

Hm just use (?-i)\p{Lu}(?i) ? :)

benibela commented 8 years ago

But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server).

What other stuff?

Armenian is there. Range, 1328;ToChar:1423. And \p{IsArmenian} works for all the assigned code points https://en.wikipedia.org/wiki/Armenian_%28Unicode_block%29 except 0x589 (Armenian fullstop).

Hm just use (?-i)\p{Lu}(?i) ? :)

That would make sense. Guess a preprocessor could output that, even if it would not allow it as input

But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1 should still match sUtsut

benibela commented 8 years ago

armenian.txt

benibela commented 8 years ago

0x589 (Armenian fullstop).

Perhaps because that is in the Armerian block, but the Common script

BeRo1985 commented 8 years ago

But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1 should still match sUtsut

Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.

For example: (s[uU]t)\1 will transformed and compiled internally into (s[uU]t)(?hidden-internal-capture-group:clone-capturing-group-1:s[uU]t)(?hidden-internal-verify-operation:is-capture-group-1-equal-to-hidden-internal-capture-group-1) .

This is necessary, because, as already said, FLRE avoids backtracking as possible (except at the BitstateNFA algorithm for short and simple non-complex regexs), and the Thompson-NFA (Parallel-Threaded-NFA) algorithm can emulate backreferences only in that hacky way, since the Thompson-NFA (Parallel-Threaded-NFA) algorithm in FLRE does never backtrack instead it processes every input char exactly once. And the DFA code in FLRE sees (s[uU]t)\1 just only as s[uU]ts[uU]t and leaves it to the non-backtracking *NFA algorithms the final verification.

Full real support for back-references would do need a full back-tracking-algorithm implementation, and back-tracking can be very slow and is even too vulnerable to ReDoS attacks (Regular expression Denial of Service, see https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS and https://en.wikipedia.org/wiki/ReDoS and even https://swtch.com/~rsc/regexp/regexp1.html ), therefore FLRE avoids back-tracking as possible, and this consequently FLRE has only emulated partial hacky support for back-references.

And as a small side-note: the Google RE2 regexp engine (which was the role model for FLRE) doesn't support back-references at all for these reasons.

Any questions left regarding the back-references topic? :)

benibela commented 8 years ago

Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.

Sounds complicated

Perhaps it could remove the ?i when cloning the group?

and back-tracking can be very slow

I do not really need the speed. It just needs to pass the test cases that have require this specific syntax.

BeRo1985 commented 8 years ago

My suggestion (and my offer for you) would be a syntax like (s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t) or something like that, where (?P=x:y) (as a extension of the normal alternative (?P=x) backreference syntax) would be a compare-to-another-capturing-group capturing-group, where x = compare-with-capturing-group-index-or-named-group-name and y = manual-by-hand cloned but modified regex. Anything else would mean more effort and complicate the code, I think.

benibela commented 8 years ago

that would be great

benibela commented 8 years ago

Another one that should match (using the matches function from the above png):

  matches('abc, foobar','(^|) *abc *(,|$)')

BeRo1985 commented 8 years ago

(?i)(s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t) and (^|) *abc *(,|$) should work now. I've improved the beginning-and-ending-text-anchor-detection for to disable some speed optimizations then, and that had fixed at least (^|) *abc *(,|$).

benibela commented 8 years ago

Thanks. Now I need to figure out how to combine it with my preprocessor

benibela commented 8 years ago

If there are parenthesis inside the (?P=..) does that create new capturing groups?

Tried to test it, but found this instead:

 matches('aaxyy', '([a-c])\1(x)(y)\3', '');
 matches('aAxyy', '([a-c])\1(x)(y)\3', 'i');

BeRo1985 commented 8 years ago

If there are parenthesis inside the (?P=..) does that create new capturing groups?

Hm yes, you can use here non-capturing-groups (?:) for example (?:\w+)

Tried to test it, but found this instead:

matches('aaxyy', '([a-c])\1(x)(y)\3', ''); matches('aAxyy', '([a-c])\1(x)(y)\3', 'i');

Should fixed now in the lastest commit.

benibela commented 8 years ago

The InBasicLatin parsing bug is fixed now.

It was IsBasicLatin in this test

Although it seems it gives both. Some even have \p{script=...} or \p{block=...}

benibela commented 8 years ago

You are aware that it is still failing the abracadabra-abracadabra-3 case, etc. ?

Reino17 commented 8 years ago

Hello BeRo1985,

I'm one of Xidel's users; benibela's project that seems to using your FLRE-library. He's dead set on updating Xidel's regular expression engine before fixing other bugs and releasing another update at all, but relies on you to do so. If you would fix benibela's reported issues, besides helping benibela you would also help me in the process. Thanks!

benibela commented 8 years ago

Perhaps i should close this general issue and open new ones for every sub issue.

Might be easier to handle

benibela commented 8 years ago

f := TFLRE.Create('(124|864|377|3)', []); f.ExtractAll('abracadabra-abracadabra-3',matches) ;

It fails with most common prefixes and none of six seven twenty thirty forty fifty sixty are matched.

benibela commented 8 years ago

Or with dates jan|january|feb|february|mar|march|apr|april|may|may|jun|june|jul|july|aug|august|sep|september|oct|october|nov|november|dec|december

Does not match any of mar jun jul

pyscripter commented 6 years ago

See issue #44. Commenting out the Prefix factoring resolves the disjunction issues above, i,e

f := TFLRE.Create('(124|864|377|3)', []);
f.ExtractAll('abracadabra-abracadabra-3',matches)

and the months work.

benibela commented 3 years ago

This issue was too big. It is hard to see which test cases have been fixed and which ones are still failing.

I have opened a separate issue for each group of test cases: #65 #67 #68 #69 #70 #72 #73 #74

BeRo1985 / flre

XQuery test suite false negative #10