Closed benibela closed 3 years ago
The Kelvin sign K should match k or K in case insensitive mode
And there should be the following Unicode classes: \P{C} \P{Cc} \P{L} \p{Lu} \P{Lu} \P{M} \P{Me} \P{N} \P{No} \P{P} \P{Pf} \P{S} \P{Sk} \P{Z} \P{Zs}
And some more that should have matched
<test id="p86" regex="^" input="abc" result="y"/>
<test id="p87" regex="$" input="abc" result="y"/>
<test id="p205" regex="a*" input="" result="y"/>
<test id="p220" regex="(abc|)ef" input="abcdef" result="y"/>
<test id="p282" regex="([a-c]*)\1" input="abcabc" result="y"/>
<test id="p294" regex="(a)|\1" input="x" result="y"/>
<test id="p295" regex="(?:(b)?a)\1" input="a" result="y"/>
<test id="p341" regex="^" flags="i" input="ABC" result="y"/>
<test id="p342" regex="$" flags="i" input="ABC" result="y"/>
<test id="p394" regex="(a+|b){0,1}?" flags="i" input="AB" result="y"/>
<test id="p398" regex="a*" flags="i" input="" result="y"/>
<test id="p409" regex="(abc|)ef" flags="i" input="ABCDEF" result="y"/>
<test id="p489" regex="^(){3,5}" input="abc" result="y"/>
<test id="p662" regex="$" input="a\nb\n" result="y"/>
<test id="p665" regex="$" input="b\na\n" result="y"/>
<test id="p668" regex="$" input="b\na" result="y"/>
<test id="p671" regex="$" flags="m" input="a
b
" result="y"/>
<test id="p674" regex="$" flags="m" input="b
a
" result="y"/>
<test id="p677" regex="$" flags="m" input="b
a" result="y"/>
<test id="p908" regex="(\w)?(abc)\1b" input="abcab" result="y"/>
<test id="p955" regex="(.*)\d+\1" input="abc12bc" result="y"/>
<test id="p995" regex="^.{3,4}(.+)\1$" input="foobarbar" result="y"/>
<test id="p996" regex="^(?:f|o|b){3,4}(.+)\1$" input="foobarbar" result="y"/>
<test id="p997" regex="^.{3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p998" regex="^(?:f|o|b){3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p999" regex="^.{3,4}(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1000" regex="^(?:f|o|b){3,4}(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1001" regex="^.{3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1002" regex="^(?:f|o|b){3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1003" regex="^.{2,3}?(.+)\1$" input="foobarbar" result="y"/>
<test id="p1004" regex="^(?:f|o|b){2,3}?(.+)\1$" input="foobarbar" result="y"/>
<test id="p1005" regex="^.{2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p1006" regex="^(?:f|o|b){2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p1007" regex="^.{2,3}?(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1008" regex="^(?:f|o|b){2,3}?(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1009" regex="^.{2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1010" regex="^(?:f|o|b){2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1016" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/>
<test id="p1035" regex="X(A|B||C|D)Y" input="XXXYYY" result="y"/>
<test id="p1042" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1043" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1044" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1045" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1046" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1047" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/>
<test id="p1048" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1049" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1050" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1051" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1052" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1053" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/>
<test id="p1228" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/>
<test id="p1236" regex="(WORDS|WORLD|WORD)+S" input="WORDS" result="y"/>
<test id="p1601" regex="^\p{L}" input="㐀" result="y"/>
<test id="p1684" regex="^(foo|)bar$" input="bar" result="y"/>
<test id="p1685" regex="^(foo||baz)bar$" input="bar" result="y"/>
<test id="p1689" regex="^(?:foo|)bar$" input="bar" result="y"/>
<test id="p1690" regex="^(?:foo||baz)bar$" input="bar" result="y"/>
<test id="p1705" regex="^(.)(?:(.)+)*[BX]" input="ABCDE" result="y"/>
The ^$ stuff should work now. Please retest it.
And back-references are only partially supported, because full support for irregular expression back-references does need a backtracking algorithm, and FLRE avoids backtracking as much as possible. More details see this followed article series (whereupon FLRE was also modeled): https://swtch.com/~rsc/regexp/regexp1.html https://swtch.com/~rsc/regexp/regexp2.html and https://swtch.com/~rsc/regexp/regexp3.html
And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.
The ^$ stuff should work now. Please retest it.
Now it complains that it matches too much
And back-references are only partially supported, because full support for irregular expression
It is not really that irregular.
Seems mostly to be that if the capture group is optional and does not match anything in the input, the backreference should match the empty string. For that you could even remove both of them during the expansion
And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.
here
github: "Unfortunately, we don’t support that file type. " ...
here as png
And the list of missing Unicode classes: \p{C} \p{Cc} \p{Cf} \p{Cn} \p{Co} \p{IsAlphabeticPresentationForms} \p{IsArabic} \p{IsArmenian} \p{IsArrows} \p{IsBasicLatin} \p{IsBengali} \p{IsBlockElements} \p{IsBopomofo} \p{IsBopomofoExtended} \p{IsBoxDrawing} \p{IsBraillePatterns} \p{IsByzantineMusicalSymbols} \p{IsCherokee} \p{IsCJKCompatibility} \p{IsCJKCompatibilityForms} \p{IsCJKCompatibilityIdeographs} \p{IsCJKCompatibilityIdeographsSupplement} \p{IsCJKRadicalsSupplement} \p{IsCJKSymbolsandPunctuation} \p{IsCJKUnifiedIdeographs} \p{IsCJKUnifiedIdeographsExtensionA} \p{IsCJKUnifiedIdeographsExtensionB} \p{IsCombiningDiacriticalMarks} \p{IsCombiningHalfMarks} \p{IsCombiningMarksforSymbols} \p{IsControlPictures} \p{IsCurrencySymbols} \p{IsCyrillic} \p{IsDeseret} \p{IsDevanagari} \p{IsDingbats} \p{IsEnclosedAlphanumerics} \p{IsEnclosedCJKLettersandMonths} \p{IsEthiopic} \p{IsGeneralPunctuation} \p{IsGeometricShapes} \p{IsGeorgian} \p{IsGothic} \p{IsGreek} \p{IsGreekExtended} \p{IsGujarati} \p{IsGurmukhi} \p{IsHalfwidthandFullwidthForms} \p{IsHangulCompatibilityJamo} \p{IsHangulJamo} \p{IsHangulSyllables} \p{IsHebrew} \p{IsHighSurrogates} \p{IsHiragana} \p{IsIdeographicDescriptionCharacters} \p{IsIPAExtensions} \p{IsKanbun} \p{IsKangxiRadicals} \p{IsKannada} \p{IsKatakana} \p{IsKhmer} \p{IsLao} \p{IsLatinExtendedAdditional} \p{IsLetterlikeSymbols} \p{IsLowSurrogates} \p{IsMalayalam} \p{IsMathematicalAlphanumericSymbols} \p{IsMathematicalOperators} \p{IsMiscellaneousSymbols} \p{IsMiscellaneousTechnical} \p{IsMongolian} \p{IsMusicalSymbols} \p{IsMyanmar} \p{IsNumberForms} \p{IsOgham} \p{IsOldItalic} \p{IsOpticalCharacterRecognition} \p{IsOriya} \p{IsPrivateUse} \p{IsRunic} \p{IsSinhala} \p{IsSmallFormVariants} \p{IsSpacingModifierLetters} \p{IsSpecials} \p{IsSuperscriptsandSubscripts} \p{IsSyriac} \p{IsTags} \p{IsTamil} \p{IsTelugu} \p{IsThaana} \p{IsThai} \p{IsTibetan} \p{IsUnifiedCanadianAboriginalSyllabics} \p{IsYiRadicals} \p{IsYiSyllables} \p{L} \p{Ll} \p{Lm} \p{Lo} \p{Lt} \p{Lu} \p{M} \p{Mc} \p{Me} \p{Mn} \p{N} \p{Nd} \p{Nl} \p{No} \p{P} \p{Pc} \p{Pd} \p{Pe} \p{Pf} \p{Pi} \p{Po} \p{Ps} \p{S} \p{Sc} \p{Sk} \p{Sm} \p{So} \p{Z} \p{Zl} \p{Zp} \p{Zs}
The Unicode classes were not missing, instead the code for the \p{} parsing had just a bug, which I've fixed now.
Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}
And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match
Is there an option to make the escapes independent of the case sensitivity? So \p{Lu}
only matches upper case letters, even if matched case insensitive.
Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}
And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match
The InBasicLatin parsing bug is fixed now. And I'll add IsCombiningMarksforSymbols later.
But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server). The whole FLRE unicode support stuff is automatically converted from these offical Unicode data files, see src\Unicode\FLREConvertUnicode.dpr for the OfficalUnicodeResourceData->FLREData-converter-code.
Is there an option to make the escapes independent of the case sensitivity? So \p{Lu} only matches upper case letters, even if matched case insensitive.
Hm just use (?-i)\p{Lu}(?i)
? :)
But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server).
What other stuff?
Armenian is there. Range, 1328;ToChar:1423. And \p{IsArmenian} works for all the assigned code points https://en.wikipedia.org/wiki/Armenian_%28Unicode_block%29 except 0x589 (Armenian fullstop).
Hm just use (?-i)\p{Lu}(?i) ? :)
That would make sense. Guess a preprocessor could output that, even if it would not allow it as input
But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1
should still match sUtsut
0x589 (Armenian fullstop).
Perhaps because that is in the Armerian block, but the Common script
But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1 should still match sUtsut
Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.
For example: (s[uU]t)\1
will transformed and compiled internally into (s[uU]t)(?hidden-internal-capture-group:clone-capturing-group-1:s[uU]t)(?hidden-internal-verify-operation:is-capture-group-1-equal-to-hidden-internal-capture-group-1)
.
This is necessary, because, as already said, FLRE avoids backtracking as possible (except at the BitstateNFA algorithm for short and simple non-complex regexs), and the Thompson-NFA (Parallel-Threaded-NFA) algorithm can emulate backreferences only in that hacky way, since the Thompson-NFA (Parallel-Threaded-NFA) algorithm in FLRE does never backtrack instead it processes every input char exactly once. And the DFA code in FLRE sees (s[uU]t)\1
just only as s[uU]ts[uU]t
and leaves it to the non-backtracking *NFA algorithms the final verification.
Full real support for back-references would do need a full back-tracking-algorithm implementation, and back-tracking can be very slow and is even too vulnerable to ReDoS attacks (Regular expression Denial of Service, see https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS and https://en.wikipedia.org/wiki/ReDoS and even https://swtch.com/~rsc/regexp/regexp1.html ), therefore FLRE avoids back-tracking as possible, and this consequently FLRE has only emulated partial hacky support for back-references.
And as a small side-note: the Google RE2 regexp engine (which was the role model for FLRE) doesn't support back-references at all for these reasons.
Any questions left regarding the back-references topic? :)
Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.
Sounds complicated
Perhaps it could remove the ?i when cloning the group?
and back-tracking can be very slow
I do not really need the speed. It just needs to pass the test cases that have require this specific syntax.
My suggestion (and my offer for you) would be a syntax like (s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t) or something like that, where (?P=x:y) (as a extension of the normal alternative (?P=x) backreference syntax) would be a compare-to-another-capturing-group capturing-group, where x = compare-with-capturing-group-index-or-named-group-name and y = manual-by-hand cloned but modified regex. Anything else would mean more effort and complicate the code, I think.
that would be great
Another one that should match (using the matches function from the above png):
matches('abc, foobar','(^|) *abc *(,|$)')
(?i)(s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t)
and (^|) *abc *(,|$)
should work now. I've improved the beginning-and-ending-text-anchor-detection for to disable some speed optimizations then, and that had fixed at least (^|) *abc *(,|$)
.
Thanks. Now I need to figure out how to combine it with my preprocessor
If there are parenthesis inside the (?P=..) does that create new capturing groups?
Tried to test it, but found this instead:
matches('aaxyy', '([a-c])\1(x)(y)\3', '');
matches('aAxyy', '([a-c])\1(x)(y)\3', 'i');
If there are parenthesis inside the (?P=..) does that create new capturing groups?
Hm yes, you can use here non-capturing-groups (?:) for example (?:\w+)
Tried to test it, but found this instead:
matches('aaxyy', '([a-c])\1(x)(y)\3', ''); matches('aAxyy', '([a-c])\1(x)(y)\3', 'i');
Should fixed now in the lastest commit.
The InBasicLatin parsing bug is fixed now.
It was IsBasicLatin in this test
Although it seems it gives both. Some even have \p{script=...} or \p{block=...}
You are aware that it is still failing the abracadabra-abracadabra-3
case, etc. ?
Hello BeRo1985,
I'm one of Xidel's users; benibela's project that seems to using your FLRE-library. He's dead set on updating Xidel's regular expression engine before fixing other bugs and releasing another update at all, but relies on you to do so. If you would fix benibela's reported issues, besides helping benibela you would also help me in the process. Thanks!
Perhaps i should close this general issue and open new ones for every sub issue.
Might be easier to handle
f := TFLRE.Create('(124|864|377|3)', []); f.ExtractAll('abracadabra-abracadabra-3',matches) ;
Some more observations with one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|twenty-one|twenty-two|twenty-three|twenty-four|twenty-five|twenty-six|twenty-seven|twenty-eight|twenty-nine|thirty|thirty-one|thirty-two|thirty-three|thirty-four|thirty-five|thirty-six|thirty-seven|thirty-eight|thirty-nine|forty|forty-one|forty-two|forty-three|forty-four|forty-five|forty-six|forty-seven|forty-eight|forty-nine|fifty|fifty-one|fifty-two|fifty-three|fifty-four|fifty-five|fifty-six|fifty-seven|fifty-eight|fifty-nine|sixty|sixty-one|sixty-two|sixty-three|sixty-four|sixty-five|sixty-six|sixty-seven|sixty-eight|sixty-nine|seventy
It fails with most common prefixes and none of six seven twenty thirty forty fifty sixty
are matched.
And it depends on the order of the alternatives. When it is sorted to descending two|twenty-two|twenty-three|twenty-six|twenty-seven|twenty-one|twenty-nine|twenty-four|twenty-five|twenty-eight|twenty|twelve|three|thirty-two|thirty-three|thirty-six|thirty-seven|thirty-one|thirty-nine|thirty-four|thirty-five|thirty-eight|thirty|thirteen|ten|sixty-two|sixty-three|sixty-six|sixty-seven|sixty-one|sixty-nine|sixty-four|sixty-five|sixty-eight|sixty|sixteen|six|seventy|seventeen|seven|one|nineteen|nine|fourteen|four|forty-two|forty-three|forty-six|forty-seven|forty-one|forty-nine|forty-four|forty-five|forty-eight|forty|five|fifty-two|fifty-three|fifty-six|fifty-seven|fifty-one|fifty-nine|fifty-four|fifty-five|fifty-eight|fifty|fifteen|eleven|eighteen|eight
only thirty
and fifty
are not matched
But sorted ascending to eight|eighteen|eleven|fifteen|fifty|fifty-eight|fifty-five|fifty-four|fifty-nine|fifty-one|fifty-seven|fifty-six|fifty-three|fifty-two|five|forty|forty-eight|forty-five|forty-four|forty-nine|forty-one|forty-seven|forty-six|forty-three|forty-two|four|fourteen|nine|nineteen|one|seven|seventeen|seventy|six|sixteen|sixty|sixty-eight|sixty-five|sixty-four|sixty-nine|sixty-one|sixty-seven|sixty-six|sixty-three|sixty-two|ten|thirteen|thirty|thirty-eight|thirty-five|thirty-four|thirty-nine|thirty-one|thirty-seven|thirty-six|thirty-three|thirty-two|three|twelve|twenty|twenty-eight|twenty-five|twenty-four|twenty-nine|twenty-one|twenty-seven|twenty-six|twenty-three|twenty-two|two
gives no improvement over the unsorted case.
If the regex is parenthesed to (one)|(two)|(three)|(four)|(five)|(six)|(seven)|(eight)|(nine)|(ten)|(eleven)|(twelve)|(thirteen)|(fourteen)|(fifteen)|(sixteen)|(seventeen)|(eighteen)|(nineteen)|(twenty)|(twenty-one)|(twenty-two)|(twenty-three)|(twenty-four)|(twenty-five)|(twenty-six)|(twenty-seven)|(twenty-eight)|(twenty-nine)|(thirty)|(thirty-one)|(thirty-two)|(thirty-three)|(thirty-four)|(thirty-five)|(thirty-six)|(thirty-seven)|(thirty-eight)|(thirty-nine)|(forty)|(forty-one)|(forty-two)|(forty-three)|(forty-four)|(forty-five)|(forty-six)|(forty-seven)|(forty-eight)|(forty-nine)|(fifty)|(fifty-one)|(fifty-two)|(fifty-three)|(fifty-four)|(fifty-five)|(fifty-six)|(fifty-seven)|(fifty-eight)|(fifty-nine)|(sixty)|(sixty-one)|(sixty-two)|(sixty-three)|(sixty-four)|(sixty-five)|(sixty-six)|(sixty-seven)|(sixty-eight)|(sixty-nine)|(seventy)
it works in all cases. But with non-capturing parentheses (?:one)|(?:two)|(?:three)|(?:four)|(?:five)|(?:six)|(?:seven)|(?:eight)|(?:nine)|(?:ten)|(?:eleven)|(?:twelve)|(?:thirteen)|(?:fourteen)|(?:fifteen)|(?:sixteen)|(?:seventeen)|(?:eighteen)|(?:nineteen)|(?:twenty)|(?:twenty-one)|(?:twenty-two)|(?:twenty-three)|(?:twenty-four)|(?:twenty-five)|(?:twenty-six)|(?:twenty-seven)|(?:twenty-eight)|(?:twenty-nine)|(?:thirty)|(?:thirty-one)|(?:thirty-two)|(?:thirty-three)|(?:thirty-four)|(?:thirty-five)|(?:thirty-six)|(?:thirty-seven)|(?:thirty-eight)|(?:thirty-nine)|(?:forty)|(?:forty-one)|(?:forty-two)|(?:forty-three)|(?:forty-four)|(?:forty-five)|(?:forty-six)|(?:forty-seven)|(?:forty-eight)|(?:forty-nine)|(?:fifty)|(?:fifty-one)|(?:fifty-two)|(?:fifty-three)|(?:fifty-four)|(?:fifty-five)|(?:fifty-six)|(?:fifty-seven)|(?:fifty-eight)|(?:fifty-nine)|(?:sixty)|(?:sixty-one)|(?:sixty-two)|(?:sixty-three)|(?:sixty-four)|(?:sixty-five)|(?:sixty-six)|(?:sixty-seven)|(?:sixty-eight)|(?:sixty-nine)|(?:seventy)
there is no improvement
Or with dates jan|january|feb|february|mar|march|apr|april|may|may|jun|june|jul|july|aug|august|sep|september|oct|october|nov|november|dec|december
Does not match any of mar jun jul
See issue #44. Commenting out the Prefix factoring resolves the disjunction issues above, i,e
f := TFLRE.Create('(124|864|377|3)', []);
f.ExtractAll('abracadabra-abracadabra-3',matches)
and the months work.
This issue was too big. It is hard to see which test cases have been fixed and which ones are still failing.
I have opened a separate issue for each group of test cases: #65 #67 #68 #69 #70 #72 #73 #74
The XQTS knows the following regexs that should match, but do not: