BaseXdb / basex

BaseX Main Repository.
http://basex.org
BSD 3-Clause "New" or "Revised" License
661 stars 267 forks source link

XQuery: Regular expressions cont. #2240

Closed ChristianGruen closed 9 months ago

ChristianGruen commented 10 months ago

@GuntherRademacher I must take back my statement. I have indeed overlooked various tests that are now failing:

(: expected: true :)
(every $s in tokenize('abcxyz}', ',') satisfies matches($s, '^(?:[a-\}]+)$')) and (every $s in tokenize('', ',') satisfies not(matches($s, '^(?:[a-\}]+)$'))),

(: expected: FORX0002 :)
matches('qwerty','[a-\\]'),
matches('qwerty','[a-\[]'),
matches('qwerty','foo([a-\d]*)bar'),
matches('qwerty','([5-\D]*)bar'),
matches('qwerty','foo([6-\s]*)bar'),
matches('qwerty','foo([c-\S]*)'),
matches('qwerty','foo([7-\w]*)'),
matches('qwerty','foo([a-\W]*)bar'),
matches('qwerty','([f-\p{Lu}]\w*)\s([\p{Lu}]\w*)'),
matches('qwerty','([1-\P{Ll}][\p{Ll}]*)\s([\P{Ll}][\p{Ll}]*)'),
(every $s in tokenize('a-x', ',') satisfies matches($s, '^(?:[a-a-x-x]+)$')) and
(every $s in tokenize('j,a-b', ',') satisfies not(matches($s, '^(?:[a-a-x-x]+)$')))
ChristianGruen commented 10 months ago

With the last fix, all previously successful tests are running again.

I found at least three more tests that return the wrong results (with both BaseX 10.7 and the current code base):

matches('x', '(a)|\1'),
matches('a', '(?:(b)?a)\1'),
matches('babadad', '^((.)?a\2)+$')

Taken from:

I assume that the patterns of the tests with the IDs p888p891 should also be rejected:

for $p in ('([\d-z]+)', '([\d-z]+)', '([\d-\s]+)', '([\d-\s]+)')
let $success := try { matches('x', $p) } catch * { 'expected' }
return $p || ': ' || $success

…no hurries!

GuntherRademacher commented 9 months ago

Hi @ChristianGruen,

today I had a look at the XQuery tests relating to fn:matches. The test run reports a total of 7 problems:

<results>
  <fail>
    <test id="p294" regex="(a)|\1" input="x" result="y"/>
  </fail>
  <fail>
    <test id="p295" regex="(?:(b)?a)\1" input="a" result="y"/>
  </fail>
  <fail>
    <test id="p303" regex="^((.)?a\2)+$" input="babadad" result="y"/>
  </fail>
  <fail>
    <test id="p888" regex="([\d-z]+)" input="a0-za" result="Sy"/>
  </fail>
  <fail>
    <test id="p889" regex="([\d-z]+)" input="-" result="sc"/>
  </fail>
  <fail>
    <test id="p890" regex="([\d-\s]+)" input="a0- z" result="Sy"/>
  </fail>
  <fail>
    <test id="p891" regex="([\d-\s]+)" input="-" result="sc"/>
  </fail>
</results>

Tests p294, p295, and p303 fail, because the JRE's regex engine does not match backreferences, if the corresponding group has not been matched. The reason for not matching it is

I am proposing two changes for handling this:

I still have to work on the actual changes.

The patterns in tests p888-p891should be rejected, because according to XML Schema Part 2,

The - character is a valid character range only at the beginning or end of a ·positive character group·.

In these cases, it follows a MultiCharEsc, so it is the not the meta character of an seRange, but would have to go as a XmlCharIncDash, forming a single character charRange. But the above says that this is only valid at the begin or end of a charGroup.

I think that the pattern in p303 also is not valid with respect to the specification of backreferences, which says

The regular expression is invalid if a back-reference refers to a capturing sub-expression that does not exist or whose closing right parenthesis occurs after the back-reference.

Should we reject it? If yes, that would call for a change of qt4tests and qt3tests.

ChristianGruen commented 9 months ago

@GuntherRademacher Your analysis reads fine, as do your suggestions for fixing p294 and p295.

For the other patterns, it would be great if you could report this back to https://github.com/w3c/qt3tests (as suggested), either in an issue or via a pull request. Thanks!

ChristianGruen commented 9 months ago

PS: Commits to qt3tests are occasionally migrated to the qt4tests repository (see https://github.com/w3c/qt3tests/issues/54).

GuntherRademacher commented 9 months ago

Sorry for the confusion: the pattern in p303 of course is valid, because group 2 is closed before the back-reference occurs. Also the code to check for this condition is in place and works as it should.