BaseXdb / basex

BaseX Main Repository.
http://basex.org
BSD 3-Clause "New" or "Revised" License
661 stars 267 forks source link

Fixes for failing regex tests (fixes #2240) #2247

Closed GuntherRademacher closed 9 months ago

GuntherRademacher commented 9 months ago

This PR contains three changes addressing three problems:

  1. eb1394b0188025129021b6495018a1d334efbfaf: a literal dash was allowed in a regex when it followed a MultiCharEsc, violating the restriction that it is only allowed at the begin or end of its charGroup. An extra check has been added verifying this. This fixes tests p888-p891.
  2. cce82a319f2d57a5bc97acf14360370acd2158cc: backrefs where not considered to match, when the corresponding group is optional and did not match anything - this is the JRE's regex engine's way of handling this. It was changed by making the group mandatory and wrapping its content in an optional non-capturing group. This fixes tests p295 and p303.
  3. 8769b7db98d0449fb3ac57cb7a3e05530d7ab492: backrefs were not considered to match, when they are in a different branch of the same set of alternatives as the corresponding group. They are expected to match empty in this case, though. This has been fixed by just omitting such backrefs while serializing the parsed expression into a Java pattern. This fixes test p294.

While these changes fix the corresponding tests, we are operating here close to the borders of what can be done with translating to Java regexes. There may be more cases where the match-empty restriction for backref'ed groups produces unexpected results. On the other hand, all of those are most likely corner cases of minor importance. If you'd prefer to solve this by a using a different (new?) regex implementation, please let me know.

ChristianGruen commented 9 months ago

While these changes fix the corresponding tests, we are operating here close to the borders of what can be done with translating to Java regexes. There may be more cases where the match-empty restriction for backref'ed groups produces unexpected results. On the other hand, all of those are most likely corner cases of minor importance. If you'd prefer to solve this by a using a different (new?) regex implementation, please let me know.

@GuntherRademacher Thanks for your efforts. I think it’s perfectly fine to stick with the existing regex conversion… But I was still positively surprised that you managed to fix the remaining corner cases.

In https://github.com/BaseXdb/basex/issues/2240#issuecomment-1723188208, you indicated that p888-p891 could be invalid. Do you think we should report this back, or have you come to the conclusion that the tests conform to the specs?

GuntherRademacher commented 9 months ago

@ChristianGruen You may be referring to the inital edit of my comment. When I wrote it, I was under the impression that the patterns were valid, which however missed this,

The - character is a valid character range only at the beginning or end of a [·positive character group·]

Given this, those patterns are invalid and should be rejected, so the tests are OK. I had then edited my comment half an hour later, before you commented on it.

ChristianGruen commented 9 months ago

Perfect. You are right, I had read both the initial and the most recent commit, and I must have mixed them up.