Closed rrthomas closed 3 years ago
You've definitely gone quite deep down that rabbit hole already!
The fn:tokenize
implementation predates the integration (and existence) of xspattern
. Given that there are a number of differences between the regular expression languages used by XPath vs. the one in JavaScript, our goal is to eventually have all regex functions (fn:matches
, fn:replace
, fn:tokenize
and fn:analyze-string
) be backed by xspattern
. There are two parts to making that happen: the regex language and the xspattern
API.
While xspattern already implements the full regular expression language defined by XML Schema, the regex language defined by XPath modifies that specification in some ways and adds a few features. The bwrrp/xspattern.js#9 issue keeps track of what has been implemented and what features are still missing. Captures may be fairly straightforward by emitting a few record instructions in the whynot
program to mark the start and end of each capture group. Greedy vs. reluctant matching can probably be implemented by filtering the traces produced by whynot
. I'm not sure if it's possible to implement backreferences using the whynot
-based approach, we might need an alternative implementation for those.
I've not yet taken the time to look at what API shape would be most convenient to support the additional use cases for regexes in fontoxpath
. At first glance we'd probably want something that returns position information for any matches / captures found while applying the regex to the input, but in a way that picks the spec-compliant solution out of the (possibly) multiple solutions that may be returned by whynot
. Supporting multiple matches per input string could then be done by either looping the whynot
program, or perhaps by optionally allowing the accept
instruction to succeed even if there is input remaining and executing the program again from the end of the previous match detected this way.
I'm taking some time off traveling abroad right now, which is why I am not responding too often. Thanks Stef for covering.
Commenting on this to point out one more detail in fn:replace that is annoying: there is afaik no notion of greedyness or reluctancy: instead, in cases of multiple traces through a choice: the first option in source order must be taken:
If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative. (https://www.w3.org/TR/xpath-functions-31/#func-replace)
This got me stuck the first time around.
Backreferences may be possible by acting like the input matches anything and filtering later. Worry about performance when we get there.
Awesome discussion this, happy to see support for patterns coming up more and more.
Op vr 23 jul. 2021 20:05 schreef Stef Busking @.***>:
You've definitely gone quite deep down that rabbit hole already!
The fn:tokenize implementation predates the integration (and existence) of xspattern. Given that there are a number of differences between the regular expression languages used by XPath vs. the one in JavaScript, our goal is to eventually have all regex functions (fn:matches, fn:replace, fn:tokenize and fn:analyze-string) be backed by xspattern. There are two parts to making that happen:
While xspattern already implements the full regular expression language defined by XML Schema, the regex language defined by XPath modifies that specification in some ways and adds a few features. The bwrrp/xspattern.js#9 https://github.com/bwrrp/xspattern.js/issues/9 issue keeps track of what has been implemented and what features are still missing. Captures may be fairly straightforward by emitting a few record instructions in the whynot program to mark the start and end of each capture group. Greedy vs. reluctant matching can probably be implemented by filtering the traces produced by whynot. I'm not sure if it's possible to implement backreferences using the whynot-based approach, we might need an alternative implementation for those.
I've not yet taken the time to look at what API shape would be most convenient to support the additional use cases for regexes in fontoxpath. At first glance we'd probably want something that returns position information for any matches / captures found while applying the regex to the input, but in a way that picks the spec-compliant solution out of the (possibly) multiple solutions that may be returned by whynot. Supporting multiple matches per input string could then be done by either looping the whynot program, or perhaps by optionally allowing the accept instruction to succeed even if there is input remaining and executing the program again from the end of the previous match detected this way.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/FontoXML/fontoxpath/issues/400#issuecomment-885810223, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGKEJEDP3S5A2VGA3IRSK3TZGVPRANCNFSM5A3S74FQ .
The "matching in source order" is presumably the reason why you can't simply compile an XPath regex to a JavaScript regex?
Transpiling XPath regexes to JS may be possible in many cases, but some features (such as backreferences) are not available in the JS RegExp language, and care should probably be taken to make sure that character classes and such are interpreted correctly w.r.t. the unicode spec.
~I just shipped a new release: 3.20.0. This includes basic support for the fn:pattern
function. Thanks a lot for the help!!~
NVM, overlooked the fact this has not landed yet
It's in! 3.21.0 has this!
I just looked into this. I would like to be able to use the
functx
library. I have gone through it and commented out all the functions that fontoxpath cannot currently compile. There are quite a lot, but they depend on a relatively few missing XQuery functions. The most interesting, for me, isreplace
.I had a look at
CONTRIBUTING.md
and started to dig in. I see that thematches
implementation usesxspatterns
, which in turn relies onwhynot
. Currently,xspatterns
only returns a boolean, which is enough formatches
but not forreplace
.whynot
in turn will return the traces that matched, which is enough forreplace
, but will still need some work to feed throughxspatterns
, which would also need extending to implement captures.Meanwhile, I see that the fontoxpath implementation of
tokenize
does not usexspatterns
at all, but JavaScript'sRegExp
! (Again, not surprising, asxspatterns
would not be sufficient in this case.)I am unsure how to proceed.
I see that the implementation of
matches
does not support theflags
argument; in the first instance, this is fine for me, as I don't needflags
for my use ofreplace
.I would be happy to implement
replace
using JavaScript regexes directly, which would be simple, but presumably not strictly XQuery-compliant. However, fontoxpath already does this fortokenize
.It looks fairly simple to extend
xspatterns
to return the match; butreplace
(andtokenize
) need all the non-overlapping matching (effectively ag
flag).If you can let me know what would work for the fontoxpath maintainers, I can consider how I might be able to help; in the mean time, I suspect I will simply implement the desired
functx
functions directly in JavaScript. (In the first instance, I simply wanttrim
: as you can see, I've already gone down quite a deep rabbit hole in an effort to see how to do this "properly"!)