Adding support for `fn:replace`

rrthomas commented 3 years ago

I just looked into this. I would like to be able to use the functx library. I have gone through it and commented out all the functions that fontoxpath cannot currently compile. There are quite a lot, but they depend on a relatively few missing XQuery functions. The most interesting, for me, is replace.

I had a look at CONTRIBUTING.md and started to dig in. I see that the matches implementation uses xspatterns, which in turn relies on whynot. Currently, xspatterns only returns a boolean, which is enough for matches but not for replace. whynot in turn will return the traces that matched, which is enough for replace, but will still need some work to feed through xspatterns, which would also need extending to implement captures.

Meanwhile, I see that the fontoxpath implementation of tokenize does not use xspatterns at all, but JavaScript's RegExp! (Again, not surprising, as xspatterns would not be sufficient in this case.)

I am unsure how to proceed.

I see that the implementation of matches does not support the flags argument; in the first instance, this is fine for me, as I don't need flags for my use of replace.

I would be happy to implement replace using JavaScript regexes directly, which would be simple, but presumably not strictly XQuery-compliant. However, fontoxpath already does this for tokenize.

It looks fairly simple to extend xspatterns to return the match; but replace (and tokenize) need all the non-overlapping matching (effectively a g flag).

If you can let me know what would work for the fontoxpath maintainers, I can consider how I might be able to help; in the mean time, I suspect I will simply implement the desired functx functions directly in JavaScript. (In the first instance, I simply want trim: as you can see, I've already gone down quite a deep rabbit hole in an effort to see how to do this "properly"!)

bwrrp commented 3 years ago

You've definitely gone quite deep down that rabbit hole already!

The fn:tokenize implementation predates the integration (and existence) of xspattern. Given that there are a number of differences between the regular expression languages used by XPath vs. the one in JavaScript, our goal is to eventually have all regex functions (fn:matches, fn:replace, fn:tokenize and fn:analyze-string) be backed by xspattern. There are two parts to making that happen: the regex language and the xspattern API.

While xspattern already implements the full regular expression language defined by XML Schema, the regex language defined by XPath modifies that specification in some ways and adds a few features. The bwrrp/xspattern.js#9 issue keeps track of what has been implemented and what features are still missing. Captures may be fairly straightforward by emitting a few record instructions in the whynot program to mark the start and end of each capture group. Greedy vs. reluctant matching can probably be implemented by filtering the traces produced by whynot. I'm not sure if it's possible to implement backreferences using the whynot-based approach, we might need an alternative implementation for those.

I've not yet taken the time to look at what API shape would be most convenient to support the additional use cases for regexes in fontoxpath. At first glance we'd probably want something that returns position information for any matches / captures found while applying the regex to the input, but in a way that picks the spec-compliant solution out of the (possibly) multiple solutions that may be returned by whynot. Supporting multiple matches per input string could then be done by either looping the whynot program, or perhaps by optionally allowing the accept instruction to succeed even if there is input remaining and executing the program again from the end of the previous match detected this way.

DrRataplan commented 3 years ago

I'm taking some time off traveling abroad right now, which is why I am not responding too often. Thanks Stef for covering.

Commenting on this to point out one more detail in fn:replace that is annoying: there is afaik no notion of greedyness or reluctancy: instead, in cases of multiple traces through a choice: the first option in source order must be taken:

If two alternatives within the pattern both match at the same position in the $input, then the match that is chosen is the one matched by the first alternative. (https://www.w3.org/TR/xpath-functions-31/#func-replace)

This got me stuck the first time around.

Backreferences may be possible by acting like the input matches anything and filtering later. Worry about performance when we get there.

Awesome discussion this, happy to see support for patterns coming up more and more.

Op vr 23 jul. 2021 20:05 schreef Stef Busking @.***>:

You've definitely gone quite deep down that rabbit hole already!

The fn:tokenize implementation predates the integration (and existence) of xspattern. Given that there are a number of differences between the regular expression languages used by XPath vs. the one in JavaScript, our goal is to eventually have all regex functions (fn:matches, fn:replace, fn:tokenize and fn:analyze-string) be backed by xspattern. There are two parts to making that happen:

While xspattern already implements the full regular expression language defined by XML Schema, the regex language defined by XPath modifies that specification in some ways and adds a few features. The bwrrp/xspattern.js#9 https://github.com/bwrrp/xspattern.js/issues/9 issue keeps track of what has been implemented and what features are still missing. Captures may be fairly straightforward by emitting a few record instructions in the whynot program to mark the start and end of each capture group. Greedy vs. reluctant matching can probably be implemented by filtering the traces produced by whynot. I'm not sure if it's possible to implement backreferences using the whynot-based approach, we might need an alternative implementation for those.

I've not yet taken the time to look at what API shape would be most convenient to support the additional use cases for regexes in fontoxpath. At first glance we'd probably want something that returns position information for any matches / captures found while applying the regex to the input, but in a way that picks the spec-compliant solution out of the (possibly) multiple solutions that may be returned by whynot. Supporting multiple matches per input string could then be done by either looping the whynot program, or perhaps by optionally allowing the accept instruction to succeed even if there is input remaining and executing the program again from the end of the previous match detected this way.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/FontoXML/fontoxpath/issues/400#issuecomment-885810223, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGKEJEDP3S5A2VGA3IRSK3TZGVPRANCNFSM5A3S74FQ .

rrthomas commented 3 years ago

The "matching in source order" is presumably the reason why you can't simply compile an XPath regex to a JavaScript regex?

bwrrp commented 3 years ago

Transpiling XPath regexes to JS may be possible in many cases, but some features (such as backreferences) are not available in the JS RegExp language, and care should probably be taken to make sure that character classes and such are interpreted correctly w.r.t. the unicode spec.

DrRataplan commented 3 years ago

~I just shipped a new release: 3.20.0. This includes basic support for the fn:pattern function. Thanks a lot for the help!!~

NVM, overlooked the fact this has not landed yet

DrRataplan commented 3 years ago

It's in! 3.21.0 has this!

FontoXML / fontoxpath

Adding support for `fn:replace` #400