crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.32k stars 1.61k forks source link

Support `PCRE` `MARK` verb #11320

Open Blacksmoke16 opened 2 years ago

Blacksmoke16 commented 2 years ago

https://www.pcre.org/original/doc/html/pcrepattern.html and search for Recording which path was taken. Would need to add some additional bindings to LibPCRE and expose this value via MatchData#mark. Something like:

match = /X(*:A)Y|X(*:B)Z/.match("XZ").not_nil!
match.mark # => "B"

Related to https://github.com/crystal-lang/crystal/issues/8199, one con of this would be locking us into PCRE, given I'm not so sure other engines support it. However maybe https://github.com/crystal-lang/crystal/issues/3852 would fix it by optimizing the regex to be more performant?

straight-shoota commented 2 years ago

What's the use case?

This would manifest very engine-specific behaviour in our generic stdlib API. I do see that it might be interesting sometimes to know how a regular expression matched. But we should be fairly certain that it's really useful if we want to go that way. It really doesn't feel that important too me.

Blacksmoke16 commented 2 years ago

@straight-shoota The idea is that you can use the MARK value to key into a hash that stores extra data about the match. My primary use case is for a router I been prototyping, which follows the approach Symfony uses:

return [
    false, // $matchHost
    [ // $staticRoutes
        '/' => [[['_route' => 'app_index'], null, null, null, false, false, null]],
    ],
    [ // $regexpList
        0 => '{^(?'
                .'|/add/([^/]++)(?:/([^/]++))?(*:34)'
            .')/?$}sD',
    ],
    [ // $dynamicRoutes
        34 => [
            [['_route' => 'app_add', 'val2' => null], ['val1', 'val2'], null, null, false, true, null],
            [null, null, null, null, false, false, 0],
        ],
    ],
    null, // $checkCondition
];

in that they dump out an array with the route data and a regex that when matched can be used to access information about the route, like it's default values, variables, etc.

When combined with #11325 (and possibly more so with #11331 if it goes PCRE2 route), this leads to a very very efficient and performant router, at least based on some initial discovery:

/get/books/23/chapters
retour   1.67M (600.29ns) (± 0.90%)    944B/op   3.42× slower
 amber 999.95k (  1.00µs) (± 0.80%)  1.24kB/op   5.70× slower
 radix   1.65M (606.45ns) (± 0.81%)    432B/op   3.46× slower
Athena   5.70M (175.43ns) (± 1.69%)    128B/op        fastest
straight-shoota commented 2 years ago

I'd call that a very specific use case. I'm not sure this is really necessary for a standard implementation.

Your router could surely just use custom libpcre bindings to get what you want for this particular use case.

In light of #11331, I would certainly not invest in this feature unless we're sure it will continue to work when we switch to a different engine.

Blacksmoke16 commented 2 years ago

In light of #11331, I would certainly not invest in this feature unless we're sure it will continue to work when we switch to a different engine.

Yea, it's not implemented in any other engine as far as I know. Which I guess is why I'm a bit biased on the PCRE2 path 😉. I'll convert my PR to a draft until we come to a decision on that.

Your router could surely just use custom libpcre bindings to get what you want for this particular use case.

My thinking there was if we do go the PCRE2 route it probably wouldn't hurt to include it mainly because it's pretty straightforward. Even if we go the PCRE2 route but don't want to include it, afaik, there isn't a real clean way to even implement this as a shard or something without totally overriding some initializers.

EDIT: Or I guess creating a specialized custom Regex class would also work. but 🤷. EDIT2: I just went the route of creating a specialized regex class in the router that is backed via PCRE2 with fast path API. Given i only need this feature in 1 place. So this way other regexes can use standard implementation.

Blacksmoke16 commented 1 year ago

So now that we decided to go the PCRE2 route, this issue is essentially unblocked. Of which it can be implemented for both versions to remain backwards compatible. However the main question I have is how to best implement it given the engines have been abstracted out.

Probably be best to add it to the public API, but maybe add a note the underlying regex engine needs to support it? I.e. if we switch to something that doesn't it'll just always return nil or something. While in our case both PCRE versions can fill it in. Tho could also handle that down the line given the engines aren't really meant to be easily swapped out at the moment.

straight-shoota commented 1 year ago

I'm just a bit hesitant about adding support for such lesser-used regex features to the stdlib API. Your use case for routing looks good, but there don't seem to be many more (at least nobody else has asked for it yet). So this doesn't really feel like a stdlib feature to me. Sure, implementing isn't hard. And I think we don't need to even support PCRE for a new feature. If you want to use it, update to PCRE2. But still we'd be supporting something that's more of a niche feature, which means caring for maintenance. And as you mentioned, a potential problems for switching regex engines (which isn't really planned at this point and we should hopefully be good with PCRE2 for a very long time, so I don't worry too much about that).

Surely, you can also use a custom API for such specific features in your shard as you currently do. IMO that's not a bad choice for this. Perhaps it could be extracted to a generic shard for easy re-usability. There are other more specialized regex features that could fit there as well (for example partial matching). I guess it would be left to figure out if and how this could be integrated with the stdlib Regex class, though. But I think I'd prefer outsourcing extended regex features into a shard over integrating them into stdlib.

Blacksmoke16 commented 1 year ago

I guess my main thought is it just feels less than ideal to maintain essentially a duplicate of the code/bindings just so I can add a handful of lines to access data that's already there but just not exposed. Then you need to make sure all the regexes that need it are using this custom implementation, can't really make use of regex literals, and any bug fixes/new features to stdlib Regex types need to be back ported.

Now that we're going for PCRE2, there is less code that would need to be duplicated, and I could prob monkey patch things in, but those core methods (like the match data constructor and such) are things that would have to be touched and kept up to date which is kind of annoying.

Maybe there is room for a cleaner way to add additional Regex features that doesn't involve duplicating/monkey patching the stdlib implementation at all.