Open lizan opened 5 years ago
RFC 6570 is great for creating URIs, but seems mum on actual matching patterns. What would be the plan for turning URI Templates into matchers?
I ask because we're in need of a more flexible route-matching, but are trying to avoid exposing the complexity (both in user-friendliness and in runtime speed) of regexes. We're thinking of building something based on "glob" operators:
The "glob" operators here follow common UNIX command-line conventions in order to be as familiar and flexible as possible. For non-URL pattern matching, such as in header/cookie values and/or in CORS policies, the
*
operator matches any number of (zero or more) characters.
- The "*" operator, which matches zero or more characters, up to the next path separator ("/")
- Example: "/api/v1/*" would match "/api/v1/get" but not "/api/v1/resourceName/post"
- Similarly, “/videos/*.ts” would match “/videos/foo.ts” but not “/videos/id/1234/foo.ts”
- The "**" operator, which matches zero or more characters, ignoring any path separators.
- Example: "/videos/**.ts" would match "/videos/hd/18313/output/hevc/segment-000001.ts", "/videos/sd/90574/output/dash/segment-000555.ts" and "/videos/test/foo.ts"
- The "?" operator, which matches exactly one character.
- This can only be specified once (use
*
or**
if you need to match multiple characters)- Example: "/api/v?/**" would match "/api/v1/hello/get" but not "/api/v1beta1/hello/get"
The gRPC transcoder extends the RFC 6570 syntax so that expansion expressions can take matcher expressions. The matcher expressions are of the form {name=TEMPLATE}
where TEMPLATE
can expand to a combination of literals and glob operators (including *
and **
!). See http.proto for more details.
gRPC's TEMPLATE
doesn't allow single-character ?
, but presumably that wouldn't be hard to add especially since ?
is already a reserved character for gRPC's TEMPLATE
syntax.
We're not wedded to our particular syntax, but do need globbing, so something like gRPC could work for us.
What's the best way to move forward?
http.proto with ?
added SGTM as the API here.
@mattklein123 @lizan @markdroth @envoyproxy/api-shepherds WDYT?
Also tagging @louiscryan @costinm for Istio.
I talked to @htuch about this offline and in general I'm in favor. I don't feel very strongly about the language so I would go with whatever the simplest thing is that satisfies the known use cases. In terms of Envoy implementation/API details, I think the easiest thing to do is going to be to implement a new regex engine here: https://github.com/envoyproxy/envoy/blob/b23ee3b438050228b7c7b0097d182ac828fa2247/api/envoy/type/matcher/v3/regex.proto#L49-L53
Each engine implicitly has a supported language that is self enforced. I think we could just implement a new engine which has a very limited language and offer that as an option. I admittedly haven't thought through all the details, but I think this will be very simple from an API perspective to get working.
One other consideration of Justins example above is that the matcher emits an attribute/metadata by tagging match segments. This is useful for rewrites on forward and has uses in other policy contexts. Same would be true for regex groups.
@htuch
Yep, I think this is a pretty powerful approach. @louiscryan can we confirm that from the Istio-side that http.proto
would be a good choice?
@qiwzhang @nareddyt @TAOXUY can you guys take a look at this? I know you previously had an efficient trie-based implementation of this URI template pattern matching in a custom Envoy filter. This would at least be good prior art, if not the basis of this new implementation.
It would be ideal to have first-class support for URI templates in Envoy. For the time being, our product ESPv2 converts URI templates into regex route matchers when configuring the router.
I ask because we're in need of a more flexible route-matching, but are trying to avoid exposing the complexity (both in user-friendliness and in runtime speed) of regexes.
I know you previously had an efficient trie-based implementation of this URI template pattern matching in a custom Envoy filter. This would at least be good prior art, if not the basis of this new implementation.
Before we switched to converting URI templates to regex route matchers during config time, we has a custom filter and C++ library that supported URI templates in the data plane. It is similar to the library currently used by the gRPC transcoder filter. The trie-based implementation allows O(log(n))
path matching.
However, we moved away from this approach because the Envoy platform team had security concerns about introducing a handwritten parser into Envoy's dataplane.
I think the easiest thing to do is going to be to implement a new regex engine here. Each engine implicitly has a supported language that is self enforced. I think we could just implement a new engine which has a very limited language and offer that as an option.
It sounds like the current approach is to add first-class support for URI templates in the route configuration, but internally envoy will still use regex matchers to parse the URI template. Can we move to using the trie-based parser library and remove the intermediate regex? Otherwise I do not see much value added - users can convert the URI templates to regex at config time like we do currently, it's just moving that logic into Envoy.
Per my understanding of mattklein123 idea of new engine_type inside RegexMatcher, it is NOT using regex. This new type can be implemented by grpc path_matcher which supports full URL template syntax. This change is very isolated, will not impact other code. I think its a good idea, it will achieve the same goal of not using regex to implement url_template.
For example
oneof engine_type {
option (validate.required) = true;
// Google's RE2 regex engine.
GoogleRE2 google_re2 = 1 [(validate.rules).message = {required: true}];
// Google url tempalte, /foo/{x=**} or /bar/*/foo
string google_url_template = 2;
}
// The regex match string. The string must be supported by the configured engine.
string regex = 2 [(validate.rules).string = {min_len: 1}]; <=== can be empty for google_rul_template
}
Implementation wise I'm good with either internal safe_regex translation or the gRPC-JSON transcoder library if we can persuade ourselves it is robust here. @qiwzhang points is correct, Matt is suggesting we use the pluggable "regex" API message, not that this is actually a regex.
@qiwzhang what is you take on fuzzing and code quality/review for the path matcher?
Either way we go, we have code that requires tests, fuzzing and review. In particular, the tests/fuzzing are basically common if we treat this like a blackbox.
Yes, the name RegexMatch
is confusing after adding google_url_template
. Maybe can rename it.
path_matcher code quality should be ok. It already has fuzz tests for its http_template code, we can add fuzz tests for path_matcher code too.
I think Url Template will be used only for RouteMatch. Instead adding a new engine_type for RegexMatch, we can add url_template
to RouteMatch as one of path_specifier.
How about:
oneof path_specifier {
option (validate.required) = true;
...
ConnectMatcher connect_matcher = 12;
// Support google url tempalte, /foo/{x=**} or /bar/*/foo
string url_template = 13;
}
path_matcher trie-based implementation may have potential of achieve O(1) speed with n RouteMatch entries.
If n of RouteMatch entries use the new url_template, they can build into one patch_match tree. There will be one tree lookup for all n route entries.
For example, if we have 3 route entries with:
match:
url_template: /foo/{x=*}
...
match:
url_template: /bar/{y=*}
...
match:
url_template: /bar/{y=*}/**
...
All of above 3 url_template route entries can be build into one path_match tree. Just one tree lookup, it can tell if there is any match. Instead of O(n) for route match search, it will be O(1).
Such optimization could be transparent to users, it is just an internal implementation detail.
I'm a big fan of tries for URL matching and there's no doubt it is faster than iterating a list of regex's, but FWIW, I'm pretty sure this is not O(1), it's O(height of tree) isn't it? I believe in the worst case it is basically O(n) e.g.
match:
url_template: /{x}
...
match:
url_template: /{x}/{y}
...
match:
url_template: /{x}/{y}/{z}
...
path = "/a/b/c"
The trie is still a bit better than the worst case of iterating a list of regex's which would be O(n*m), where m is the avg complexity of evaluating a single regex. Of course the trie does even better in real life, but claiming the trie is always constant time is misleading.
Yes, you are correct. @josheinhorn. It is O(h), h = height of the tree, which is the maximum number of path segments from all url_templates.
FWIW, I would really prefer not to encode this in xDS as a type of regex, since it's really not -- it's a completely different template-based matching syntax, so calling it a regex is very confusing. Instead, I think we should make this another type inside of StringMatcher
.
+1 to not to encode this in xDS as a type of regex.
Also I think this is not as simple as adding new matcher type. The HTTP URI template path matcher need all matching rules to build the trie, so it doesn't work well with current first-match-wins rule. Perhaps we need the tree based route matcher first? cc @snowp
Hi @lizan , here is my idea of grouping multiple url_template matches into a path_matcher _tree; not all route entries need to use url_template, the route order need to be preserved too. so only group the adjacent url_template route entries into a tree.
For example
match1:
prefix: /bar
match2:
url_template: /{x}
match3:
url_template: /bar/{x}
match4:
path: /foo/bar
match5:
url_template: /foo/{x}
We will group match2 and match3 into a tree, and match5 into another tree.
My suggestion to put it in the regex message was for API simplicity. I don't feel that strongly about it. If it's easy to put it elsewhere we can do that.
My suggestion would be to not intertwine with FIFO vs. O(1). Can't we add this with implicit FIFO support first and then deal with the sub-linear support when we add full sub-linear tree matching for main route selection?
Yes, sub-linear should be deferred to when we take the generalized matchers from @snowp and bring them to RouteConfiguration IMHO.
@qiwzhang In that case you can't build the URI matcher in the same trie? It is not ideal works though.
OK, for now, I will just implement one path_matcher tree per one url_template, not to combine multiple ones into one tree.
@nareddyt I understand from https://github.com/envoyproxy/envoy/issues/7763#issuecomment-789260026 the possible advantage of grouping related matches into a tree, but in #15493 we're racing RE2 and the path matcher head:head, surely they should, for a single match, be comparable?
@qiwzhang - since #15299 tackles the path matching only - do you also intend to include expanding the RouteAction to support template-based rewrites as part of this overall effort?
We may also need to clarify how prefix_rewrite
works (if at all) in conjunction with url_template
, and decide on the Envoy-side design (naming, behaviour) for a url_template_rewrite
(or similar) field.
Looking at #15299 taught me a bunch about the syntax we may want to allow. Two points came out of that:
/**.m3u8
or /media/*.m4s
. The gRPC transcoding http.proto syntax supports suffix matches by abusing the :verb
syntax, but it’s not very explicit. In particular, the http.proto syntax allows the verb to match any “LITERAL” value rather than being restricted to HTTP verbs, so our examples could be encoded as /**:.m3u8
or /media/*:.m4s
, which works but is probably not what we want in the Envoy API.With those lessons in mind, here's a more detailed syntax proposal that similar to the http.proto syntax, but fixes these problems.
We propose wildcard support based on match patterns and templates. A match pattern matches an incoming URL path. Template patterns are used to re-write URLs.
Match patterns support glob operators to match URL text and variable definitions to bind matched text to names.
Template patterns build new URLs and may reference variables bound by a match pattern.
match = "/" match_segments [ suffix ]
match_segments = match_segment [ "/" match_segments ]
match_segment = operator | variable | LITERAL
variable = "{" IDENT [ "=" var_value ] "}"
var_value = operator | LITERAL [ "/" var_value ]
operator = path_glob | text_glob
path_glob = "*"
text_glob = "**"
template = "/" template_segments [ suffix ]
template_segments = template_segment [ "/" template_segment ]
template_segment = reference | LITERAL
reference = "{" IDENT "}"
suffix = LITERAL
An IDENT
can be any string matching the regular expression [a-zA-Z][a-zA-Z0-9_]
.
A LITERAL
matches any sequence of characters URL path excluding the reserved characters /
, *
, **
, {
, =
, }
, and ?
.
The path_glob
operator matches 1 or more LITERAL
characters until the next path separator /
. The path_glob operator is greedy.
The text_glob operator matches 0 or more characters in the set of {LITERAL
union /
}. (The text glob operator ignores /
as a separator.) The text_glob
operator is greedy. If a text_glob
operator appears, it must be the last operator in the pattern.
/**.m3u8
would match /foo.m3u8
and /foo/bar.m3u8
./{dir_name}/*.ts
would match /example/file.ts
and bind dir_name="example"
for a later template match to use./{dir_name}/**.ts
would match /example/path/file.ts
and bind dir_name="example"
for a later template match to use. This would also match /example/.ts
, which may or may not be a desired behavior./{path=v1/*}/{file=*.ts}
would match /v1/example/movie.ts
(binding path="v1/example"
and file="movie"
), but would not match /v0/example/movie.ts
./media/**/*/**
is not allowed because at most one **
operator is allowed and it must be the last one./api/v*/endpoint
and /api/*v/endpoint
are not allowed because a path segment cannot have a prefix or suffix. Only the last match can have a suffix./api/{version.major}/{version.minor}
is not allowed because of the .
in the variable name.SGTM; @lizan @markdroth @mattklein123 @qiwzhang @josheinhorn @nareddyt are you folks happy to commit to this as PoR?
SGTM
Note that there are some extensions to this grammar that we could implement in the future by broadening the restricted syntax proposed above:
/api/{version=v*}/1234
to extract versions starting with “v”/api/{version=*beta}/1234
to extract versions ending in beta.I left these extensions out of the proposal above because they do make things more complicated, they're incompatible with http.proto, and no one is asking for them (yet?). However, I believe they would be backwards-compatible changes.
LGTM except one part:
The variable names in http.proto are supposed to map to protocol buffer messages and sub-messages and, as such, they allow embedded dots. I'm not sure we want that for pattern matching, especially if it's only going to be used to name things for re-writes.
This gives me pause since it's a divergence from the existing syntax, and I don't entirely see how the inclusion of periods in the variable names conflicts with periods in the value. For example, what is wrong with /{file.dir}/{file.name=*.ts}
? My primary concern is that there is already a lot of prior art using the google.api.http
annotation that uses periods in variable names and I can imagine developers wanting to somehow re-use their existing annotations via Envoy. It may also create some confusion.
Would be very useful to be able to use dynamic Metadata or header values I the re-write rules
Would be awesome to have that supported.
Description: This was a part of discussion for https://github.com/envoyproxy/envoy/issues/7728 but a separate issue, to add full or subset of URI Template (RFC6570) based route matching. The URI Template is already used in OpenAPI spec and gRPC JSON transcoding binding.
It will reduce the use case for regex as I see many use of regex is to support this.