lizan commented 5 years ago

Description: This was a part of discussion for https://github.com/envoyproxy/envoy/issues/7728 but a separate issue, to add full or subset of URI Template (RFC6570) based route matching. The URI Template is already used in OpenAPI spec and gRPC JSON transcoding binding.

It will reduce the use case for regex as I see many use of regex is to support this.

justin-mp commented 3 years ago

RFC 6570 is great for creating URIs, but seems mum on actual matching patterns. What would be the plan for turning URI Templates into matchers?

I ask because we're in need of a more flexible route-matching, but are trying to avoid exposing the complexity (both in user-friendliness and in runtime speed) of regexes. We're thinking of building something based on "glob" operators:

The "glob" operators here follow common UNIX command-line conventions in order to be as familiar and flexible as possible. For non-URL pattern matching, such as in header/cookie values and/or in CORS policies, the * operator matches any number of (zero or more) characters.

The "*" operator, which matches zero or more characters, up to the next path separator ("/")

Example: "/api/v1/*" would match "/api/v1/get" but not "/api/v1/resourceName/post"

Similarly, “/videos/*.ts” would match “/videos/foo.ts” but not “/videos/id/1234/foo.ts”

The "**" operator, which matches zero or more characters, ignoring any path separators.

Example: "/videos/**.ts" would match "/videos/hd/18313/output/hevc/segment-000001.ts", "/videos/sd/90574/output/dash/segment-000555.ts" and "/videos/test/foo.ts"

The "?" operator, which matches exactly one character.

This can only be specified once (use * or ** if you need to match multiple characters)

Example: "/api/v?/**" would match "/api/v1/hello/get" but not "/api/v1beta1/hello/get"

The gRPC transcoder extends the RFC 6570 syntax so that expansion expressions can take matcher expressions. The matcher expressions are of the form {name=TEMPLATE} where TEMPLATE can expand to a combination of literals and glob operators (including * and **!). See http.proto for more details.

gRPC's TEMPLATE doesn't allow single-character ?, but presumably that wouldn't be hard to add especially since ? is already a reserved character for gRPC's TEMPLATE syntax.

We're not wedded to our particular syntax, but do need globbing, so something like gRPC could work for us.

What's the best way to move forward?

htuch commented 3 years ago

http.proto with ? added SGTM as the API here.

@mattklein123 @lizan @markdroth @envoyproxy/api-shepherds WDYT?

Also tagging @louiscryan @costinm for Istio.

mattklein123 commented 3 years ago

I talked to @htuch about this offline and in general I'm in favor. I don't feel very strongly about the language so I would go with whatever the simplest thing is that satisfies the known use cases. In terms of Envoy implementation/API details, I think the easiest thing to do is going to be to implement a new regex engine here: https://github.com/envoyproxy/envoy/blob/b23ee3b438050228b7c7b0097d182ac828fa2247/api/envoy/type/matcher/v3/regex.proto#L49-L53

Each engine implicitly has a supported language that is self enforced. I think we could just implement a new engine which has a very limited language and offer that as an option. I admittedly haven't thought through all the details, but I think this will be very simple from an API perspective to get working.

louiscryan commented 3 years ago

One other consideration of Justins example above is that the matcher emits an attribute/metadata by tagging match segments. This is useful for rewrites on forward and has uses in other policy contexts. Same would be true for regex groups.

@htuch

htuch commented 3 years ago

Yep, I think this is a pretty powerful approach. @louiscryan can we confirm that from the Istio-side that http.proto would be a good choice?

josheinhorn commented 3 years ago

@qiwzhang @nareddyt @TAOXUY can you guys take a look at this? I know you previously had an efficient trie-based implementation of this URI template pattern matching in a custom Envoy filter. This would at least be good prior art, if not the basis of this new implementation.

nareddyt commented 3 years ago

It would be ideal to have first-class support for URI templates in Envoy. For the time being, our product ESPv2 converts URI templates into regex route matchers when configuring the router.

I ask because we're in need of a more flexible route-matching, but are trying to avoid exposing the complexity (both in user-friendliness and in runtime speed) of regexes.

13320 added performance benchmarks for Envoy's router with regex routes that emulate common OpenAPI path templates. We decided the performance impact was acceptable and implemented our own URI template to regex converter in golang. This approach has been working fine so far.

I know you previously had an efficient trie-based implementation of this URI template pattern matching in a custom Envoy filter. This would at least be good prior art, if not the basis of this new implementation.

Before we switched to converting URI templates to regex route matchers during config time, we has a custom filter and C++ library that supported URI templates in the data plane. It is similar to the library currently used by the gRPC transcoder filter. The trie-based implementation allows O(log(n)) path matching.

However, we moved away from this approach because the Envoy platform team had security concerns about introducing a handwritten parser into Envoy's dataplane.

I think the easiest thing to do is going to be to implement a new regex engine here. Each engine implicitly has a supported language that is self enforced. I think we could just implement a new engine which has a very limited language and offer that as an option.

It sounds like the current approach is to add first-class support for URI templates in the route configuration, but internally envoy will still use regex matchers to parse the URI template. Can we move to using the trie-based parser library and remove the intermediate regex? Otherwise I do not see much value added - users can convert the URI templates to regex at config time like we do currently, it's just moving that logic into Envoy.

qiwzhang commented 3 years ago

Per my understanding of mattklein123 idea of new engine_type inside RegexMatcher, it is NOT using regex. This new type can be implemented by grpc path_matcher which supports full URL template syntax. This change is very isolated, will not impact other code. I think its a good idea, it will achieve the same goal of not using regex to implement url_template.

For example

 oneof engine_type {
    option (validate.required) = true;

    // Google's RE2 regex engine.
    GoogleRE2 google_re2 = 1 [(validate.rules).message = {required: true}];

    // Google url tempalte, /foo/{x=**} or /bar/*/foo
    string google_url_template = 2;
  }

  // The regex match string. The string must be supported by the configured engine.
  string regex = 2 [(validate.rules).string = {min_len: 1}];   <=== can be empty for google_rul_template
}

htuch commented 3 years ago

Implementation wise I'm good with either internal safe_regex translation or the gRPC-JSON transcoder library if we can persuade ourselves it is robust here. @qiwzhang points is correct, Matt is suggesting we use the pluggable "regex" API message, not that this is actually a regex.

@qiwzhang what is you take on fuzzing and code quality/review for the path matcher?

Either way we go, we have code that requires tests, fuzzing and review. In particular, the tests/fuzzing are basically common if we treat this like a blackbox.

qiwzhang commented 3 years ago

Yes, the name RegexMatch is confusing after adding google_url_template. Maybe can rename it.

path_matcher code quality should be ok. It already has fuzz tests for its http_template code, we can add fuzz tests for path_matcher code too.

qiwzhang commented 3 years ago

I think Url Template will be used only for RouteMatch. Instead adding a new engine_type for RegexMatch, we can add url_template to RouteMatch as one of path_specifier.

How about:

    oneof path_specifier {
        option (validate.required) = true;

        ...

       ConnectMatcher connect_matcher = 12;

       // Support google url tempalte, /foo/{x=**} or /bar/*/foo
       string url_template = 13;
    }

qiwzhang commented 3 years ago

path_matcher trie-based implementation may have potential of achieve O(1) speed with n RouteMatch entries.

If n of RouteMatch entries use the new url_template, they can build into one patch_match tree. There will be one tree lookup for all n route entries.

For example, if we have 3 route entries with:

  match:
     url_template: /foo/{x=*}
  ...
  match:
     url_template: /bar/{y=*}
  ...
  match:
     url_template: /bar/{y=*}/**
  ...

All of above 3 url_template route entries can be build into one path_match tree. Just one tree lookup, it can tell if there is any match. Instead of O(n) for route match search, it will be O(1).

Such optimization could be transparent to users, it is just an internal implementation detail.

josheinhorn commented 3 years ago

I'm a big fan of tries for URL matching and there's no doubt it is faster than iterating a list of regex's, but FWIW, I'm pretty sure this is not O(1), it's O(height of tree) isn't it? I believe in the worst case it is basically O(n) e.g.

  match:
     url_template: /{x}
  ...
  match:
     url_template: /{x}/{y}
  ...
  match:
     url_template: /{x}/{y}/{z}
  ...

path = "/a/b/c"

The trie is still a bit better than the worst case of iterating a list of regex's which would be O(n*m), where m is the avg complexity of evaluating a single regex. Of course the trie does even better in real life, but claiming the trie is always constant time is misleading.

qiwzhang commented 3 years ago

Yes, you are correct. @josheinhorn. It is O(h), h = height of the tree, which is the maximum number of path segments from all url_templates.

markdroth commented 3 years ago

FWIW, I would really prefer not to encode this in xDS as a type of regex, since it's really not -- it's a completely different template-based matching syntax, so calling it a regex is very confusing. Instead, I think we should make this another type inside of StringMatcher.

lizan commented 3 years ago

+1 to not to encode this in xDS as a type of regex.

Also I think this is not as simple as adding new matcher type. The HTTP URI template path matcher need all matching rules to build the trie, so it doesn't work well with current first-match-wins rule. Perhaps we need the tree based route matcher first? cc @snowp

qiwzhang commented 3 years ago

Hi @lizan , here is my idea of grouping multiple url_template matches into a path_matcher _tree; not all route entries need to use url_template, the route order need to be preserved too. so only group the adjacent url_template route entries into a tree.

For example

 match1:
   prefix: /bar
 match2:
   url_template: /{x}
 match3: 
   url_template: /bar/{x}
 match4:
   path: /foo/bar
 match5:
   url_template: /foo/{x}

We will group match2 and match3 into a tree, and match5 into another tree.

mattklein123 commented 3 years ago

My suggestion to put it in the regex message was for API simplicity. I don't feel that strongly about it. If it's easy to put it elsewhere we can do that.

My suggestion would be to not intertwine with FIFO vs. O(1). Can't we add this with implicit FIFO support first and then deal with the sub-linear support when we add full sub-linear tree matching for main route selection?

htuch commented 3 years ago

Yes, sub-linear should be deferred to when we take the generalized matchers from @snowp and bring them to RouteConfiguration IMHO.

lizan commented 3 years ago

@qiwzhang In that case you can't build the URI matcher in the same trie? It is not ideal works though.

qiwzhang commented 3 years ago

OK, for now, I will just implement one path_matcher tree per one url_template, not to combine multiple ones into one tree.

nareddyt commented 3 years ago

15493 updates the router benchmarks for URL templates. Surprisingly, the URL template path matcher is much slower than regex based matching. Most likely because route matches are FIFO (as noted above). So we search a separate tree for every single route instead of a single tree.

htuch commented 3 years ago

@nareddyt I understand from https://github.com/envoyproxy/envoy/issues/7763#issuecomment-789260026 the possible advantage of grouping related matches into a tree, but in #15493 we're racing RE2 and the path matcher head:head, surely they should, for a single match, be comparable?

elithrar commented 3 years ago

@qiwzhang - since #15299 tackles the path matching only - do you also intend to include expanding the RouteAction to support template-based rewrites as part of this overall effort?

We may also need to clarify how prefix_rewrite works (if at all) in conjunction with url_template, and decide on the Envoy-side design (naming, behaviour) for a url_template_rewrite (or similar) field.

justin-mp commented 3 years ago

Looking at #15299 taught me a bunch about the syntax we may want to allow. Two points came out of that:

Our case really wants to do file name extension matches, like /**.m3u8 or /media/*.m4s. The gRPC transcoding http.proto syntax supports suffix matches by abusing the :verb syntax, but it’s not very explicit. In particular, the http.proto syntax allows the verb to match any “LITERAL” value rather than being restricted to HTTP verbs, so our examples could be encoded as /**:.m3u8 or /media/*:.m4s, which works but is probably not what we want in the Envoy API.
The variable names in http.proto are supposed to map to protocol buffer messages and sub-messages and, as such, they allow embedded dots. I'm not sure we want that for pattern matching, especially if it's only going to be used to name things for re-writes.

With those lessons in mind, here's a more detailed syntax proposal that similar to the http.proto syntax, but fixes these problems.

Match Patterns and Templates

We propose wildcard support based on match patterns and templates. A match pattern matches an incoming URL path. Template patterns are used to re-write URLs.

Match patterns support glob operators to match URL text and variable definitions to bind matched text to names.

Template patterns build new URLs and may reference variables bound by a match pattern.

Syntax

match = "/" match_segments [ suffix ]
match_segments = match_segment [ "/" match_segments ]
match_segment = operator | variable | LITERAL
variable = "{" IDENT [ "=" var_value ] "}"
var_value = operator | LITERAL [ "/" var_value ]
operator = path_glob | text_glob
path_glob = "*"
text_glob = "**"

template = "/" template_segments [ suffix ]
template_segments = template_segment [ "/" template_segment ]
template_segment = reference | LITERAL
reference = "{" IDENT "}"

suffix = LITERAL

An IDENT can be any string matching the regular expression [a-zA-Z][a-zA-Z0-9_].

A LITERAL matches any sequence of characters URL path excluding the reserved characters /, *, **, {, =, }, and ?.

The path_glob operator matches 1 or more LITERAL characters until the next path separator /. The path_glob operator is greedy.

The text_glob operator matches 0 or more characters in the set of {LITERAL union /}. (The text glob operator ignores / as a separator.) The text_glob operator is greedy. If a text_glob operator appears, it must be the last operator in the pattern.

Examples

Allowed Rules:

/**.m3u8 would match /foo.m3u8 and /foo/bar.m3u8.
/{dir_name}/*.ts would match /example/file.ts and bind dir_name="example" for a later template match to use.
/{dir_name}/**.ts would match /example/path/file.ts and bind dir_name="example" for a later template match to use. This would also match /example/.ts, which may or may not be a desired behavior.
/{path=v1/*}/{file=*.ts} would match /v1/example/movie.ts (binding path="v1/example" and file="movie"), but would not match /v0/example/movie.ts.

Invalid Rules:

/media/**/*/** is not allowed because at most one ** operator is allowed and it must be the last one.
/api/v*/endpoint and /api/*v/endpoint are not allowed because a path segment cannot have a prefix or suffix. Only the last match can have a suffix.
/api/{version.major}/{version.minor} is not allowed because of the . in the variable name.

htuch commented 3 years ago

SGTM; @lizan @markdroth @mattklein123 @qiwzhang @josheinhorn @nareddyt are you folks happy to commit to this as PoR?

lizan commented 3 years ago

SGTM

justin-mp commented 3 years ago

Note that there are some extensions to this grammar that we could implement in the future by broadening the restricted syntax proposed above:

FUTURE1: Prefix globs within a segment. For example /api/{version=v*}/1234 to extract versions starting with “v”
FUTURE2: Suffix globs within a segment. For example: /api/{version=*beta}/1234 to extract versions ending in beta.
FUTURE3: Globs on file extensions, for example “/files/out.{ext}”. We get this for free if we implement FUTURE1 above. However, if we to be compatible with http.proto’s “verb” feature, this won’t work because http.proto doesn’t allow anything but literal values in the “verb” position

I left these extensions out of the proposal above because they do make things more complicated, they're incompatible with http.proto, and no one is asking for them (yet?). However, I believe they would be backwards-compatible changes.

josheinhorn commented 3 years ago

LGTM except one part:

The variable names in http.proto are supposed to map to protocol buffer messages and sub-messages and, as such, they allow embedded dots. I'm not sure we want that for pattern matching, especially if it's only going to be used to name things for re-writes.

This gives me pause since it's a divergence from the existing syntax, and I don't entirely see how the inclusion of periods in the variable names conflicts with periods in the value. For example, what is wrong with /{file.dir}/{file.name=*.ts}? My primary concern is that there is already a lot of prior art using the google.api.http annotation that uses periods in variable names and I can imagine developers wanting to somehow re-use their existing annotations via Envoy. It may also create some confusion.

jespersoderlund commented 2 years ago

Would be very useful to be able to use dynamic Metadata or header values I the re-write rules

roberth1988 commented 7 months ago

Would be awesome to have that supported.

envoyproxy / envoy

support URI Template based route match #7763

13320 added performance benchmarks for Envoy's router with regex routes that emulate common OpenAPI path templates. We decided the performance impact was acceptable and implemented our own URI template to regex converter in golang. This approach has been working fine so far.

15493 updates the router benchmarks for URL templates. Surprisingly, the URL template path matcher is much slower than regex based matching. Most likely because route matches are FIFO (as noted above). So we search a separate tree for every single route instead of a single tree.

Match Patterns and Templates

Syntax

Examples

Allowed Rules:

Invalid Rules: