[request] support for routing based on path?

DavidTPate commented 9 years ago

Hi there! Been messing with this a little bit as I'm vetting API gateways to use. I really like the idea of the project but I'm running into an issue where my main routing to services is done based upon path instead of Host.

Was wondering if I'm just overlooking the feature or if there has been any talk of the addition of such functionality. Details below.

Say, for example, I have 3 services running (let's call them auth, pizza, and tacos).

As best as I understand it, Kong is set up to handle this where each of these services might be available internally at auth.internaldomain.com, pizza.internaldomain.com, and tacos.internaldomain.com, and the routing is done based upon someone's intention to go to auth.externaldomain.com, pizza.externaldomain.com, or tacos.externaldomain.com.

Instead my setup deals with routing to services based upon a part of the path. So, the equivalent calls would be made to api.externaldomain.com/auth, api.externaldomain.com/pizza, or api.externaldomain.com/tacos.

From looking through the docs, this doesn't seem to be a feature that currently exists. I'd love to hear your thoughts on it.

PierreKircher commented 9 years ago

+1 simply for the cost of ssl certs

montanaflynn commented 9 years ago

Great idea! I've added a request label, maybe @thefosk or @thibaultCha could chime in on the implementation details as I don't think this is something that could be added as a plugin since the routing is handled by the Kong core.

subnetmarco commented 9 years ago

This is a feature I have been thinking about for a while, and I really like the idea. Will draft out a simple implementation.

PierreKircher commented 9 years ago

A proxy is a good approach, I guess. We can have them standalone next to each other

like

api1.example.com api2.example.com api3.example.com

and after that we set up a new subdomain and add the "api id" + a location folder.

Not sure if it's that trivial, but that would not change the way the actual routing works; instead it acts as a complementary layer.

Just my 2 cents here. Please ignore if I'm being misleading with a simplistic approach.

rafeequl commented 9 years ago

:+1:

I've been thinking of this feature for making an abstraction layer between consumer and internal API.

DavidTPate commented 9 years ago

Glad to hear that you guys think it is a useful feature. Another thing to keep in mind is that this would also likely be used for micro-services and splitting up functionality.

So keeping with my previous example and just focusing on the auth service available at api.externaldomain.com/auth: I would have a route set up to send the path /auth to the server at auth.internaldomain.com.

If I noticed that my login functionality was taking lots of hits and causing me to scale even though the rest of my endpoints weren't experiencing much load, I would split this up as follows: api.externaldomain.com/auth/login would be sent to login.internaldomain.com, and api.externaldomain.com/auth would keep going to auth.internaldomain.com.

I think this comes down to just the fact that order or specificity matters. I think order (where first match is the one that matters) is the more common approach due to it being simpler to implement. So in my example the order of my APIs would be something like this:

api.externaldomain.com/auth/login -> login.internaldomain.com
api.externaldomain.com/auth -> auth.internaldomain.com
api.externaldomain.com/pizza -> pizza.internaldomain.com
api.externaldomain.com/tacos -> tacos.internaldomain.com
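
For illustration, the first-match ordering listed above could be sketched like this in Python. This is only a sketch: `ROUTES`, `matches` and `resolve` are hypothetical names, not anything Kong provides, and matching is assumed to be a plain prefix check on path segments.

```python
# Hypothetical ordered route table mirroring the list above: first match wins.
ROUTES = [
    ("/auth/login", "login.internaldomain.com"),
    ("/auth", "auth.internaldomain.com"),
    ("/pizza", "pizza.internaldomain.com"),
    ("/tacos", "tacos.internaldomain.com"),
]


def matches(prefix, path):
    """True if path equals prefix or continues it at a '/' boundary."""
    return path == prefix or path.startswith(prefix + "/")


def resolve(path, routes=ROUTES):
    """Return the target of the first route whose prefix matches, else None."""
    for prefix, target in routes:
        if matches(prefix, path):
            return target
    return None
```

If `/auth` were listed before `/auth/login`, it would shadow the more specific route, which is why the order matters with this approach.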

thibaultcha commented 9 years ago

I like the idea :+1:

Your latest example (with an order of priority to define the routing) is something that we are used to seeing in code, but I am curious as to what it would look like when configuring it with simple API calls... A priority value maybe?

DavidTPate commented 9 years ago

As I've been going through trying out other API gateways I thought Tyk had some easy to understand options that were useful even in my simple testing. For the configuration of their Proxy piece they have a few simple options which I think would serve as a good starting point.

listen_path - The path to listen on, e.g. /api or /
target_url - This defines the target URL that the request should be proxied to. (just needs to be a valid URL, can include port/protocols/etc)
strip_listen_path - By setting this to true, Tyk will attempt to replace the listen_path in the outgoing request with an empty string. In the above scenario, a request to /listen-path/widgets/new proxied to http://your.api.com/api/ would normally become http://your.api.com/api/listen-path/widgets/new; with this option set, the outgoing request is instead http://your.api.com/api/widgets/new.
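
To make the effect of strip_listen_path concrete, here is a small Python sketch of the behaviour described in the quoted option. It only illustrates that description and is not Tyk's implementation; `outgoing_url` is a hypothetical helper.

```python
def outgoing_url(target_url, request_path, listen_path, strip_listen_path):
    """Build the proxied URL, optionally dropping listen_path from the request path."""
    path = request_path
    if strip_listen_path and path.startswith(listen_path):
        # Replace the listen path with an empty string, as the option describes.
        path = path[len(listen_path):]
    # Join target and path with exactly one slash between them.
    return target_url.rstrip("/") + "/" + path.lstrip("/")
```

With `target_url` set to http://your.api.com/api/ and a request to /listen-path/widgets/new, the helper yields the two URLs from the description depending on the flag.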

DavidTPate commented 9 years ago

@thibaultCha Yeah, it's a weird one when dealing with an API. You obviously don't want to go by creation date, or update the entire set of APIs. The way that AWS does it for things like Network ACLs which I think is simple is that they have you put in a kind of sort order value. So keeping with my previous example:

| Rule Number | Path | Target |
|---|---|---|
| 100 | `/auth/login` | `login.internaldomain.com` |
| 200 | `/auth` | `auth.internaldomain.com` |
| 300 | `/pizza` | `pizza.internaldomain.com` |
| 400 | `/tacos` | `tacos.internaldomain.com` |

I think it clearly shows the order (you would just need to make sure Rule Number is unique when sent), and provides an easy way for me to insert something between /pizza and /tacos by simply giving it a rule number such as 350.
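
A rough Python sketch of the rule-number idea, assuming rules are kept in a dict keyed by a unique number and evaluated in ascending order. The `/burgers` route and the helper names are made up for the example.

```python
# Rule number -> (path, target), as in the table above.
rules = {
    100: ("/auth/login", "login.internaldomain.com"),
    200: ("/auth", "auth.internaldomain.com"),
    300: ("/pizza", "pizza.internaldomain.com"),
    400: ("/tacos", "tacos.internaldomain.com"),
}


def add_rule(rules, number, path, target):
    """Add a rule, rejecting a rule number that is already taken."""
    if number in rules:
        raise ValueError(f"rule number {number} already exists")
    rules[number] = (path, target)


def ordered_paths(rules):
    """Paths in evaluation order (ascending rule number)."""
    return [rules[number][0] for number in sorted(rules)]


# Hypothetical new route slotted between /pizza (300) and /tacos (400).
add_rule(rules, 350, "/burgers", "burgers.internaldomain.com")
```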

thibaultcha commented 9 years ago

On another note, maybe the resolvers too could be plugins. We call the current resolver (by Host header) a core-plugin, since it is identifying an API in the system from a user's request. One resolver could work with the Host header, and another with paths.

Btw, it is worth noting that this is also close to how NGINX natively supports proxying, even if less flexible than what you are describing @DavidTPate:

location /match/this {
    proxy_pass http://example.com;
}
# A request sent to `/match/this/auth` will be sent to upstream as `/match/this/auth`

location /match/this {
    proxy_pass http://example.com/new/path;
}
# A request sent to `/match/this/auth` will be sent to upstream as `/new/path/auth`
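
As a rough illustration of the nginx rule at play here: when proxy_pass carries no URI part, the request URI is forwarded unchanged, and when it carries one, that URI replaces the part of the request that matched the location. The Python sketch below emulates only this simplified rule for prefix locations; `upstream_uri` is a hypothetical helper, not an nginx API.

```python
from urllib.parse import urlsplit


def upstream_uri(location, proxy_pass, request_uri):
    """Emulate the prefix-location rewrite performed by proxy_pass (simplified)."""
    uri_part = urlsplit(proxy_pass).path
    if uri_part == "":
        # No URI part in proxy_pass: the request URI is forwarded unchanged.
        return request_uri
    # URI part present: it replaces the portion that matched the location.
    return uri_part + request_uri[len(location):]
```

Under this rule a bare trailing slash counts as a URI part, so `proxy_pass http://example.com/;` would replace `/match/this` with `/` rather than leave the path untouched.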

thibaultcha commented 9 years ago

I was thinking about such a rule @DavidTPate. Here the interval between two routes is huge but to be 100% future proof, if one decides to insert a route at a value that already exists we could also "push" all values after the one being inserted.

I don't see another solution right now than this, but it seems decent. A UI can easily and nicely deal with this on top of the admin API.
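
One possible reading of the "push" idea, sketched in Python: if the requested value is already taken, every rule at or above it is shifted up by one before the new route is inserted. Shifting by one is an assumption made for the example, and `insert_with_push` is a hypothetical name.

```python
def insert_with_push(rules, number, route):
    """Return a new rule table with route at number, pushing later rules if needed."""
    if number in rules:
        # Collision: shift every rule at or after the slot up by one.
        shifted = {(n + 1 if n >= number else n): r for n, r in rules.items()}
    else:
        shifted = dict(rules)
    shifted[number] = route
    return shifted
```

Inserting at an unused value leaves the existing numbers alone, so the wide gaps in the example above would make the push a rare case.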

steinnes commented 9 years ago

I really like this idea. We at QuizUp developed our own nginx based routing solution where we route based on the request path (location).

Additionally, something I would be very interested in seeing (and potentially developing in this project) is support for registering backend nodes directly with Kong. This is to avoid having to rely on (potentially stale) DNS records for discovering backends. Having a complete list of backends (internal IPs usually) per microservice would also allow for more sophisticated load balancing algorithms to be employed (I am thinking least connection, and since there is the possibility of sharing state via cassandra, this makes a lot of sense to me!).
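
As a sketch of what least-connection balancing over registered backends could look like, assuming the gateway tracks a count of active connections per backend address (the function name and data shape are hypothetical):

```python
def pick_backend(active_connections):
    """Return the backend with the fewest active connections (ties broken by address)."""
    if not active_connections:
        raise ValueError("no backends registered")
    return min(active_connections, key=lambda b: (active_connections[b], b))
```

In a multi-node setup the counts would have to come from shared state, which is where the Cassandra idea above would come in.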

Awesome project btw :-)

thibaultcha commented 9 years ago

@steinnes Could what you described somewhat be related to #157? Load balancing your API with Kong?

steinnes commented 9 years ago

Absolutely similar. The #157 issue seems quite focused on how nginx does this, but I assume we could do this in two ways.

1. Deltas (i.e. add/remove a particular backend/upstream from an API):

curl -XPOST --url http://localhost:8001/apis/backends/add \
 --data 'name=mockbin' \
 --data 'upstream=10.0.0.99:1234'

or

curl -XPOST --url http://localhost:8001/apis/backends/rem \
 --data 'name=mockbin' \
 --data 'upstream=10.0.0.99:1234'

2. Complete overwriting of upstreams (basically a "set" operation):

curl -XPOST --url http://localhost:8001/apis/backends/set \
 --data 'name=mockbin' \
 --data 'upstreams=10.0.0.99:1234,10.0.0.88:4321,10.0.0.77:5678'
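
Behind such endpoints, the state could be as simple as a set of upstreams per API name. The following Python sketch mirrors the three proposed operations; the class and method names are hypothetical and none of this is an existing Kong API.

```python
class BackendRegistry:
    """In-memory map of API name -> set of upstream addresses."""

    def __init__(self):
        self._upstreams = {}

    def add(self, name, upstream):
        """Delta: register one upstream for an API."""
        self._upstreams.setdefault(name, set()).add(upstream)

    def remove(self, name, upstream):
        """Delta: unregister one upstream (no error if it is absent)."""
        self._upstreams.get(name, set()).discard(upstream)

    def replace(self, name, upstreams):
        """Overwrite the complete upstream list for an API (the "set" operation)."""
        self._upstreams[name] = set(upstreams)

    def upstreams(self, name):
        """Sorted list of upstreams currently registered for an API."""
        return sorted(self._upstreams.get(name, set()))
```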

Or whatever makes most sense. I just came across your project and immediately decided to start suggesting stuff -- but in my defence, if you guys like the ideas I wouldn't mind contributing :-)

tamizhgeek commented 9 years ago

:+1: We are using a home-built nginx routing to different upstreams based on the request path in the API. We replace the proxy_pass using a variable after matching the path pattern in a regex.

This will also help in having endpoint level rate limiting/throttling.

Would love to have this in kong. Will make our migration to kong much easier!

sonicaghi commented 9 years ago

+1

drabiter commented 9 years ago

+1

montanaflynn commented 9 years ago

+1

Here's a real example of where I do routing based on endpoints in nginx:

server {
  server_name img.apistatus.org;
  location /online {
    proxy_pass http://127.0.0.1:4445/;
  }
  location /status {
    proxy_pass http://127.0.0.1:4446/;
  }
}

thibaultcha commented 9 years ago

Before implementing this, a quick follow-up to see what we think about it.

We currently have a resolver that we can call the "host resolver". I shall refer to the resolver described by @DavidTPate as the "path resolver".

On the usability side:

It would be nice to be able to configure, for each API, whether its routing should happen by host or by path. Say the API now has 2 properties: public_dns (for the host resolver) and path (for the path resolver).

Can an API have both? Which resolver has the priority?
Do we make those resolvers plugins? Are they bundled into the core? At least one needs to be. Which one?

My 2 cents: separate those resolvers but keep them bundled into the core. Have an API use one or the other depending on which property is set. Refuse an API that uses both.
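
The "one or the other" rule could be enforced with a check along these lines. This is a Python sketch only: `resolver_for` is a hypothetical helper and the API is assumed to be a plain dict with the two properties named above.

```python
def resolver_for(api):
    """Return which resolver an API uses, refusing APIs that set both or neither."""
    has_host = bool(api.get("public_dns"))
    has_path = bool(api.get("path"))
    if has_host and has_path:
        raise ValueError("an API may set public_dns or path, not both")
    if not (has_host or has_path):
        raise ValueError("an API must set public_dns or path")
    return "host" if has_host else "path"
```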

On the implementation side:

A nice solution would be to now separate the properties used by the resolvers and the APIs:

Create a host_resolver table, which maps a public_dns to an API.
Create a path_resolver table, which maps a path to an API.
Remove the public_dns property from the current apis table.
Rename the target_url property on the apis table to upstream (bonus).

@thefosk @montanaflynn thoughts?

DavidTPate commented 9 years ago

That sounds like a good solution to me. I could see someone attempting to use both a "host resolver" and a "path resolver" but to me that screams of poor API design and I'm honestly not sure if even Nginx has the ability to do both (without duplication of configuration).

montanaflynn commented 9 years ago

@thibaultCha I would say to allow for both with only one or the other being required.

This way you can set up multiple APIs in one Kong install and still handle all the above use cases that @DavidTPate described. The ordering for how Kong would pick which one could be like this:

Matches both host and path
Matches host
Matches path
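
The ordering above could be sketched in Python as a ranking function, where a lower rank wins. The names and the dict-based API shape are hypothetical, and path matching is assumed to be a simple prefix check.

```python
def path_matches(prefix, path):
    """True if path equals prefix or continues it at a '/' boundary."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def rank(api, host, path):
    """0 = host and path, 1 = host only, 2 = path only, None = no match."""
    api_host, api_path = api.get("host"), api.get("path")
    host_ok = api_host is not None and api_host == host
    path_ok = api_path is not None and path_matches(api_path, path)
    if api_host is not None and api_path is not None:
        return 0 if host_ok and path_ok else None
    if api_host is not None:
        return 1 if host_ok else None
    if api_path is not None:
        return 2 if path_ok else None
    return None


def pick_api(apis, host, path):
    """Return the best-ranked matching API, or None if nothing matches."""
    ranked = [(rank(api, host, path), i) for i, api in enumerate(apis)]
    ranked = [(r, i) for r, i in ranked if r is not None]
    return apis[min(ranked)[1]] if ranked else None
```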

There's a good answer on stackoverflow about how nginx handles prioritizing paths.

Here's a bigger snippet of the nginx config I put above showing how I'm matching by host & path. Two paths are the same but lead to different outcomes dependent on the host. You'll also notice that I'm using regex in the path which is something that we should consider as well.

server {

  # Matches this host
  server_name img.apistatus.org;

  # And this path
  location /online {
    proxy_pass http://127.0.0.1:4445/apistatus/online;
  }

  # Or this path
  location /status {
    proxy_pass http://127.0.0.1:4446/apistatus/status;
  }

  # Or this path
  location /robots.txt {
    return 200 "User-agent: *\nAllow: /";
  }

}

server {

  # Matches this host
  server_name apistatus.org;

  # this path matches if none of the others do
  location / {
    root /usr/share/nginx/www/apistatus;
    index index.html;
  }

  # Or this path which is also defined above
  location /robots.txt {
    return 200 "User-agent: *\nDisallow: /";
  }

  # Or this path which uses regex
  location ~* \.(gif|jpg|jpeg)$ {
    rewrite ^/images/(.*)(png|jpg|gif)$ http://127.0.0.1:4447/images/$1$2 redirect;
    return 302;
  }

}

subnetmarco commented 9 years ago

Matches both host and path; Matches host; Matches path

@montanaflynn I agree with this.

melihmucuk commented 9 years ago

+1

alexkrauss commented 9 years ago

+1

rosskukulinski commented 9 years ago

this feature would be a requirement for us to adopt kong. (+1)

subnetmarco commented 9 years ago

I'd just like to say that this feature is coming in the 0.3.0 release, about 3-4 weeks from today.

DavidTPate commented 9 years ago

@thefosk Thanks for the quick response :+1:

sonicaghi commented 9 years ago

:tada:

thibaultcha commented 9 years ago

There are a lot of things we need to figure out before implementing this.

Requirements

1. API can be matched by Host.
2. API can be matched by Path.
3. API can be matched by Host + Path (higher priority over 1 and 2).
4. Paths should allow regexes.
5. An API can have 1 Host (as of now, assuming we're not changing that here).
6. An API can have multiple Paths, with a prioritisation system.
7. A Path can have a strip property (ignored here).

Schema 1

Considering this, the way we want to query Cassandra, and the need to keep our RESTful configuration capabilities, this is a potential model (and the most valid I could think of):

CREATE TABLE apis(
  id uuid,
  name text,
  PRIMARY KEY(id)
);

CREATE TABLE hosts(
  id uuid,
  api_id uuid, -- foreign to apis.id
  public_dns text,
  target_url text,
  PRIMARY KEY(id)
);

CREATE TABLE paths(
  id uuid,
  api_id uuid, -- foreign to apis.id
  host_id uuid, -- useful to require this Path to first match a Host (rule 3)
  listen_path text,
  priority int,
  target_url text,
  PRIMARY KEY(id, priority) -- priority allows us to ORDER BY a query, but I would probably rather do that in the application level
);

CREATE INDEX ON hosts(public_dns);
CREATE INDEX ON paths(listen_path);

This schema allows us to follow the requirements:

1. Query by Host and find (or not) an API
2. Query by Path and find (or not) an API
3. If a Path was found, check if it has a host_id
   - 3a If it has a host_id and it matches the one of the previously found Host(s), Path is valid -> redirect
   - 3b If it doesn't have a host_id or it doesn't match the previously found Host -> next
4. If a Host was found -> redirect
5. If nothing happened at this point -> drop

The problems here are:

1. We are making 2 queries to the DB per "non-cached-call" (one for Host and one for Path). This will be slower but we do have a database cache, so not significantly slower either.
2. If we want an API to have multiple Paths (/path1, /path1/overridden), this schema will force us to query all Paths from the DB to be able to compare them with the current URI. That is actually the case for all models except the presented schema 2.
3. If we want to support regexes in Paths, same: we need to query them all.
4. Standard foreign relations issues (not major).
   - 4a If a Host is deleted, we need to update all the Paths having it as a host_id.
   - 4b If an API is deleted, delete all related Hosts and Paths

Proposed workarounds

I see no way to fix 1., even with a different schema. That will be a drawback of having to support both resolvers all the time, which is why I advised against doing so. Having #15 would help but I don't see that coming anytime soon.
One way to fix 2. and 3. would be to have all the Paths in memory all the time. This could maybe be done, even though it will add complexity for sure. Maybe we could load the Paths at startup, and reload them if they are modified. For it to be efficient, the way they'll be stored in memory will matter (sorted by Host for ex). We would basically reinvent the wheel here, as this is what nginx already does via configuration as discussed here.
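
To give an idea of the in-memory approach, here is a minimal Python sketch: Paths are loaded once, grouped by Host, sorted by priority, and reloaded on demand. `PathCache` and `fetch_all` are hypothetical names, `fetch_all` stands in for whatever query returns all Path rows, and rows are assumed to be dicts with `host`, `listen_path` and `priority` keys.

```python
class PathCache:
    """Keeps all Paths in memory, grouped by Host and sorted by priority."""

    def __init__(self, fetch_all):
        self._fetch_all = fetch_all  # callable returning all Path rows
        self.reload()

    def reload(self):
        """Rebuild the cache; call at startup and whenever a Path is modified."""
        by_host = {}
        for row in self._fetch_all():
            by_host.setdefault(row.get("host"), []).append(row)
        for rows in by_host.values():
            rows.sort(key=lambda r: r["priority"])
        self._by_host = by_host

    def candidates(self, host):
        """Paths bound to this Host first, then Paths not bound to any Host."""
        specific = self._by_host.get(host, []) if host is not None else []
        return specific + self._by_host.get(None, [])
```

A request would then be compared against `candidates(host)` in order, which is still O(n) in the number of Paths for that Host.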

Schema 2

I also considered such a schema:

CREATE TYPE path(
  listen_path text,
  priority int,
  host text,
  target_url text
);

CREATE TYPE host(
  public_dns text,
  target_url text
);

CREATE TABLE apis(
  id uuid,
  name text,
  host host, -- one Host
  paths frozen<set<path>>, -- multiple Paths
  PRIMARY KEY(id)
);

But it raises more concerning problems:

Impossible to query a Path only by listen_path without knowing the other properties of a Path type in advance. I wasn't able to find a way to index Collections of UDTs in Cassandra.
The driver still doesn't support binary protocol v3 which I think is required anyways for UDTs. (jbochi/lua-resty-cassandra/pull/57)

From here, I see 2 solutions:

1. Revise downwards our expectations about this resolver (no regex matching, unique Path per API, no Host + Path matching)
2. Stick with the 1st model but eventually expect memory/performance/configuration drawbacks.

Migrations

Finally, another problem to consider is that almost any schema change will require a heavy migration. By heavy I mean moving data around, possibly by providing a script or something to migrate from the current apis table to any of the newly created tables. That means our migrations will not be able to do the job. We need something that:

Create the new tables
Move the data around
Delete the old tables

or

Create a new schema in a new instance
Move the data from the old instance
Reload Kong

All that should be done with users doing a backup of their data first. Kong is not 1.0 yet so I don't see handling that as a priority. Users should expect having to reconfigure their APIs if they want to upgrade.

subnetmarco commented 9 years ago

I have some feedback and questions.

Regarding the requirements, since we allow multiple paths we could also allow multiple Hosts? As with paths, an API would be matched if any of the Hosts matches.
I think it's important to support regular expressions to handle at least one scenario: many developers have multiple paths in the format of [A-Za-z0-9_\-]+\.myapi\.com, that match all the paths specific to a user like user1.myapi.com or user2.myapi.com, etc. If we decide that full regex support can't be implemented, or that it will take too long, it would be nice to at least cover this one scenario (which, from my experience with Mashape, covers 90% of the cases where a regex would be used). It can be done in many ways without introducing real regex support.
Regarding the schemas, can't we just add one more field to apis (similar to your second solution), and query it like:

SELECT id FROM apis WHERE public_dns = ? OR listen_path = ?;

or (if possible in Cassandra):

SELECT id FROM apis WHERE public_dns CONTAINS ? OR listen_path CONTAINS ?;

Not sure if there is any limitation with Cassandra if we do this.

Regarding the priority, I would say to remove it in the first implementation if that will allow us to use a simpler schema.
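
For the per-user scenario mentioned above (user1.myapi.com, user2.myapi.com), one way to cover it without general regex support would be a single leading wildcard. A Python sketch, where the `*.myapi.com` pattern syntax and the `wildcard_match` helper are assumptions made for the example:

```python
import re

# Characters allowed in the wildcard label, taken from the pattern quoted above.
LABEL = re.compile(r"[A-Za-z0-9_\-]+")


def wildcard_match(pattern, host):
    """Match host against an exact name or a '*.domain' pattern (one leading label)."""
    if not pattern.startswith("*."):
        return pattern == host
    suffix = pattern[1:]  # e.g. ".myapi.com"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return LABEL.fullmatch(label) is not None
```

The regex here only validates the single wildcard label; user-supplied patterns are never compiled.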

thibaultcha commented 9 years ago

we could also allow multiple Hosts

Yeah, I thought about it, but it brings a lot of configuration headaches: one could have 2 hosts and 2 paths that only validate as Path A + Host A and Path B + Host B, and that can get extremely confusing very fast. By having 1 Host and X Paths, we respect the nginx behaviour as shown in the examples in this thread. I think allowing multiple Hosts does more harm than anything.

I think it's important to support regular expressions

See my point about supporting it: it means everything will have to be in memory and the routing will be O(n), because we need to compare a path against every configured Path. Also your example is a Host? Even if we support 1 Host per API, same, we would need to have every Host in memory too. (See the conclusion about that)

Regarding the schemas, can't we just add one more field to apis and query it like:

Cassandra does not have support for such an OR.

Regarding the priority, I would say to remove it in the first implementation if that will allow us to use a simpler schema.

If we drop this we absolutely cannot have multiple Paths per API like @DavidTPate described it (/auth/login and /auth would overlap). Ex: if one sets a listen_path to: /pizza/, another to /pizza/hello and queries /pizza/hello/world, which listen_path gets applied? We can't know without a priority value, or having them ordered as an array.

Our problems are:

1. Supporting both Host and Path by default for all APIs: double Cassandra querying.
2. Supporting regex in Path or Host: everything will need to be in memory, O(n).
3. Supporting multiple Paths per API: see the example above. We do need a priority property.
4. Supporting a Path with multiple parts: if one sets a listen_path to /pizza/api and queries /pizza/api/hello/world, from the code's POV, am I supposed to query Cassandra with /pizza/, /pizza/api/, /pizza/api/hello/ or /pizza/api/hello/world/? That is why we need to support something like "starts with or strict" modes, or just regex in a first version.
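
The multi-part Path problem can be illustrated with a short Python sketch that enumerates the candidate prefixes, longest first, and checks each against the known listen paths. The helper names are hypothetical, and listen paths are assumed to be stored with a trailing slash.

```python
def candidate_prefixes(path):
    """All prefixes of path at '/' boundaries, longest first, with a trailing slash."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) + "/" for i in range(len(parts), 0, -1)]


def lookup(path, known_listen_paths):
    """Return the longest known listen path that prefixes path, else None."""
    for candidate in candidate_prefixes(path):
        if candidate in known_listen_paths:
            return candidate
    return None
```

Against Cassandra, each candidate would be its own exact-match query, which is another argument for holding the Paths in memory.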

To conclude, if we want to stick with those requirements, and fix 1, 2, 3, 4, I think we have no choice but to load the Host(s) (plural if we decide to support many Hosts, but that brings configuration concerns as mentioned) and Paths in memory, and somehow reload them when they get modified. After all, it is what nginx is doing too, except you don't expect a configuration file to have tens of APIs, whereas you can expect Kong to have such a number. Schema 2 or equivalent would be valid in that case.

sonicaghi commented 9 years ago

Keep one host for this version. Simpler.

subnetmarco commented 9 years ago

Just brainstorming here, but another option would be having only one property called matchers or patterns (or a better name) that contains either the DNS or the path. The table would look like:

CREATE TABLE IF NOT EXISTS apis(
  id uuid,
  name text,
  matchers set<text>,
  target_url text,
  created_at timestamp,
  PRIMARY KEY (id)
);

We could support multiple DNS and multiple paths in one field:

SELECT * FROM apis WHERE matchers CONTAINS 'something.com' ALLOW FILTERING;

or

SELECT * FROM apis WHERE matchers CONTAINS '/hello/world' ALLOW FILTERING;

This won't fix the two-queries problem because I think Cassandra doesn't support SELECT statements to search for multiple values in a field (I might be wrong):

SELECT * FROM apis WHERE matchers IN ('something.com', '/hello/world') ALLOW FILTERING;

thibaultcha commented 9 years ago

First implementation drafted in #282. It only supports 1 path per API. Since supporting all the requested features means a lot of rewritten code, I opted for breaking down the implementation into 2 parts:

Supporting 1 path per API and start implementing an in-memory resolver (#282)
From there we'll add support for multiple paths, with priority and regex.

thibaultcha commented 9 years ago

Closing this and adding support for multiple path/multiple hosts in one of the upcoming releases. Thank you all!
