nickzuber commented 7 years ago

ClusterRunner API Versioning

Overview

The ClusterRunner API currently doesn't implement any kind of versioning, which limits us from making any kinds of enhancements or improvements to any existing API features in the future. For example, one of the changes we'd like to make is to implement pagination by default. This kind of a change would cause issues for anyone currently consuming the API, which is why we would need to release this kind of update on a new version.

There are two main approaches for handling API versioning, which are basically just two different places we can specify the version:

URI
Media Types

In general, having the API version in the URI is a simpler approach to implement but using media types is more academically accepted.

Approach: URI

Using the URI to specify the API version is really simple and it's something that ClusterRunner currently simulates:

/v1/build/1/subjob/1/atom/0/console

With the current way we build routes in our API, and considering that we already simulate a version in our URI, making this change would be really straightforward.

Our routes are structured like this:

RouteNode(r'parent', _SomeParentHandler, 'atom').add_children([
    RouteNode(r'child', _SomeChildHandler)
])

The regex that corresponds to an endpoint is constructed from the parent routes' regex. Each time we have a capturing groups in the regex, the values captured will be passed in as parameters for the handler's HTTP method. This means if we wanted to add the version to the URI, we could capture the version in the route and handle it that way.

RouteNode(r'v(\d+)', _APIVersionHandler, 'atom').add_children([
    RouteNode(r'example', _ExampleHandler)
])

class _SomeChildHandler(_ClusterMasterBaseAPIHandler):
    def get(self, api_version):
        # Do things based on api_version
        response = { ... }
        self.write(response)

Pros

Easy & simple to implement in our current API
Obvious which version is being used
Simple to test

Cons

Same resources can be selected with different URIs
Goes against HTTP specifications
Upgrading API can be tedious

Approach: Media Types

Putting the API version in a media type in request headers is generally the more accepted way to handle versioning since a request URI is described as a Uniform Resource Identifier that identifies the resource upon which to apply the request.

Requests/reponses generally are structured something like this:

===>
GET /v1/build/ HTTP/1.1
Accept: application/vnd.clusterrunner.v1+json

<===
HTTP/1.1 200 OK
Content-Type: application/vnd.clusterrunner.v1+json

There are a few different ways to store the version number in the media type:

application/vnd.clusterrunner.v1+json
application/json; version=1
application/json; vnd.clusterrunner+v1

Most of the examples I've seen use the format application/vnd.clusterrunner.v1+json, and also it seems like this format is the best option since it best follows the JSON API specification guidelines.

The vnd prefix is a media type standard for any media types associated with publically available products. For future reference, it seems like these media types might need to be officially registered, but I had a hard time finding specific and recent resources on this topic so it might be irrelevant now; just something to keep in mind.

Capturing the version would be pretty straightforward as well. We would first just need to get the header and its value, then parse it for the version. It could look something like:

import re

class APIVersionHandler(object):
    API_VERSION = 2

    VENDOR_MEDIA_TYPE_GROUP = 'application'
    VENDOR_MEDIA_TYPE_SUBGROUP = 'vnd.clusterrunner'

    @staticmethod
    def getVersion(value, uri):
        reMatches = re.match('.*({}/{}.v(\d+)\+json).*'.format(
            VENDOR_MEDIA_TYPE_GROUP, 
            VENDOR_MEDIA_TYPE_SUBGROUP), value, re.I)
        try:
            return int(reMatches.group(2))
        except IndexError:
            # Didn't capture expected groups
        except ValueError:
            # Didn't capture a number
        return self._defaultAPIVersion(uri)

    @staticmethod
    def _defaultAPIVersion(uri):
        # If `v1` is in URI return 1 else API_VERSION

content_type_value = self.request.headers.get('Content-Type')
api_version = APIVersionHandler.getVersion(content_type_value, uri)

Pros

Maintains integrity in the URI
Is the most correct way to version according to HTTP specs
Easy to upgrade API to latest version

Cons

Not as trivial to test
Figuring out what version you're consuming is more complicated
We already have a lot of infrastructure build for versioning in the URI

Other Things to Address

Default Version

When a version isn't specified, we need to decide which version of the API to provide. We really have two ways of handling this:

Provide the first version
Provide the latest version

Serving the first version by default wouldn't be ideal since it makes more sense to only provide older versions of the API if specifically requested. Whenever somebody requests a resource, it's fair to expect the most recent version of that resource by default.

The downsides to this is for this initial change, anybody that's currently consuming the API would get updated automatically to the newest version after this is implemented (before they're able to react to the changes). However, this can be avoided if we keep the v1 in the URI as an exception that routes to the first version of the API if a version isn't already specified in the media types.

Handling the Existing URI

All of our ClusterRunner endpoints and routes already have a faux version number in their URI, so we'd need to handle that if we choose to use media types for our API versioning. We have a few different ways for handling this:

Remove all instances of v1 from the URIs and use only media types
Keep all instances of v1 from the URIs and make a permanent part of the URI
- Specify version with media types regardless and effectively ignore the v1
Keep any instances of v1 from the URIs and make it optional to include
- Specify version with media types but if v1 is found in the URI, only serve first version of API
- Specify version with media types regardless and effectively ignore the v1
- Specify version with media types but if none are found and there's v1 in URI, serve first version of API

Trying to remove all instances of v1 from our URIs is effectively a huge breaking change. This means any process already consuming the API will effectively hit 404s and stop working correctly until they themselves manually remove all instances of v1 from their endpoints. Doing this wouldn't be a great idea since our API would not be backwards compatible with anything that's already using it, and backwards compatibility is something we want to help mitigate the amount of breaking changes.

Permanently keeping the v1 in the URI wouldn't really make sense either since it would just cause confusion since it doesn't describe the resource that's being provided nor does it specify the version of the API. We should deprecate the v1 in the URI and effectively make it optional in our routes. That way, if somebody is already consuming the API and has the old v1 in their endpoint, they'd still be able to specify the API version with media types and not have to worry about removing the v1 if they don't want to.

We should use media types for versioning whenever they're provided. If the version is not provided through media types, then we should serve the most recent version of the API, unless there's v1 in the URI, then we should serve the first version of the API. This is to help counter as many breaking changes as we can. Anyone currently consuming the API that has v1 in the URI shouldn't get automatically upgraded to the most recent version of the API unless they want to and are ready to.

Some examples of behavior we'd want:

GET /v1/build/path/to/resource HTTP/1.1
Accept: application/vnd.clusterrunner.v2+json
--> API Version 2

GET /v1/build/path/to/resource HTTP/1.1
Accept: application/json
--> API Version 1

GET /build/path/to/resource HTTP/1.1
Accept: application/vnd.clusterrunner.v2+json
--> API Version 2

GET build/path/to/resource HTTP/1.1
Accept: application/json
--> API Most Recent Version

Proposed Design

Even though our ClusterRunner API is currently designed with a faux version in the URI, it'd be worthwhile to switch to versioning in the request headers primarily because it's a more accurate way of describing the resource being provided. Media types are designed for describing data and formats in the response, whereas the URI should represent the actual data. We want to avoid mixing up the resource identification with the resource representation, which would contradict HTTP specifications for what a URI should represent.

We should control our API version through the Accept/Content-Type request/response headers and handle the version after we capture it from the header.

If no version is specified within the header & there is no v1 in the URI we default to the most recent version of the API.
If no version is specified within the header & there is a v1 in the URI we default to the first version of the API.
If a version is specified within the header & there is no v1 in the URI we use the version specified in the media type of the API.
If a version is specified within the header & there is v1 in the URI we use the version specified in the media type of the API.

Update

For changing the name of a route path, we can have some dict that assigns a route name with the API version and we select the correct one before we build the route. We could probably come up with a nicer and more uniform way of doing this, but here's a very rough idea of what it could look like:

If we keep that in v2 API will there be changes necessary to keep it working correctly (i.e., only show child routes within the v1 API when getting responses for v1, and similarly for v2)?

We shouldn't need to change anything here if we build the routes according to the API version, rather than accepting all possible routes (after we change names etc.) then try to figure out which route is valid per version.

builds_path = {
    VERSION_1: 'build',
    VERSION_2: 'builds'
}.get(api_version, 'builds')

RouteNode([r'{}'.format(builds_path), _BuildsHandler, 'builds').add_children([
    # ...
])

Edit

It might be hard to be able to identify the version before we build the routes, since the application needs to have some routes already before any request can even be sent. So I propose we could adjust the RouteNode class to take in a list of dicts which define a path and the versions that are valid. This then can be passed into the handler (which knows the version of the API when it gets a request) and can be handled appropriately. It could possibly look something like:

# ...
RouteNode([{name: r'build', start: 1, end: 1},
           {name: r'builds', start: 2, end: latest_version}
          ], _BuildsHandler, 'builds').add_children([
              # ...

To deprecate the v1 we can adjust the RouteNode class to handle "optional" routes. This would just be a flag on a RouteNode which makes the route's regex path a non capturing optional group. Something like:

RouteNode(r'v1', _APIVersionOneHandler, optional=True)

class RouteNode(object):
    """
    A tree data structure for representing the parts of a url path, for routing.
    """
    def __init__(self, regex_part, handler, label=None, optional=False):
        # ...
        self.optional = optional

    def regex(self):
        """
        The route's regex, used to register this route with the web framework
        :rtype: str
        """
        ancestor_regex_parts = []
        for ancestor in list(reversed(self.ancestors())):
            ancestor_regex = ancestor.regex_part.rstrip('/') + '/'
            if ancestor.optional:
                # Non capturing so we don't accidentally pass this as a parameter. 
                ancestor_regex = '(?:' + ancestor.regex_part.rstrip('/') + '/)?'
            ancestor_regex_parts.append(ancestor_regex)

        return r''.join(ancestor_regex_parts + [self.regex_part]).rstrip('/') + '/?'

wjdhollow commented 7 years ago

Thanks for presenting both sides of the argument and providing links to documentation. 👍 to versioning request headers.

ethomas2 commented 7 years ago

Sorry for not realizing this until now, but i'm starting to think it doesn't really make sense to introduce versioning into cluster runner just to support pagination. To introduce pagination it seems really easy to just add flags to the /v1/builds endpoint, something like /v1/builds?start=n&range=m. My issue with a versioned api is that it's effectively a promise that we're freezing the current behavior of cluster runner and we're gonna support it forever. Now everytime I make a change to cluster runner, I have to add an if statement "If version=1, do the old behavior, else do the new behavior". It sounds like it would be pain to maintain and lead to some complicated code. Correct me if i've misunderstood the point of a versioned api.

As an example, I was just talking to @josephharrington,@nadeem about changing the behavior of cluster runner so that when a slave gracefully disconnects, the master stops reporting that slave in the /slaves endpoint. If we introduce versioning, we'd have to continue reporting the old dead slave in the v1 /slaves endpoint. That check is inelegant and complicated. I also don't want to have to keep these old slaves around in memory just to support an old version. If cluster runner was a super mature service that we didn't expect to make many more changes to, then maybe i'd have a different opinion, but we're not even to the 0.9 milestone. I want to be free to move quickly and break things.

Another thing that makes versioning not really make sense imo is that we already do releases of cluster runner every time we make a commit, which is effectively a form of versioning. Other companies are already in complete control over what version they're using, because they can just build the rpm for the version they want. This is a different model most web companies (eg twitter). Twitter has to version its api because when they make a breaking change they will actually break every single consumer of twitter. Unlike twitter, we can tell users of cluster runner "if some breaking change we made messes you up, just don't upgrade to the most recent version".

josephharrington commented 7 years ago

@ethomas2 There are other changes that justify versioning the API beyond just pagination. One simple one is that I'd like API resource names to be plural (e.g., /builds/<id> instead of build/<id>). Another is #282.

Even for pagination, yes it is true that we could add it to the v1 API with query params, but that doesn't solve the problem that having an endpoint (GET /build/) that can lock up the application for a long amount of time while it returns a huge amount of data is a design flaw. That endpoint should not allow unpaginated requests.

My issue with a versioned api is that it's effectively a promise that we're freezing the current behavior of cluster runner and we're gonna support it forever.

I don't agree with that. Versioning an API is a migration strategy. It allows us to make breaking changes without breaking clients. We can then deprecate older versions of the API.

when a slave gracefully disconnects, the master stops reporting that slave in the /slaves endpoint. If we introduce versioning, we'd have to continue reporting the old dead slave in the v1 /slaves endpoint.

A benefit of API versioning is that we get to define what we consider breaking API changes. In that example I don't consider that a breaking API change, I consider it a bug fix.

Another thing that makes versioning not really make sense imo is that we already do releases of cluster runner every time we make a commit, which is effectively a form of versioning.

The difference between an application version and an API version is important. I should be able to upgrade a service without breaking the automation that I've built on top of it. Users should be able to upgrade ClusterRunner to get things like bug fixes and new features without needing to update all their API client logic (at least until the point where we've deprecated an API).

josephharrington commented 7 years ago

@nickzuber Evan's comment actually leads into my main point of feedback -- this all looks great. I'd also love to see your ideas for how implementing endpoints in the new API version would look. For example, would we start out by mapping back to the old implementations and then replace those with new implementations as desired?

nickzuber commented 7 years ago

@ethomas2

To introduce pagination it seems really easy to just add flags to the /v1/builds endpoint, something like /v1/builds?start=n&range=m.

I originally brought up the same point saying that pagination isn't a breaking change if we just add query variables in the URI, however @nadeemahmad told me that we want to have pagination be a default behavior with default start/range values -- and that would be a breaking change.

Now everytime I make a change to cluster runner, I have to add an if statement "If version=1, do the old behavior, else do the new behavior".

I was thinking that the version would only really come into play if there was different behavior between versions specifically. So you'd only have to check for the version if you needed to. But yeah it would probably be more complicated to maintain without a good way of handling the different versions well.

Another thing that makes versioning not really make sense imo is that we already do releases of cluster runner every time we make a commit, which is effectively a form of versioning.

Isn't this more of ClusterRunner versioning rather than just the API? I don't think it'd be a good idea to tightly couple these two together -- if we make a breaking change with the API but also a huge improvement with ClusterRunner itself, consumers should be able to upgrade ClusterRunner without necessarily having to upgrade their usage of the API.

ethomas2 commented 7 years ago

The other api fixes @josephharrington mentioned convinced me that versioning would be nice. I thought we were only doing versioning for pagination, my argument doesn't really work for those 2 changes. I'm also convinced that we should separate the concept of versioning the API and versioning the service. But I'm still not convinced that versioning is necessary for pagination. If the /builds endpoint is unusable anyway, then it seems like we should just make a breaking change to paginate it by default on the v1 endpoint. Are we planning on making a change soon for which versioning is necessary? If so i'm all for it, otherwise this seems premature.

nickzuber commented 7 years ago

@josephharrington Something like that -- our goal should be to reuse as much code as possible and avoid repeating ourselves among the different API versions. I think the tricky part is gonna be finding that balance and an organized way to do it, maybe some sort of HoF or something.

cmcginty commented 7 years ago

Nice writeup!

@ethomas2, For your 3rd point, even if we took the approach of forcing clients to update to match versions (i.e. not being backward compatible), it still makes sense to provide at least some API versioning feature. For example, if a client plans to support v2 API, those changes can be integrated in the client and released before ClusterRunner master is upgraded. The client can then detect the master API version and behave accordingly. In other words, the CR design should not limit the client implementation if it wants to support multiple CR API versions, even if the CR releases will only support a single API version.

josephharrington commented 7 years ago

@nzuber

I think the tricky part is gonna be finding that balance and an organized way to do it

I think you are correct here. Because of that it might be good to include a bit more details around this in your proposal just to make sure everyone understands all the implications of the proposed design.

One good example to illustrate where it could get tricky is in renaming endpoints (e.g., from /build/<id>/ to /builds/<id>/). In the versioning approach where the version is embedded in the url (/v1/build/<id>/ vs /v2/builds/<id>/) then the API routes are easy to change between versions because they're literally different routes. But in the media type header approach, then each handler has to switch its behavior depending on the detected version. A few points that might be tricky:

I'd assume that /v1/build/<id> would return a 404 if the v2 API is requested, and similarly /builds/<id> might return a 404 under the v1 API. Like you said, we should share code wherever possible, but because of versioning the handler logic has to have different behaviors based on the requested version. So I'd be interested in what mechanism might handle switching behaviors between the two versions before entering shared logic.
We have some API discoverability implemented via the "child routes" field of responses. If we keep that in v2 API will there be changes necessary to keep it working correctly (i.e., only show child routes within the v1 API when getting responses for v1, and similarly for v2)?

nickzuber commented 7 years ago

@josephharrington Another tricky thing that I noticed when passing the version through as a header is that hitting an endpoint through the browser doesn't use our network class, rather it just writes from tornado RequestHandler, so it seems like we'd have to add headers in two different places within CR (in the network class and in the response header for our endpoints? I might be misunderstanding how some things work here, feel free to correct me).

For handling the route names, I think we could adjust the RouteNode class to take a list of regex and version tuples. This way we can build our routes based on the version of the API being consumed. A route could default to "any" version of the API if no specific version is specified in the tuple.

Also, I was thinking again we could alter the RouteNode class to have the option of marking routes as "optional" (each route won't be optional by default). We could implement this with regex as well.

I'll update my design doc with more specific implementation details

josephharrington commented 7 years ago

hitting an endpoint through the browser doesn't use our network class

That is true for any incoming API request to ClusterRunner whether it's through a browser request, curl, or from another service. The Network class is only for requests that the service makes (in other words, outgoing requests).

The only reason we'd need to add headers to the Network class is if we want communication between the master and slaves to use the v2 api. I believe in your design, if we don't change anything about the Network class, then communication between master and slave will still use the v1 api because the routes would still include "/v1/".

nickzuber commented 7 years ago

@josephharrington I updated my doc to clarify some specific implementation details I had in mind

cmcginty commented 6 years ago

Can we close this?

box / ClusterRunner

Proposal for versioning the API #375

ClusterRunner API Versioning

Overview

Approach: URI

Pros

Cons

Approach: Media Types

Pros

Cons

Other Things to Address

Default Version

Handling the Existing URI

Proposed Design

Update

Edit