TykTechnologies / tyk

Tyk Open Source API Gateway written in Go, supporting REST, GraphQL, TCP and gRPC protocols

Tyk master #32

Closed camann9 closed 9 years ago

camann9 commented 9 years ago

This is basically a ticket to discuss ideas about the structuring of Tyk so we don't have to keep polluting #23 :-P

Thanks for the extensive reply. Would you mind drawing up a diagram, similar to the one I drew, that details the current communication paths (including the implicit ones through databases)? I think it would really help in understanding the current architecture. I dumped the PowerPoint slide with my diagram to http://s000.tinyupload.com/index.php?file_id=87456586263021545868

Two more questions: Is there a system behind what information is stored where? In Redis I see tyk-admin-api-XXX and apikey-XXX, which are both keys. In Mongo I see tyk_analytics_users, tyk_apis and tyk_organisations. Am I correct in assuming that metadata is stored in Mongo and keys in Redis? Why did you decide to use two database backends?

lonelycode commented 9 years ago

Here's the architecture of a Tyk setup (roughly), the reload signal is a fanout signal that hits multiple host managers:

Tyk Architecture

To answer your questions:

  1. tyk-admin-api-XXX - this is a user session in the dashboard, not used by Tyk at all.
  2. apikey-XXX - this is an API key that will give access to a stored API (depending on settings stored under that key)

The things you see in Mongo are mainly related to the dashboard.

We use two backends because Tyk can work entirely without Mongo (you could purge the analytics to CSV if you like), and we needed something to manage the dashboard and Mongo had better features for that.

Mongo was chosen for the dashboard because of the built-in aggregation framework in v2.2+. This makes aggregating data into useful analytics much easier, as data pipelines can be built and handled in Mongo instead of being crunched by the dashboard. It makes things much cleaner in terms of code and flexibility and puts less stress on the application servers.
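For a flavour of what that looks like, a pipeline roughly like the one below can group raw request records by API and count hits entirely inside Mongo (a sketch using the official Go driver; the collection and field names are illustrative, not the dashboard's actual schema):

```go
// Sketch only: collection name, field names and driver version are assumptions,
// not the dashboard's actual schema.
package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("tyk_analytics").Collection("analytics")

	// Group raw request records by API ID and count hits inside Mongo,
	// instead of pulling every record into the dashboard process.
	pipeline := mongo.Pipeline{
		bson.D{{Key: "$group", Value: bson.D{
			{Key: "_id", Value: "$apiid"},
			{Key: "hits", Value: bson.D{{Key: "$sum", Value: 1}}},
		}}},
	}

	cur, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Fatal(err)
	}
	defer cur.Close(ctx)

	var results []bson.M
	if err := cur.All(ctx, &results); err != nil {
		log.Fatal(err)
	}
	fmt.Println(results)
}
```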

We decided from the start that Mongo is not a dependency for using the core gateway; you only need it if you want a GUI and an easy way to view and filter analytics data. It's a value-add, not a requirement: Tyk can work entirely with only Redis.

This means that all functional data is either stored in a file (Definitions) or in Redis (Keys). Also worth noting is that the Session and Authentication handlers use separate storage interfaces, so strictly speaking you could swap out Redis for something else on an API-by-API basis if the correct interfaces are implemented. Loosely speaking, then, even Redis isn't a requirement, so long as you can build a new Storage interface driver, which I imagine some forks may be doing.
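To make that concrete, here is a minimal sketch of the shape such a pluggable key-store interface could take; the method set is illustrative only, not the real Tyk storage interface:

```go
// Illustrative only: not the real Tyk storage interface, just the shape of one.
package storage

// KeyStore is the kind of contract a per-API session/auth store could satisfy.
type KeyStore interface {
	GetKey(keyName string) (string, error)                 // fetch a stored session/key blob
	SetKey(keyName, value string, ttlSeconds int64) error  // store with an expiry
	DeleteKey(keyName string) bool
	GetKeys(filter string) []string                        // list keys matching a prefix
}

// A Redis-backed driver and an in-memory driver would both implement KeyStore,
// so each API definition could choose its store independently.
```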

As discussed in the other ticket, API Definitions are file-based because ultimately that is the simplest and most robust approach for an infrastructure service (see Nginx, Redis and Apache, all file-based configurations). The versions stored in Mongo are in fact almost exactly the same, except for some additional metadata for the dashboard (an API Definition is a JSON document, which makes portability between Mongo and file really easy and painless). All the metadata fields are completely ignored by Tyk; they exist only so that the dashboard can do clever things like portable webhooks and event handlers.

Both projects (Tyk and the Dashboard) share the tykcommon package, this defines the APIDefinition object so that they are always compatible with one another, even if one is upgraded faster than the other.

camann9 commented 9 years ago

Thanks again, the support for Tyk is truly awesome :-) . I think the diagram is very enlightening so maybe you want to put it into the official documentation.

So to summarize, the key data and the request journal are stored in Redis and everything else is stored in Mongo.

In light of my newly acquired knowledge I understand what you were talking about in #23 . What about actually putting the metadata (except for aggregated analytics) into Redis? Then Redis would be the master store for API definitions and organisations. If Redis is updated by one node, the nodes attached to it reload (say they poll every minute, or when the manual reload is triggered). Then they don't need to know about each other; they just synchronize via Redis and it doesn't matter how many there are. We also don't get a new SPOF, since the nodes depend on an available Redis instance anyway to access the sessions.
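As a rough sketch of that idea (the key pattern, polling interval and go-redis client are assumptions for illustration, not anything Tyk does today), each node could simply re-read definitions from Redis on a timer:

```go
// Sketch only: the tyk.definitions.* key pattern, the one-minute interval and
// the go-redis client are assumptions, not current Tyk behaviour.
package poller

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// pollDefinitions re-reads all API definitions from Redis on a timer and hands
// them to a reload callback that rebuilds the in-memory API list.
func pollDefinitions(ctx context.Context, rdb *redis.Client, reload func([]string)) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			keys, err := rdb.Keys(ctx, "tyk.definitions.*").Result()
			if err != nil {
				log.Printf("definition poll failed: %v", err)
				continue
			}
			defs := make([]string, 0, len(keys))
			for _, k := range keys {
				if raw, err := rdb.Get(ctx, k).Result(); err == nil {
					defs = append(defs, raw)
				}
			}
			reload(defs) // rebuild the in-memory API list from the fresh definitions
		}
	}
}
```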

lonelycode commented 9 years ago

I may do that :-) Would probably clear up some things for our users...

As for putting everything in Redis, this is an option. However, as I mentioned, API Definitions can actually set which session stores and auth handlers they want to use on a per-API basis. Although Redis is the only supported store at the moment (there is a deprecated in-memory version too), it's a simple interface that could easily be ported to other k/v stores (or DBs, for that matter).

It would mean that Redis becomes an overarching requirement for system management as opposed to a per-implementation dependency, which is what keeps the current code quite flexible (I hope).

For example, if we decided to shift to BoltDB or Riak or Tokyo Tyrant instead of Redis, it would be very easy as we simply implement the Storage{} interface and register it in the right code hooks.

There's the Raft algorithm (goraft) which might be worth looking at to sync up API configurations across a cluster. That makes having an API endpoint sensible again: you could just query the cluster for the leader, then make a REST API call to push the new Definition; it would flush to file on the leader, which would then replicate to all the nodes organically.

This same functionality could then be used to push hot-reloads across a cluster without polling individual nodes, and to ensure that if things fail there's always a leader to step up. You could even incorporate some functionality to auto-start more nodes using a webhook or some event handler (the infrastructure is already there for that).

On the other hand, sticking it all in Redis really would be so amazingly simple. I'm of two minds on it; there are a lot of advantages to having the nodes sync up and elect leaders.

Putting the definition into Redis would mean the dashboard needs to speak to Mongo and Redis, and there would be two copies of the definition floating around, which could introduce drift.

I'm just thinking out loud here... sorry for the rant :-/

I will have a play around with Raft to see how much of a pain it would be to implement well; a POC would set things straight and make it a bit clearer. If there's no existing implementation that we can use, there's no way we're rolling our own, so the Redis option will be what we end up with.

camann9 commented 9 years ago

I still think that the distributed master election is a bit over-engineered. AFAIK Redis satisfies all the requirements we have.

You could still define interfaces that encapsulate the storage functionality: one interface called ApiStorage with a Redis and a file backend, and one interface called MetricsStorage with a CSV and a MongoDB backend. What do you think?
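Something like the sketch below, for instance; the interface names come from the proposal above, while the method sets and the AnalyticsRecord placeholder are just guesses at what each backend pair (Redis/file, CSV/Mongo) would need:

```go
package storage

import "time"

// ApiStorage abstracts where API definitions live (Redis backend, file backend, ...).
type ApiStorage interface {
	LoadDefinitions() ([]string, error)             // raw JSON API definitions
	SaveDefinition(id string, rawJSON string) error
}

// MetricsStorage abstracts where analytics go (CSV backend, MongoDB backend, ...).
type MetricsStorage interface {
	Record(rec AnalyticsRecord) error // buffer a single request record
	Purge() error                     // flush buffered records to the backend
}

// AnalyticsRecord is a placeholder for whatever Tyk captures per request.
type AnalyticsRecord struct {
	APIID     string
	Timestamp time.Time
	LatencyMS int64
}
```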

lonelycode commented 9 years ago

Interesting, more thoughts (sorry, this is a lengthy one):

Let's turn this on its head a little and start with the desired behaviour:

From an end-user perspective, setting up a single tyk node, or booting up twenty should be seamless, and require little to no configuration on my part. In a typical cloud environment it should be assumed that the node will fail, shut down or be arbitrarily rebooted for maintenance, this should not affect the performance of the application or the cluster and I should still be able to administer it remotely using the API.

Having Tyk nodes own their own configuration management in the above scenario creates the following problems:

  1. Which node "owns" the configuration, and how is this selected? We don't have high-concurrency requirements, as API defs change relatively slowly; the risk of "stale" data is low as configurations are pushed out across the nodes. It should also work with 1, 2 and n nodes.
  2. When I make a request into the cluster, how do I know which one to send the request to? I would assume any node that replied would tell me which node was the current master, and then I could re-run my request there.
  3. How does a single-node install become clustered? Ideal scenario: just boot a new box with the same configuration and they mesh.

A potential implementation:

  1. Introduce a clustered mode, set in the configuration of a Tyk node (tyk.conf)
  2. Clustered mode meta options include a Redis store configuration, separate from the key store (it may use a different DB or a different server altogether for this; tuning a long-term storage Redis DB vs. an ephemeral key store would be a requirement for many end-users, and it also reduces SPOF)
  3. When a tyk node starts up:
    • It grabs all the keys with the prefix tyk.node.id.*, then sorts them by their value (a random integer between 1 and n)
    • The node with the smallest randInt() is the master; it stores this value locally so it can respond to the /tyk/master request
    • It creates an ID for itself and creates a key in Redis that looks like this: tyk.node.id.{{IP}}.{{Port}}: randInt() with a TTL of 20s and a floor of the lowest value in the list (so it appends itself); if it is the only node, the floor is 0.
    • It starts a goroutine which resets the TTL of its ID key to 10s; this runs every 5s
      • This goroutine will also regularly pull the key list and re-do the sort to update its view of the master ID
    • It creates a pub/sub subscription to a tyk.nodes.reload channel; if it receives a message, it triggers a hot reload (see the sketch after this list)
  4. When a tyk node fails, or is destroyed:
    • The node will stop updating the TTL, so the key will expire
    • The other nodes will update their master keys within 10s of the key expiring
  5. Tyk pulls its API definitions from the Redis keyspace tyk.definitions.* and loads them into memory
  6. A write request to /tyk/apis on the current master will overwrite the key in redis
  7. To reload the nodes, a new API request /tyk/actions/reload (or something similar) on the master will send a trigger message via the redis pub/sub channel, only the master may write to this channel
  8. If a machine has detected it is the master, it will activate its purge loop (actually, the purge loop will already be running, but will skip execution if it isn't the master)
    • A change to master state should trigger a system event so reactions can be scripted (e.g. boot another instance - event data should include most current instance list.)
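A minimal sketch of steps 3 and 7 above, assuming the go-redis client and the key/channel names from the proposal (everything else is illustrative):

```go
package cluster

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// heartbeat keeps this node's ID key alive; if the process dies, the key
// expires and the other nodes drop it from their sorted list.
func heartbeat(ctx context.Context, rdb *redis.Client, nodeKey string) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			rdb.Expire(ctx, nodeKey, 10*time.Second) // reset the TTL to 10s every 5s
		}
	}
}

// listenForReload subscribes to the reload channel; only the master publishes to it.
func listenForReload(ctx context.Context, rdb *redis.Client, hotReload func()) {
	sub := rdb.Subscribe(ctx, "tyk.nodes.reload")
	defer sub.Close()
	for range sub.Channel() {
		hotReload() // re-pull tyk.definitions.* and rebuild the in-memory API list
	}
}
```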

So, what does this system mean in the worst-case scenario?

What does this enable? Looking at the above list of things we want to achieve:

From an end-user perspective, setting up a single tyk node, or booting up twenty should be seamless, and require little to no configuration on my part. In a typical cloud environment it should be assumed that the node will fail, shut down or be arbitrarily rebooted for maintenance, this should not affect the performance of the application or the cluster and I should still be able to administer it remotely using the API. - :+1:

And from the benefits of a self-managing cluster:

And above all, we can now store API Definitions centrally in redis, trusting that they are managed by only one node, removing the requirement for MongoDB altogether (and the host-manager for that matter).

Thoughts?

camann9 commented 9 years ago

Thanks for the long reply :-)

I still have some problems with the idea of a master election. For one, I could not just go to any node to add APIs; I would have to go to one node that tells me who the master is, and then I can query the master. This complicates things for clients. The problem is also that the Tyk master may not be reachable from the outside if there is a firewall/LB between the client and the Tyk master. The LB only allows us to send our query to an arbitrary node (which might not be the master), not to a specific one. I would be happier if the Tyk node would just forward the query to the master (if you really want to go with the master/slave method).

With the solution I proposed you wouldn't have any downtime or addressing problems because every node has the same rights to write to Redis. The question is really whether API definitions are more special than keys and why we should treat them differently. Why should writing API defs be restricted to one node but not writing keys?

About your annotations to my proposal:

As you see, I'm not a fan of the whole "master" thing. To me it doesn't provide any benefits compared to synchronizing via Redis. Every Tyk node has to have the code to write to Redis since any one might become master, and careful synchronization is necessary anyway. So why make it more complicated than necessary? The only thing a dynamically elected master would bring is better ways of synchronizing, but I think that's not worth all the trouble since it can also be achieved differently (as described).

lonelycode commented 9 years ago

I agree with you. I think this whole discussion has become a little academic, and it really doesn't need to be; it boils down to:

So the action from this, really, is to document the Dashboard API - which I really should have done a long time ago.

If we were to add an endpoint to update or add an API definition, it should just flush to disk, and the integrator would need to manually update all hosts (not too hard, it's a loop through all running nodes), followed by an API call to all running nodes to reload.
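That loop is easy to script; here's a sketch, assuming a /tyk/reload admin endpoint protected by an X-Tyk-Authorization header (treat both as assumptions here rather than documented API):

```go
// Sketch only: node URLs, the /tyk/reload path and the X-Tyk-Authorization
// header are assumptions about how an admin reload call would look.
package admin

import (
	"log"
	"net/http"
	"time"
)

// reloadAll asks every running gateway node to hot-reload its API definitions.
func reloadAll(nodes []string, secret string) {
	client := &http.Client{Timeout: 5 * time.Second}
	for _, host := range nodes {
		req, err := http.NewRequest(http.MethodGet, host+"/tyk/reload", nil)
		if err != nil {
			log.Printf("bad node URL %s: %v", host, err)
			continue
		}
		req.Header.Set("X-Tyk-Authorization", secret) // gateway admin secret from tyk.conf
		resp, err := client.Do(req)
		if err != nil {
			log.Printf("reload failed on %s: %v", host, err)
			continue
		}
		resp.Body.Close()
	}
}
```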

That would be the simplest thing to do.

Then if an integrator would rather have a centrally managed service, they can use the (soon to be documented) dashboard API. Let me explain why the API that ships with the dashboard is better:

You get this hierarchy in the Dashboard API:

The web app is basically a REST client to the dashboard API. Since it's meant as a C&C API, it has much more functionality than the main Tyk one does.

So basically: this needs documentation, and potentially the API endpoint needs fleshing out to flush to disk.

:-D

camann9 commented 9 years ago

The thing with making the dashboard the master is that people would have to license the dashboard to be able to administer Tyk. I like the idea of making the dashboard for-pay and the node free. Every sane person in an enterprise environment would of course license the dashboard instead of building it themselves (at least with the current pricing). But coupling the dashboard GUI to the administration API and licensing them together is not an idea I like. Why not leave the administration API open-source and free and just license the actual dashboard? Of course this is a business decision rather than a technological decision :-)

lonelycode commented 9 years ago

It's an idea worth considering, to be honest it felt quite awful crippling software like this.

The dashboard is actually just a large API server plus a separate web app, so you never need to touch the dashboard UI; you could just run the application and use the API directly. The license is actually for the dashboard and the finer-grained management API.

Implementing an API endpoint that will flush a configuration to disk is a compromise I'm quite happy with, it's actually the only bit of functionality that is exclusively in the dashboard. So that's one element that we'll put on the roadmap and get out with the next release.
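As a sketch of how small that endpoint could be (the route, app folder path and use of an api_id field are illustrative assumptions), it would accept a definition as JSON, check it parses, and write it into the gateway's app folder:

```go
// Sketch only: the route, the /opt/tyk/apps folder and the api_id field are
// illustrative assumptions, not the final endpoint design.
package admin

import (
	"encoding/json"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// flushDefinitionHandler accepts an API definition as JSON and writes it into
// the gateway's app folder so a hot reload will pick it up.
func flushDefinitionHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "could not read body", http.StatusBadRequest)
		return
	}
	// Check the payload is at least well-formed JSON before persisting it.
	var def map[string]interface{}
	if err := json.Unmarshal(body, &def); err != nil {
		http.Error(w, "invalid API definition JSON", http.StatusBadRequest)
		return
	}
	name, _ := def["api_id"].(string)
	if name == "" {
		name = "unnamed"
	}
	path := filepath.Join("/opt/tyk/apps", name+".json")
	if err := os.WriteFile(path, body, 0644); err != nil {
		http.Error(w, "could not flush definition to disk", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```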

Regarding lifting restrictions on the dashboard API, we'll have a think, we're quite focussed on getting more people to use the software so it's a bit more... complicated.

:-)

lonelycode commented 9 years ago

Closing this ticket. We're going to go with adding a REST and file-based flush to Tyk for now, reverting this back to ticket #23 and tagging as appropriate.