DemocracyClub / aggregator-api

https://developers.democracyclub.org.uk/

deal with persistent state #9

Open chris48s opened 5 years ago

chris48s commented 5 years ago

API keys:

At the moment we are handing out API keys and accepting an API key with queries, but we're not tracking usage, and new users can't sign up for a key themselves. We want to add the ability for users to create an account and sign up for an API key, track how many times a key has been used, apply rate limits, and so on. Currently we have no persistence at all (by design), so we need to add some kind of data store to allow us to store users/tokens.
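To make the shape of this concrete, here's a minimal sketch of the "users/tokens" store described above. Everything here is hypothetical (field names, the plain-dict backing store) — in practice this would sit on whichever datastore gets picked below:

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Account:
    """One row of the single table this issue describes (names hypothetical)."""
    email: str
    api_key: str = field(default_factory=lambda: secrets.token_urlsafe(32))
    request_count: int = 0


def sign_up(accounts, email):
    """Create an account and hand back its freshly generated key."""
    account = Account(email=email)
    accounts[account.api_key] = account
    return account.api_key


def authenticate(accounts, api_key):
    """Look up a key and count the request; None means an unknown key."""
    account = accounts.get(api_key)
    if account is not None:
        account.request_count += 1
    return account


# usage: sign up, then validate the key on each request
accounts = {}
key = sign_up(accounts, "user@example.com")
account = authenticate(accounts, key)   # known key: counted and returned
unknown = authenticate(accounts, "bad") # unknown key: returns None
```

The point of the sketch is how little state there is: one record type, one lookup per request, one counter bump.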

Our ambition for this project is to keep it fairly thin in this respect, so we don't want to store a lot of data - for the most part we want upstream APIs to handle data model/storage. It's best thought of as one table. 100 user accounts is probably ambitious, but we'll check/validate keys on every request, so the store could see a lot of traffic (although we can assume our entire data model comfortably fits in memory on the most modest hardware). One of the nice things about devs.DC is that it runs on Lambda, so we don't have to worry about scaling the app itself.

In terms of tracking usage/rate limits/etc, we use Redis/ElastiCache for this on WDIV and I'd envisage replicating that approach here.
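The WDIV implementation isn't shown in this thread, but the usual Redis pattern for this is a fixed-window counter built from INCR and EXPIRE. Here's a sketch of that pattern with an in-memory stand-in for the two Redis commands, so it runs without a server (with a real client, `FakeRedis` would just be a `redis.Redis` instance):

```python
import time


class FakeRedis:
    """In-memory stand-in for the two Redis commands this sketch needs."""

    def __init__(self):
        self.store = {}  # key -> (counter value, expiry timestamp or None)

    def incr(self, key):
        value, expires_at = self.store.get(key, (0, None))
        if expires_at is not None and expires_at <= time.time():
            value, expires_at = 0, None  # window elapsed: start fresh
        value += 1
        self.store[key] = (value, expires_at)
        return value

    def expire(self, key, seconds):
        value, _ = self.store.get(key, (0, None))
        self.store[key] = (value, time.time() + seconds)


def allow_request(client, api_key, limit=100, window=60):
    """Fixed-window rate limit: at most `limit` calls per `window` seconds."""
    counter = f"throttle:{api_key}"
    count = client.incr(counter)
    if count == 1:
        client.expire(counter, window)  # first hit starts a new window
    return count <= limit


client = FakeRedis()
results = [allow_request(client, "abc123", limit=3, window=60) for _ in range(5)]
print(results)  # → [True, True, True, False, False]
```

The `limit`/`window` values here are made up for illustration; whatever limits the project adopts would be per-key config.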

Options:

  1. Use RDS
     Pros: easy, familiar, we know exactly how to use it because we're using it everywhere, plays nicely with the Django ORM
     Cons: expensive to run a dedicated instance all the time / overkill for one or two small tables?; requires downtime to scale it up/down; another thing to scale/bottleneck on (could we solve a lot of this with a cache, though? Does that just turn it into a cache invalidation problem?)

  2. Use DynamoDB: https://aws.amazon.com/dynamodb/ (talk to it via boto, not the Django ORM)
     Pros: serverless, so cheap when we're not using it, but if Facebook shoves a link to us into every UK user's timeline while Chris is asleep, it will handle it via magic
     Cons: not used it before / learning curve; probably a PITA to set up a local dev environment
     Relevant blog: https://read.acloud.guru/why-amazon-dynamodb-isnt-for-everyone-and-how-to-decide-when-it-s-for-you-aefc52ea9476

  3. Store the users data in Redis
     Pros: really fast / probably not going to bottleneck on it; we need to set it up to store usage for keys/rate limit tracking/etc anyway
     Cons: worried about data loss. Redis is a cache, not a datastore. You can persist, but it's not ACID-compliant. Is the time we'd spend bashing Redis into doing persistence better spent elsewhere?
     Relevant blog: https://muut.com/blog/technology/redis-as-primary-datastore-wtf.html

  4. Crazy homebrew solution (e.g. store the users table in a JSON file on S3)
     Pros/cons: depends on the nature of the crazy homebrew solution

  5. Amazon API Gateway: https://aws.amazon.com/api-gateway/
     TODO: research this to find out what it does / whether it solves this problem / what the pros and cons are
     Cons: likely requires a BC break for existing users
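For a sense of how small option 4 could be: since the note above says the entire data model comfortably fits in memory, the whole "table" could be one JSON document loaded on cold start. A sketch with hypothetical field names, leaving out the actual S3 fetch/put:

```python
import json

# Hypothetical shape of the whole users table as one JSON document.
# Per option 4 this blob would live in a file on S3 and be fetched on
# cold start; here it's just an inline string so the sketch runs.
USERS_JSON = json.dumps({
    "abc123": {"email": "user@example.com", "rate_limit": 100},
    "def456": {"email": "other@example.com", "rate_limit": 50},
})


def load_users(raw):
    """Parse the JSON blob into an in-memory lookup table."""
    return json.loads(raw)


def lookup_key(users, api_key):
    """Return the account for a key, or None for an unknown key."""
    return users.get(api_key)


users = load_users(USERS_JSON)
account = lookup_key(users, "abc123")  # dict for known keys, None otherwise
```

Reads are trivially fast with this shape; the homebrew pain would all be on the write path (signup, usage counters), which is presumably where the "depends on the nature of the crazy homebrew solution" caveat bites.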

Secondary consideration - logs/usage stats:

We'd also like to store more usage/logging data. At the moment we're mostly relying on cloudwatch logs and delegating stats to upstream APIs. Some of these solutions obviously don't lend themselves well to storing additional usage/log data, but others might be good for that (e.g: RDS). Happy to pick the right solution for API key signup and continue to deal with usage tracking as a log interrogation problem.

chris48s commented 5 years ago

another possibility: https://aws.amazon.com/blogs/aws/amazon-aurora-postgresql-serverless-now-generally-available/

symroe commented 2 years ago

Update on this issue:

AWS Lambda ships a version of SQLite that's not supported by Django, meaning our previous quick win of having "no" database became complex, or meant we had to stick with an unsupported Django version.

For other reasons, we have prod and stage RDS instances that are meant for low-use applications (the website, other ad hoc projects). We've moved to using these RDS instances (with their own Postgres database) for this project, but at the moment the project isn't using the database to serve postcode lookup traffic (it might be touched on Django start-up, but basically it's not a scaling issue).

Because of this, we could consider this issue closed. I'm leaving it open for now though, as we wouldn't want to add things like API keys that make DB requests without considering how this would scale.

chris48s commented 2 years ago

You may (or may not) find this is touching the DB more than you think. Django's default session engine is the DB, so you might actually hit the DB in the context of a request even if there are no explicit calls to request.session: https://github.com/DemocracyClub/aggregator-api/blob/d5be0a6e31491f9d95b38a6f2a239e9805528a9d/aggregator/settings/base.py#L38 I've done no research on this, but maybe check that.
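If the session engine did turn out to be hitting the DB, one option (a suggestion, not something the project does) would be Django's signed-cookie session backend, which keeps session state client-side and needs no session table at all:

```python
# settings sketch: signed-cookie sessions store the session data in the
# cookie itself, signed with SECRET_KEY - no database table required.
# Trade-offs: a cookie size limit, and the session contents are readable
# (though not forgeable) by the client.
SESSION_ENGINE = "django.contrib.sessions.backends.signed_cookies"
```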

I remember when I first set this project up, I did try to get Django running with no DB at all and failed, but I have more recently seen a DB-less Django config working. This blog post has an example: https://adamj.eu/tech/2020/10/15/a-single-file-rest-api-in-django/ so it miiiight be worth revisiting that, although I dunno if you could strip it back to a DB-less config and retain 100% of the functionality this project now has. Maybe not worth the yak shave if you plan to actually store data any time soon, though.

symroe commented 2 years ago

> Django's default session engine is DB

Good thought, and the middleware is installed, but django.contrib.sessions isn't in INSTALLED_APPS, so it's doing nothing.

I've checked this with django-extensions' `runserver_plus --print-sql`: none of the views or API endpoints make any database queries :tada: