DFurnes closed this pull request 4 years ago.
> It will update existing links or clicks if run multiple times, so it's safe to re-run before and after we swap production traffic.
Does this mean that it will update but not duplicate already existing links/clicks?
> Does this mean that it will update but not duplicate already existing links/clicks?
Correct! If it's, say, re-importing `dosome.click/na9wqk`, it'd just overwrite the record with `{ key: "na9wqk" }` for the link & any click records based on their UUID (from the `click_id` column).
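To make that "update, don't duplicate" behaviour concrete, here's a minimal sketch using the aws-sdk v2 `DocumentClient` – the table and attribute names are hypothetical, not the script's exact shape:

```js
// Sketch of the idempotent write, assuming the aws-sdk v2 DocumentClient.
// Table and attribute names here are placeholders, not the script's real schema.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// `put` replaces any existing item with the same primary key, so re-importing
// a link overwrites its record rather than creating a duplicate.
const upsertLink = (link) =>
  dynamo
    .put({ TableName: 'bertly-links', Item: { key: link.key, url: link.url } })
    .promise();

// Clicks work the same way, keyed on the UUID from the `click_id` column.
const upsertClick = (click) =>
  dynamo
    .put({ TableName: 'bertly-clicks', Item: { click_id: click.click_id, key: click.shortened } })
    .promise();
```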
I checked in to see how this script was doing on QA: over the past 24 hours or so, we'd only migrated 74,294 shortlinks. 🐌 That seemed awfully sluggish (especially considering that we have millions of links to migrate).
After dropping a few `console.time` loggers through the script, it was clear that we were spending too much time in PostgreSQL – multiple seconds per link. This was because, despite what this migration might argue, we were missing an index on `shortened` in production!
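The instrumentation was nothing fancy – roughly this shape, assuming node-postgres, with a stand-in query rather than whatever the script actually runs per link:

```js
// Rough sketch of the console.time instrumentation; the query below is a
// stand-in, assuming node-postgres ("pg"), not the script's exact helper.
const { Pool } = require('pg');
const pool = new Pool();

async function clicksFor(key) {
  console.time(`clicks:${key}`);
  const { rows } = await pool.query('SELECT * FROM clicks WHERE shortened = $1', [key]);
  console.timeEnd(`clicks:${key}`); // logs elapsed time – multiple seconds per link, pre-index
  return rows;
}
```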
I created this index by hand to see if we could speed things up a bit:
```sql
CREATE INDEX shortened_index ON clicks (shortened);
```
And so far so good – after just 30 minutes w/ the index, we've already migrated 72,753 links! 🐎
Okay – made a few more changes after letting this run a few times over the weekend:
🔄 I swapped from `ioredis` to `redis` because it lets us set our cursor for the `SCAN` operation (so we can restart if necessary). For readability, I wrote a helper generator function that handles cursor logic. (Unlike the library that I took inspiration from, this code uses a `for` loop rather than recursion so we don't overflow Node's call stack – there's a rough sketch of this helper just after this list.)
↔️ I split the script into separate `migrate-links` and `migrate-clicks` scripts. This let me simplify each script's logic a bit, but it also saves a lot of time: most links have zero clicks (and we have a lot of links), so we can skip those `SELECT` queries entirely for click-less links.
⏱️ I removed unnecessary `await`s from the script (so we just continually fire off requests to DynamoDB and S3, rather than waiting for each request to complete sequentially). This allowed us to increase the script's throughput from 209,500 links/hour to 1,445,556 links/hour. (This change is sketched below as well.)
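The cursor helper looks roughly like this – a sketch assuming node-redis v4's promise-based `scan` (the script may target an older client version), with placeholder names:

```js
// Sketch of the SCAN cursor helper, assuming node-redis v4's promise API;
// the helper name and key pattern are placeholders.
async function* scanKeys(client, pattern, startCursor = 0) {
  // A plain `for` loop rather than recursion, so scanning millions of keys
  // can't overflow Node's call stack.
  for (let cursor = startCursor; ; ) {
    const reply = await client.scan(cursor, { MATCH: pattern, COUNT: 1000 });

    for (const key of reply.keys) {
      yield key;
    }

    cursor = reply.cursor;
    if (cursor === 0) return; // SCAN hands back cursor 0 once the iteration is complete
  }
}

// Because the start cursor is an argument, a crashed run can pick up from the
// last cursor we logged instead of starting over:
//   for await (const key of scanKeys(client, 'bertly:*', resumeCursor)) { … }
```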
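And the throughput change amounts to not awaiting each write before starting the next – roughly this shape, with hypothetical helpers rather than the script's exact code:

```js
// Before: each DynamoDB/S3 request completes before the next one begins.
for (const link of batch) {
  await upsertLink(link); // the event loop sits idle while we wait on the network
}

// After: fire off every request in the batch, then wait on them together, so
// requests to DynamoDB and S3 are in flight continuously.
await Promise.all(batch.map((link) => upsertLink(link)));
```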
@DFurnes I'd already pre-approved this (you have excellent Code Credit Rating 💳 ), but having looked at the latest changes, consider your PR re-approved!
:shipit:
What's this PR do?
This pull request adds a script to migrate our existing shortlinks (in Redis) and clicks (in PostgreSQL) into DynamoDB/S3. It will update existing links or clicks if run multiple times, so it's safe to re-run before and after we swap production traffic.
How should this be reviewed?
Give the script a look! Any edge cases I may have missed?
Any background context you want to provide?
I'm testing this out against QA on an EC2 instance that Sena very kindly provisioned for me. Once we're ready to make a production Bertly 2.0 instance, we can use the same box to run this there.
Relevant tickets
References Pivotal #165507221.
Checklist