DFurnes closed this pull request 4 years ago.
> It will update existing links or clicks if run multiple times, so it's safe to re-run before and after we swap production traffic.
Does this mean that it will update but not duplicate already existing links/clicks?
> Does this mean that it will update but not duplicate already existing links/clicks?
Correct! If it's, say, re-importing `dosome.click/na9wqk`, it'd just overwrite the record with `{ key: "na9wqk" }` for the link & any click records based on their UUID (from the `click_id` column).
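To make that "update, don't duplicate" behaviour concrete, here's a minimal sketch using the aws-sdk v2 `DocumentClient` – the table and attribute names are hypothetical, not the script's exact shape:

```js
// Sketch of the idempotent write, assuming the aws-sdk v2 DocumentClient.
// Table and attribute names here are placeholders, not the script's real schema.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// `put` replaces any existing item with the same primary key, so re-importing
// a link overwrites its record rather than creating a duplicate.
const upsertLink = (link) =>
  dynamo
    .put({ TableName: 'bertly-links', Item: { key: link.key, url: link.url } })
    .promise();

// Clicks work the same way, keyed on the UUID from the `click_id` column.
const upsertClick = (click) =>
  dynamo
    .put({ TableName: 'bertly-clicks', Item: { click_id: click.click_id, key: click.shortened } })
    .promise();
```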
I checked in to see how this script was doing on QA: over the past 24 hours or so, we'd only migrated 74,294 shortlinks. 🐌 That seemed awfully sluggish (especially considering that we have millions of links to migrate).
After dropping a few `console.time` loggers through the script, it was clear that we were spending too much time in PostgreSQL – multiple seconds per link. This was because, despite what this migration might argue, we were missing an index on `shortened` in production!
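The instrumentation was nothing fancy – roughly this shape, assuming node-postgres, with a stand-in query rather than whatever the script actually runs per link:

```js
// Rough sketch of the console.time instrumentation; the query below is a
// stand-in, assuming node-postgres ("pg"), not the script's exact helper.
const { Pool } = require('pg');
const pool = new Pool();

async function clicksFor(key) {
  console.time(`clicks:${key}`);
  const { rows } = await pool.query('SELECT * FROM clicks WHERE shortened = $1', [key]);
  console.timeEnd(`clicks:${key}`); // logs elapsed time – multiple seconds per link, pre-index
  return rows;
}
```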
I created this index by hand to see if we could speed things up a bit:
```sql
CREATE INDEX shortened_index ON clicks (shortened);
```
And so far so good – after just 30 minutes w/ the index, we've already migrated 72,753 links! 🐎
Okay – made a few more changes after letting this run a few times over the weekend:
🔄 I swapped from `ioredis` to `redis` because it lets us set our cursor for the `SCAN` operation (so we can restart if necessary). For readability, I wrote a helper generator function that handles cursor logic. (Unlike the library that I took inspiration from, this code uses a `for` loop rather than recursion so we don't overflow Node's call stack – there's a rough sketch of this helper just after this list.)
↔️ I split the script into separate `migrate-links` and `migrate-clicks` scripts. This let me simplify each script's logic a bit, but it also saves a lot of time: most links have zero clicks (and we have a lot of links), so we can skip those `SELECT` queries entirely for click-less links.
⏱️ I removed unnecessary `await`s from the script (so we just continually fire off requests to DynamoDB and S3, rather than waiting for each request to complete sequentially). This allowed us to increase the script's throughput from 209,500 links/hour to 1,445,556 links/hour. (This change is sketched below as well.)
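The cursor helper looks roughly like this – a sketch assuming node-redis v4's promise-based `scan` (the script may target an older client version), with placeholder names:

```js
// Sketch of the SCAN cursor helper, assuming node-redis v4's promise API;
// the helper name and key pattern are placeholders.
async function* scanKeys(client, pattern, startCursor = 0) {
  // A plain `for` loop rather than recursion, so scanning millions of keys
  // can't overflow Node's call stack.
  for (let cursor = startCursor; ; ) {
    const reply = await client.scan(cursor, { MATCH: pattern, COUNT: 1000 });

    for (const key of reply.keys) {
      yield key;
    }

    cursor = reply.cursor;
    if (cursor === 0) return; // SCAN hands back cursor 0 once the iteration is complete
  }
}

// Because the start cursor is an argument, a crashed run can pick up from the
// last cursor we logged instead of starting over:
//   for await (const key of scanKeys(client, 'bertly:*', resumeCursor)) { … }
```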
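And the throughput change amounts to not awaiting each write before starting the next – roughly this shape, with hypothetical helpers rather than the script's exact code:

```js
// Before: each DynamoDB/S3 request completes before the next one begins.
for (const link of batch) {
  await upsertLink(link); // the event loop sits idle while we wait on the network
}

// After: fire off every request in the batch, then wait on them together, so
// requests to DynamoDB and S3 are in flight continuously.
await Promise.all(batch.map((link) => upsertLink(link)));
```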
@DFurnes I'd already pre-approved this (you have excellent Code Credit Rating 💳 ), but having looked at the latest changes, consider your PR re-approved!
:shipit:
What's this PR do?
This pull request adds a script to migrate our existing shortlinks (in Redis) and clicks (in PostgreSQL) into DynamoDB/S3. It will update existing links or clicks if run multiple times, so it's safe to re-run before and after we swap production traffic.
How should this be reviewed?
Give the script a look! Any edge cases I may have missed?
Any background context you want to provide?
I'm testing this out against QA on an EC2 instance that Sena very kindly provisioned for me. Once we're ready to make a production Bertly 2.0 instance, we can use the same box to run this there.
Relevant tickets
References Pivotal #165507221.
Checklist