flexion / ef-cms

An Electronic Filing / Case Management System.

Reduce Time and Cost for Database migrations #10391

Open zachrog opened 1 month ago

zachrog commented 1 month ago

Pre-Conditions

Acceptance Criteria

Pain Avoided/Frustration Saved

Breadth/Pervasiveness of Problem

Every time a migration occurs, we need to spend hundreds of dollars to perform it, and it may take ~8 hours to complete.

Complexity of Problem (Low, Medium, High) and Why it's Complex

High complexity, as migrations can involve infrastructure, code changes, and deployment strategies.

Notes

codyseibert commented 2 days ago

Although this is a bit unrelated to the story, I do think it's important to understand migration benchmarks on a well-established database such as Postgres so we have a baseline to compare against when we talk about migrations in Dynamo. I know SQL can run fast based on what I've read online, but I didn't have metrics to confirm that assumption.

To verify how long a migration actually takes in SQL, I set up an AWS RDS instance running a single Postgres node (no Multi-AZ, smallest machine with 1,000 IOPS, 2 vCPUs, 8 GB RAM). After initialization, I created a docket_entries table with all the columns necessary to hold a single docket entry record. Next, I created 10,000,000 docket entries by running a single SQL statement, which took around 2 minutes.
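
For context, the setup was roughly the following. This is a sketch rather than the exact statements: the column list is trimmed down, and the bulk insert assumes generate_series plus gen_random_uuid() (Postgres 13+), run here through node-postgres.

import { Client } from 'pg';

async function seedDocketEntries() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // Trimmed-down table; the real experiment included every docket entry column.
  await client.query(`
    CREATE TABLE IF NOT EXISTS docket_entries (
      docket_entry_id uuid PRIMARY KEY,
      docket_number   text NOT NULL,
      document_type   text NOT NULL,
      filing_date     timestamptz,
      created_at      timestamptz DEFAULT now()
    )
  `);

  // Create 10,000,000 rows with a single INSERT ... SELECT over generate_series.
  await client.query(`
    INSERT INTO docket_entries (docket_entry_id, docket_number, document_type, filing_date)
    SELECT gen_random_uuid(), (n % 100000) || '-24', 'MISC', now()
    FROM generate_series(1, 10000000) AS n
  `);

  await client.end();
}

seedDocketEntries().catch(console.error);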

(screenshot: timing of the 10,000,000-row insert)

Next, I wanted to check how long it would take to count every single record in that table. Running a simple SELECT count(*) query took 2 seconds.

(screenshot: timing of the count query)

Next, I wanted to verify how long it would take to run a migration that updates every record in that table, changing document_type from 'MISC' to 'O'.

(screenshot: timing of the UPDATE migration)
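
The migration statement itself only appears in the screenshot; it is roughly the following (a sketch, using the same node-postgres setup as above).

import { Client } from 'pg';

async function migrateDocumentTypes() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // Worst case: rewrite (nearly) every row of the 10,000,000-row table in one statement.
  const result = await client.query(
    "UPDATE docket_entries SET document_type = 'O' WHERE document_type = 'MISC'",
  );
  console.log('rows updated:', result.rowCount);

  // A typical migration would use a narrower WHERE clause and touch only a
  // subset of rows, which is faster still.

  await client.end();
}

migrateDocumentTypes().catch(console.error);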

The point of this experiment was to verify my hunch that SQL can easily handle our dataset and would dramatically reduce the time to run a migration script that modifies all records in a table. Additionally, this was a worst-case-scenario migration; our migration scripts often update only a subset of rows, so an update with a WHERE clause would be even faster. This test also didn't include database indexes, which could affect write performance while improving read performance.

Given that a migration takes almost 2-3 hours in Dynamo, switching to Postgres may simplify a lot of our system. It is worth mentioning that running this Postgres instance was estimated at around $150 a month, so having a separate instance for each environment could mean much higher operational costs. If reducing costs is necessary, it is possible to host multiple schemas on the same Postgres database, or to keep multiple tables with a $ENV_ prefix inside one database.
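
A minimal sketch of the multi-schema option: each environment gets its own schema on a shared instance, cloned here from a template table. The helper and the names in it are hypothetical.

import { Client } from 'pg';

// Hypothetical helper: give each environment its own schema on a shared
// Postgres instance instead of a separately billed instance per environment.
async function provisionEnvironmentSchema(env: string) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // env comes from our own deploy config, not user input, so interpolating it is fine here.
  await client.query(`CREATE SCHEMA IF NOT EXISTS ${env}`);
  await client.query(
    `CREATE TABLE IF NOT EXISTS ${env}.docket_entries
       (LIKE public.docket_entries INCLUDING ALL)`,
  );

  await client.end();
}

provisionEnvironmentSchema('exp1').catch(console.error);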

codyseibert commented 6 hours ago

I have been experimenting with another approach: creating a GSI on the entityName and the sk of the document. It seems like we use the sk in many entities as the unique identifier, so this allows us to query for all the records of a certain type. To speed this up, one must run many concurrent Dynamo queries against the index using a begins_with clause. For example, if you want to query for all the cases, you need to make a query for each of the following:

entityName: 'Case', sk: case|10
entityName: 'Case', sk: case|20
entityName: 'Case', sk: case|30

etc.

To achieve this, I created the script below, which runs 100 queries concurrently and counts all of the results. It ends up using 9.2 GB of memory on my machine. Here is the script for future reference:

import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({
  region: 'us-east-1',
});

const TABLE_NAME = 'efcms-test-alpha';
const INDEX_NAME = 'entityName-sk-index';

// Page through the GSI for every record whose sk starts with `case|${pair}`.
async function queryPair(pair: string) {
  let results: any[] = [];
  let lastEvaluatedKey: { [key: string]: any } | undefined = undefined;
  console.log('fetching', pair);

  do {
    const command = new QueryCommand({
      ExclusiveStartKey: lastEvaluatedKey,
      ExpressionAttributeNames: {
        '#pk': 'entityName',
        '#sk': 'sk',
      },
      ExpressionAttributeValues: {
        ':pk': { S: 'Case' },
        ':prefix': { S: `case|${pair}` },
      },
      IndexName: INDEX_NAME,
      KeyConditionExpression: '#pk = :pk and begins_with(#sk, :prefix)',
      TableName: TABLE_NAME,
    });
    const response = await client.send(command);

    results = results.concat(response.Items || []);
    lastEvaluatedKey = response.LastEvaluatedKey;
  } while (lastEvaluatedKey);

  return results;
}

// Generate the 100 two-character prefixes '00' through '99'.
const generateUuidPairs = (): string[] => {
  const chars = '0123456789'; // append 'abcdef' as well when the sk suffix is a uuid
  const pairs: string[] = [];

  for (let i = 0; i < chars.length; i++) {
    for (let j = 0; j < chars.length; j++) {
      pairs.push(chars[i] + chars[j]);
    }
  }

  return pairs;
};

// Usage: run all 100 prefix queries concurrently and count the combined results
const uuidPairs = generateUuidPairs();

Promise.all(uuidPairs.map(queryPair))
  .then(results => {
    console.log(results.length);
    const flat = results.flat();
    console.log(flat.length);
  })
  .catch(console.error);

This script took 319 seconds to run and produced a count of 1,073,928, which also matched the index count in Elasticsearch. I think this approach would allow us to target certain subsets of our dataset and patch each subset much faster than our parallel scan approach, which checks everything. Also, it seems like doing all of this on a single powerful machine might be much faster than splitting the work across various Lambda functions.
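
For completeness, here is a sketch of what the patch step could look like on top of queryPair above: per-prefix UpdateItem calls, with the prefixes still running concurrently. The attribute being set (migrationFlag) is a placeholder, not an actual migration we need to run.

import {
  AttributeValue,
  DynamoDBClient,
  UpdateItemCommand,
} from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({ region: 'us-east-1' });
const TABLE_NAME = 'efcms-test-alpha';

// Patch one prefix's worth of query results. 'migrationFlag' is a stand-in
// attribute; a real migration would set whatever fields actually change.
async function patchItems(items: Record<string, AttributeValue>[]) {
  for (const item of items) {
    await client.send(
      new UpdateItemCommand({
        ExpressionAttributeValues: { ':val': { BOOL: true } },
        Key: { pk: item.pk, sk: item.sk },
        TableName: TABLE_NAME,
        UpdateExpression: 'SET migrationFlag = :val',
      }),
    );
  }
}

// Combined with the script above, each prefix can be queried and patched
// concurrently, e.g.:
// await Promise.all(uuidPairs.map(async pair => patchItems(await queryPair(pair))));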