TranslatorSRI / NodeNormalization

Service that produces Translator-compliant nodes given a CURIE

Fix issues with pushing to ITRB clusters #242

Closed: gaurav closed this issue 4 months ago

gaurav commented 6 months ago

This will need to be fixed in the Translator-Devops repo at https://github.com/helxplatform/translator-devops/issues/813

gaurav commented 5 months ago

Notes:

(Jan 12, 2024 at 4:53pm)

Aha, my bucket wasn't publicly accessible. I added the following to the Bucket permissions in JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::nodenorm-2023nov5-id-categories/id-categories.rdb"
    }
  ]
}
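
One way to script a check that the object is now publicly readable (a minimal sketch; it assumes the standard virtual-hosted S3 URL for the bucket and key named in the policy above, and is not part of the actual pipeline):

# Illustrative check only: HEAD the object via the standard virtual-hosted
# S3 URL (derived from the bucket/key in the policy above) and print the
# status and size, without downloading the whole multi-GB file.
import urllib.request

url = "https://nodenorm-2023nov5-id-categories.s3.amazonaws.com/id-categories.rdb"
request = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(request) as response:
    print(response.status, response.headers.get("Content-Length"))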

Confirmed that the file was now downloadable, and started a new Serverless ElastiCache at 5:54pm.

(Jan 12, 2024 at 5:57pm)

So... it looks like it's working.

gaurav commented 5 months ago

This is probably happening because of this issue: https://github.com/redis/redis/issues/6098

Options:

  1. Presplit the RDB file somehow using the hash slots (maybe if we split our cluster in the same way, we'll get the same slot assignments?)
  2. There's probably some way to connect rdbtools' ProtocolCallback with redis-cli's source code and create a piece of code that can read in an RDB file and write each key to the node that owns it.
  3. When creating a Redis cluster, there is an option for restoring from backup with "Path to a Redis RDB backup stored in Amazon S3 to seed your cache." So ITRB could potentially create new clusters for all the databases, restore them from backup on setup, wait for the restores to finish, switch over from the old clusters to the new ones, test, and then delete the old clusters.
    • Actually, pushing NodeNorm/NameRes files to Amazon S3 would probably improve the download speed significantly anyway.
  4. Literally just run the same job across all the master nodes. This will increase the number of jobs we need from 7 to 11, which isn't a huge increase, and should take about the same amount of time assuming the jobs all start at the same time. I can adapt Yaphet's existing code to do this (see the sketch at the end of this comment).

We can figure out the list of master nodes by running something like:

ubuntu@ip-172-31-26-53:~$ redis-cli --tls -h clustercfg.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 CLUSTER NODES | grep master
7028fec7d3b97480e91d51f47a1cba1664999087 test2-0001-001.test2.cq5uuk.use1.cache.amazonaws.com:6379@1122 myself,master - 0 1705011272000 2 connected 0-8191
5cf01b74b4728a3865b04c9dce46b879c5400439 test2-0002-001.test2.cq5uuk.use1.cache.amazonaws.com:6379@1122 master - 0 1705011272000 1 connected 8192-16383
ubuntu@ip-172-31-26-53:~$ redis-cli --tls -h clustercfg.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 CLUSTER NODES | grep master | cut -d' ' -f2 | cut -d':' -f1
test2-0001-001.test2.cq5uuk.use1.cache.amazonaws.com
test2-0002-001.test2.cq5uuk.use1.cache.amazonaws.com

I'm trying this out on EC2 by running:

$ rdb -c protocol id_to_type_db.rdb | redis-cli -c -h test2-0001-001.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 --pipe --tls
$ rdb -c protocol id_to_type_db.rdb | redis-cli -c -h test2-0002-001.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 --pipe --tls

So far so good. It is EXTRAORDINARILY SLOW (approaching 24 hours!) but that might just be because we have a tiny instance.
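
A rough sketch of what option 4 could look like once automated, gluing the master-discovery step and the load step above together (it assumes `rdb` from rdbtools and `redis-cli` are on PATH, and reuses the endpoint, port, and filename from the commands above; illustration only, not the actual DevOps job):

# Sketch: discover the master nodes of an ElastiCache cluster and pipe the
# RDB dump (converted to RESP protocol by rdbtools) into each one in turn.
import subprocess

CONFIG_ENDPOINT = "clustercfg.test2.cq5uuk.use1.cache.amazonaws.com"
RDB_FILE = "id_to_type_db.rdb"

def master_hosts(config_endpoint: str) -> list[str]:
    """Parse `CLUSTER NODES` output and return the hostnames of the master nodes."""
    out = subprocess.run(
        ["redis-cli", "--tls", "-h", config_endpoint, "-p", "6379", "CLUSTER", "NODES"],
        capture_output=True, text=True, check=True,
    ).stdout
    hosts = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        if "master" in fields[2].split(","):       # third field lists the node's flags
            hosts.append(fields[1].split(":")[0])  # "host:port@cport" -> host
    return hosts

for host in master_hosts(CONFIG_ENDPOINT):
    # Equivalent of: rdb -c protocol FILE | redis-cli -c -h HOST -p 6379 --pipe --tls
    rdb = subprocess.Popen(["rdb", "-c", "protocol", RDB_FILE], stdout=subprocess.PIPE)
    subprocess.run(
        ["redis-cli", "-c", "-h", host, "-p", "6379", "--pipe", "--tls"],
        stdin=rdb.stdout, check=True,
    )
    rdb.stdout.close()
    rdb.wait()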

gaurav commented 5 months ago

Huge success! ElastiCache was able to load all 434,397,820 keys into a Serverless ElastiCache Redis 7.1 database. The database was created on January 12, 2024 at 17:52:47 (UTC-05:00) and the restore from backup completed at 19:06:52 (UTC-05:00), so the load took 1h14m, approximately as long as expected.

I don't know if it makes financial sense to switch us over to a Serverless database instead of using the custom clusters, but I'm guessing... yes? Regardless, this restore should work for custom clusters too.

gaurav commented 4 months ago

This issue is now solved by the slow and inefficient method of uploading the data to all the nodes (i.e. all 400M+ entries are uploaded to node 1, then node 2, and so on). Our total load time is approximately 3 hours because we have to loop through all three databases.

We can probably improve this further by splitting the data by hash slot (i.e. coming up with a Python-based solution for https://github.com/redis/redis/issues/6098), but a better solution for our needs would be rolling Serverless updates: https://github.com/TranslatorSRI/NodeNormalization/issues/252. I'll open tickets for the hash-slot approach if ITRB doesn't want to do rolling Serverless updates, but not before.
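
For reference, the key-to-slot mapping such a pre-splitter would have to reproduce is CRC16(key) mod 16384, with the {hash tag} rule from the Redis Cluster spec. A minimal Python sketch of just that calculation (illustration only; the example key is a made-up CURIE and the slot ranges come from the CLUSTER NODES output earlier in this issue):

# Sketch: compute the Redis Cluster hash slot for a key, so keys from an RDB
# dump could be grouped by the master node that owns their slot.

def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the variant Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    """Return the cluster slot (0-16383) that owns `key`."""
    # Hash-tag rule: if the key contains "{...}" with a non-empty body,
    # only the body is hashed, so related keys can share a slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

# Example: decide which master owns a key, given the slot ranges reported by
# CLUSTER NODES above (0-8191 on node 0001, 8192-16383 on node 0002).
slot = key_slot(b"MONDO:0005148")
owner = "test2-0001-001" if slot <= 8191 else "test2-0002-001"
print(slot, owner)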