TranslatorSRI / NodeNormalization

Service that produces Translator-compliant nodes given a CURIE

Fix issues with pushing to ITRB clusters #242

Closed: gaurav closed this issue 4 months ago

gaurav commented 6 months ago

This will need to be fixed in the Translator-Devops repo at https://github.com/helxplatform/translator-devops/issues/813

gaurav commented 5 months ago

Notes:

(Jan 12, 2024 at 4:53pm)

Aha, my bucket wasn't publicly accessible. I added the following to the Bucket permissions in JSON:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::nodenorm-2023nov5-id-categories/id-categories.rdb"
    }
  ]
}
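
One way to script a check that the object is now publicly readable (a minimal sketch; it assumes the standard virtual-hosted S3 URL for the bucket and key named in the policy above, and is not part of the actual pipeline):

# Illustrative check only: HEAD the object via the standard virtual-hosted
# S3 URL (derived from the bucket/key in the policy above) and print the
# status and size, without downloading the whole multi-GB file.
import urllib.request

url = "https://nodenorm-2023nov5-id-categories.s3.amazonaws.com/id-categories.rdb"
request = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(request) as response:
    print(response.status, response.headers.get("Content-Length"))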

Confirmed that the file was now downloadable, and started a new Serverless ElastiCache at 5:54pm.

(Jan 12, 2024 at 5:57pm)

So... it looks like it's working.

gaurav commented 5 months ago

This is probably happening because of this issue: https://github.com/redis/redis/issues/6098

Options:

  1. Presplit the RDB file somehow using the hash slots (maybe if we split our cluster in the same way, we'll get the same slot assignments?)
  2. There's probably some way to connect rdbtools' ProtocolCallback with redis-cli's source code and create a piece of code that can read in an RDB file and write each key to the node that owns it.
  3. When creating a Redis cluster, there is an option for restoring from backup with "Path to a Redis RDB backup stored in Amazon S3 to seed your cache." So ITRB could potentially create new clusters for all the databases, restore them from backup on setup, wait for the restores to finish, switch over from the old clusters to the new ones, test, and then delete the old clusters.
    • Actually, pushing NodeNorm/NameRes files to Amazon S3 would probably improve the download speed significantly anyway.
  4. Literally just run the same job across all the master nodes. This will increase the number of jobs we need from 7 to 11, which isn't a huge increase, and should take about the same amount of time assuming the jobs all start at the same time. I can adapt Yaphet's existing code to do this (see the sketch at the end of this comment).

We can figure out the list of master nodes by running something like:

ubuntu@ip-172-31-26-53:~$ redis-cli --tls -h clustercfg.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 CLUSTER NODES | grep master
7028fec7d3b97480e91d51f47a1cba1664999087 test2-0001-001.test2.cq5uuk.use1.cache.amazonaws.com:6379@1122 myself,master - 0 1705011272000 2 connected 0-8191
5cf01b74b4728a3865b04c9dce46b879c5400439 test2-0002-001.test2.cq5uuk.use1.cache.amazonaws.com:6379@1122 master - 0 1705011272000 1 connected 8192-16383
ubuntu@ip-172-31-26-53:~$ redis-cli --tls -h clustercfg.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 CLUSTER NODES | grep master | cut -d' ' -f2 | cut -d':' -f1
test2-0001-001.test2.cq5uuk.use1.cache.amazonaws.com
test2-0002-001.test2.cq5uuk.use1.cache.amazonaws.com

I'm trying this out on EC2 by running:

$ rdb -c protocol id_to_type_db.rdb | redis-cli -c -h test2-0001-001.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 --pipe --tls
$ rdb -c protocol id_to_type_db.rdb | redis-cli -c -h test2-0002-001.test2.cq5uuk.use1.cache.amazonaws.com -p 6379 --pipe --tls

So far so good. It is EXTRAORDINARILY SLOW (approaching 24 hours!) but that might just be because we have a tiny instance.
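
A rough sketch of what option 4 could look like once automated, gluing the master-discovery step and the load step above together (it assumes `rdb` from rdbtools and `redis-cli` are on PATH, and reuses the endpoint, port, and filename from the commands above; illustration only, not the actual DevOps job):

# Sketch: discover the master nodes of an ElastiCache cluster and pipe the
# RDB dump (converted to RESP protocol by rdbtools) into each one in turn.
import subprocess

CONFIG_ENDPOINT = "clustercfg.test2.cq5uuk.use1.cache.amazonaws.com"
RDB_FILE = "id_to_type_db.rdb"

def master_hosts(config_endpoint: str) -> list[str]:
    """Parse `CLUSTER NODES` output and return the hostnames of the master nodes."""
    out = subprocess.run(
        ["redis-cli", "--tls", "-h", config_endpoint, "-p", "6379", "CLUSTER", "NODES"],
        capture_output=True, text=True, check=True,
    ).stdout
    hosts = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        if "master" in fields[2].split(","):       # third field lists the node's flags
            hosts.append(fields[1].split(":")[0])  # "host:port@cport" -> host
    return hosts

for host in master_hosts(CONFIG_ENDPOINT):
    # Equivalent of: rdb -c protocol FILE | redis-cli -c -h HOST -p 6379 --pipe --tls
    rdb = subprocess.Popen(["rdb", "-c", "protocol", RDB_FILE], stdout=subprocess.PIPE)
    subprocess.run(
        ["redis-cli", "-c", "-h", host, "-p", "6379", "--pipe", "--tls"],
        stdin=rdb.stdout, check=True,
    )
    rdb.stdout.close()
    rdb.wait()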

gaurav commented 5 months ago

Huge success! ElastiCache was able to load all 434,397,820 keys into a Serverless ElastiCache Redis 7.1 database. The database was created on January 12, 2024 at 17:52:47 (UTC-05:00) and the restore from backup completed at 19:06:52 (UTC-05:00), so the load took 1h14m, approximately as long as expected.

I don't know if it makes financial sense to switch us over to a Serverless database instead of using the custom clusters, but I'm guessing... yes? Regardless, this restore should work for custom clusters too.

gaurav commented 4 months ago

This issue is now solved by the slow and inefficient method of uploading the data to all the nodes (i.e. all 400M+ entries are uploaded to node 1, then node 2, and so on). Our total load time is approximately 3 hours because we have to loop through all three databases.

We can probably improve this further by splitting the data by hash slot (i.e. coming up with a Python-based solution for https://github.com/redis/redis/issues/6098), but a better solution for our needs would be rolling Serverless updates: https://github.com/TranslatorSRI/NodeNormalization/issues/252. I'll open tickets for the hash-slot approach if ITRB doesn't want to do rolling Serverless updates, but not before.
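
For reference, the key-to-slot mapping such a pre-splitter would have to reproduce is CRC16(key) mod 16384, with the {hash tag} rule from the Redis Cluster spec. A minimal Python sketch of just that calculation (illustration only; the example key is a made-up CURIE and the slot ranges come from the CLUSTER NODES output earlier in this issue):

# Sketch: compute the Redis Cluster hash slot for a key, so keys from an RDB
# dump could be grouped by the master node that owns their slot.

def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the variant Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    """Return the cluster slot (0-16383) that owns `key`."""
    # Hash-tag rule: if the key contains "{...}" with a non-empty body,
    # only the body is hashed, so related keys can share a slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384

# Example: decide which master owns a key, given the slot ranges reported by
# CLUSTER NODES above (0-8191 on node 0001, 8192-16383 on node 0002).
slot = key_slot(b"MONDO:0005148")
owner = "test2-0001-001" if slot <= 8191 else "test2-0002-001"
print(slot, owner)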