Prepare a SOP for making manual backups of the Redis graphs

gaurav commented 1 year ago

We've now seen several instances of a single Redis instance getting corrupted (e.g. #159), forcing us to delete all six Redis tables and reload all of them from scratch. One way to avoid this situation would be to back up all six Redis tables to disk and copy them over to Hatteras. That way, if we have a failure in both the primary and backup RENCI NodeNorm like we did on 2023-Jan-20, we will be able to restore the Redis instances from those backups rather than having to reload from the Babel files.

@YaphetKG also suggested that the problem might be that the Redis instances aren't writing their databases to disk properly -- if so, then backing them up might also cause the Redis instance to flush its contents to disk. Furthermore, we only need the Redis instances to be writeable while the loader is running -- once that's complete, we would prefer to put all the Redis instances into read-only mode somehow.

In the future, it might also be more efficient to set up the Redis instances on ITRB by transmitting the RDB files rather than our current strategy of starting jobs on ITRB to download Babel files from RENCI and load them into ITRB.

Steps needed:

- [x] Examine the current nn-redis-2022dec2 instances on translator-dev.
  - [x] Does the written file pass checking with redis-check-rdb?
  - [x] Is there some way to tell the Redis instances to flush their databases to disk?
  - [x] Can we copy these RDB files to Hatteras, then use them to start a new Redis instance in translator-exp without having to reload them?
- [x] Examine the current Redis persistence settings and see if they are appropriate for our needs.
- [x] Set up and load Redis instances on translator-exp with enough space to write out a separate backup file. Determine if this file is smaller than the RDB files.
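
For the persistence-settings step, a minimal sketch of how the relevant configuration could be read from one instance. It assumes a redis-py style client whose `config_get(pattern)` returns a dict; the function name `persistence_settings` and the choice of keys are illustrative, not something taken from our deployment.

```python
# Sketch only: `client` is assumed to be a redis-py style client
# (e.g. redis.Redis(host=..., port=...)) exposing config_get(pattern) -> dict.

# Redis settings that govern RDB/AOF persistence and where dump.rdb is written.
PERSISTENCE_KEYS = ("save", "appendonly", "dir", "dbfilename")


def persistence_settings(client):
    """Collect the persistence-related CONFIG GET values into one dict."""
    settings = {}
    for key in PERSISTENCE_KEYS:
        # config_get returns {name: value} for every setting matching `key`.
        settings.update(client.config_get(key))
    return settings
```

Running this against each of the six instances and comparing the results would show whether any of them has snapshotting disabled or writes somewhere other than /data.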

gaurav commented 1 year ago

If we want to keep using the RDB file, we would need to set up something like this:

For each Redis server:
- Send the server a BGSAVE SCHEDULE command.
- Keep polling the Redis server with LASTSAVE until the last save time changes.
- Use kubectl cp to copy the /data/dump.rdb file into Hatteras somewhere.
- Use redis-check-rdb to make sure the copied RDB file is valid.
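
The steps above could be sketched roughly as follows. This is not what the linked PR implements; it is an illustration under assumptions: `client` is a redis-py style object with `execute_command`, `kubectl` and `redis-check-rdb` are on the PATH, `redis-check-rdb` exits non-zero for an invalid file, and the namespace, pod name and destination path are supplied by the caller (the names in the usage note are hypothetical).

```python
import subprocess
import time


def wait_for_new_save(client, poll_seconds=5.0, timeout_seconds=3600.0,
                      sleep=time.sleep):
    """Send BGSAVE SCHEDULE, then poll LASTSAVE until its value changes."""
    before = client.execute_command("LASTSAVE")
    client.execute_command("BGSAVE", "SCHEDULE")
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if client.execute_command("LASTSAVE") != before:
            return
        sleep(poll_seconds)
    raise TimeoutError("LASTSAVE did not change before the timeout")


def copy_command(namespace, pod, dest_path):
    """Build the kubectl cp command that pulls /data/dump.rdb from the pod."""
    return ["kubectl", "cp", f"{namespace}/{pod}:/data/dump.rdb", dest_path]


def check_command(dest_path):
    """Build the redis-check-rdb command for the copied file."""
    return ["redis-check-rdb", dest_path]


def backup_one(client, namespace, pod, dest_path,
               run=subprocess.run, sleep=time.sleep):
    """Back up one Redis server: save, wait, copy, then validate the copy."""
    wait_for_new_save(client, sleep=sleep)
    # check=True raises CalledProcessError if either command fails.
    run(copy_command(namespace, pod, dest_path), check=True)
    run(check_command(dest_path), check=True)
```

A call might look like `backup_one(redis.Redis(host="localhost", port=6379), "translator-exp", "nn-redis-0", "/path/on/hatteras/dump.rdb")`, repeated once per Redis server. Note that LASTSAVE has one-second resolution and that BGSAVE can still return an error if another background save is already running, so a production script would need to handle both cases.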

gaurav commented 1 year ago

This has been fixed in https://github.com/helxplatform/translator-devops/pull/651. Closing.

TranslatorSRI / NodeNormalization

Prepare a SOP for making manual backups of the Redis graphs #160