dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit
Ticket to track Migration of a Root CA #813

Open sraustein opened 8 years ago

sraustein commented 8 years ago

is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Trac ticket #807 component rpkid priority major, owner None, created by randy on 2016-04-25T01:02:57Z, last modified 2016-05-10T18:06:40Z

sraustein commented 8 years ago

> is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Define "safe".

So long as you don't publish the TAL you use during testing, nobody but us will ever know about it, so at worst you might have to apt-get purge then reinstall.

Trac comment by sra on 2016-04-25T01:25:31Z

sraustein commented 8 years ago

While I had originally expected to do this sort of transition using Django migrations, this is a big enough jump and likely enough to involve multiple machines that I think there's a simpler solution:

As a refinement, we might run the pickle through some compression program, both for size and, more importantly, for some kind of internal checksum to detect transfer errors while moving the pickle around. Heck, we could use gpg to wrap it, but let's not get carried away.

It turns out that rendering the contents of /etc/rpki.conf, a collection of MySQL databases, and disk files indicated by /etc/rpki.conf as Python dict() objects is not particularly hard. There's some redundancy (particularly if one uses the optional feature in the MySQLdb API that returns each table row as a dict()), but Pickle format is good at identifying common objects, so it's not particularly wasteful except for a bit of CPU time while generating the pickle.
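A minimal sketch of the dict-then-pickle-then-xz idea described above. It assumes Python 3's standard `lzma` module in place of an external compression program, and the function names (`capture_state`, `write_pickle`) and the dictionary layout are hypothetical, not taken from the actual script.

```python
import configparser
import lzma
import pickle


def capture_state(conf_text, tables, files):
    """Bundle config text, SQL rows, and file contents into one plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return {
        # Each config section becomes an ordinary dict of option -> value.
        "rpki.conf": {s: dict(parser.items(s)) for s in parser.sections()},
        # Assumed layout: {database_name: {table_name: [row_dict, ...]}}
        "databases": tables,
        # Assumed layout: {file_path: file_contents_as_bytes}
        "files": files,
    }


def write_pickle(state, path):
    """Pickle the state and store it as .xz with a SHA-256 integrity check."""
    with lzma.open(path, "wb", format=lzma.FORMAT_XZ,
                   check=lzma.CHECK_SHA256) as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
```

Keeping the capture step this dumb (no analysis, just copying) is what makes it plausible to inspect the intermediate file and convince yourself nothing was left out.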

While this could be generalized into some kind of back-up-the-entire-CA mechanism, that would be mission creep. At the moment, I'm focused on a specific nasty transition, which includes the raw-MySQL to Django ORM jump, which is enough of a challenge for one script.

Another thing I like about this besides its (relative) simplicity is that one can save the intermediate format. Assuming we can keep the part that generates the pickle simple enough, it should be straightforward to reassure ourselves that it has all the data we intended to save. Given that, we can isolate the more complex problem (unpacking the data into the new database) as a separate task, which we can run repeatedly until we get it right if that's what it takes: so long as the pickle is safe, no data has been lost.

Yes, of course we also tell the user to back up every freaking thing possible in addition to generating the pickle, even though we hope and intend that the pickle contains everything we need.

This scheme does assume that everything in a CA instance will fit in memory. That's not a safe assumption in the general case, but I think it's safe for everything we're likely to care about for this particular transition, given the state of play to date. There are variants on this scheme we could use if this were a problem, but I don't think it is.

Trac comment by sra on 2016-04-27T13:41:41Z

sraustein commented 8 years ago

for transfer check, just sha1 it on both ends
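A minimal sketch of the "sha1 it on both ends" check, assuming Python's standard `hashlib`; the helper name `file_sha1` is hypothetical. Run it on the old machine and again on the new one, then compare the two hex strings by eye.

```python
import hashlib


def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```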

do not care about efficiency. one does not do this daily.

being as tolerant of input issues on the /trunk side may be helpful. i have spared you a lot of horrifying logs, for example

{{{
Apr 26 00:08:40 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}

not sure we need backup entire CA, as we can back up machine. as you can see from above, audit and fix CA might be useful.

i think the largest dataset that would migrate would be jpnic or cnnic.

Trac comment by randy on 2016-04-27T14:15:07Z

sraustein commented 8 years ago

> for transfer check, just sha1 it on both ends

Piping through "xz -C sha256" automates this.

> do not care about efficiency. one does not do this daily.

Right.

> being as tolerant of input issues on the /trunk side may be helpful.

Input side is just data capture, no analysis.

> OperationalError: (2006, 'MySQL server has gone away')

You've been getting that on and off for years, with various causes, most commonly port upgrade turning mysqld off but not turning it back on. The other error messages quoted cascade from that.

> not sure we need backup entire CA, as we can back up machine. as you can see from above, audit and fix CA might be useful.

I think we are quibbling about "entire CA". Intent is to capture data that needs to be in place on the new server to continue operation, along with some minor config data which we may never need but which is easiest to capture at the same time (eg, funny settings in rpki.conf).

> i think the largest dataset that would migrate would be jpnic or cnnic.

Seems likely.

Trac comment by sra on 2016-04-27T20:48:49Z

sraustein commented 8 years ago

In [changeset:"6395" 6395]: {{{

!CommitTicketReference repository="" revision="6395"

First step of transition mechanism from trunk/ to tk705/: script to encapsulate all (well, we hope) relevant configuration and state from a trunk/ CA in a form we can easily load on another machine, or on the same machine after a software upgrade, or ....

Transfer format is an ad hoc Python dictionary, encoded in Python's native "Pickle" format, compressed by "xz" with SHA-256 integrity checking enabled. See #807. }}}
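For illustration, a sketch of how the receiving side might read such a file back, assuming Python 3's standard `lzma` module; `load_pickle` is a hypothetical name, not the real loader. It decompresses the whole file before unpickling, so damage picked up in transit is reported by the xz layer before any pickle data is interpreted.

```python
import lzma
import pickle


def load_pickle(path):
    """Decompress an .xz transfer file in full, then unpickle the result."""
    with open(path, "rb") as f:
        compressed = f.read()
    # lzma.decompress() raises lzma.LZMAError if the stream is damaged,
    # which includes a mismatch against the embedded integrity check.
    raw = lzma.decompress(compressed, format=lzma.FORMAT_XZ)
    # Unpickling runs arbitrary code from the file, so only load files
    # that came from a machine you control.
    return pickle.loads(raw)
```

This relies on the same everything-fits-in-memory assumption as the rest of the scheme.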

Trac comment by sra on 2016-04-27T22:20:20Z

sraustein commented 8 years ago

except the mysql server is running. and if i restart mysql-server, the error persists. this is a rabbit hole best left unexplored if possible. focus on migration.

it is not that i have not tried
{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> Bye
ca0.rpki.net:/root# tail /var/log/messages
Apr 27 23:04:49 ca0 last message repeated 2 times
Apr 27 23:05:45 ca0 sshd[12136]: Connection closed by 198.180.150.1 [preauth]
Apr 27 23:05:45 ca0 sshguard[28474]: 198.180.150.1: should already have been blocked
Apr 27 23:06:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:06:49 ca0 rpkid[731]: cron keepalive threshold 2016-04-27T23:06:48Z has expired, breaking lock
Apr 27 23:06:49 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:49 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:08:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:08:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}

Trac comment by randy on 2016-04-27T23:09:28Z

sraustein commented 8 years ago

I'm considering taking advantage of this pickled migration process to make one schema change which appears to be beyond the capabilities of Django's migration system.

Task:: Fold rpki.irdb.models.Turtle model back into rpki.irdb.models.Parent now that rpki.irdb.models.Rootd is gone.

Users Affected:: Users of current tk705/ branch (me, Randy, Michael) may have to drop databases and rebuild, possibly losing all current data. OK, we could fix that too, but probably not worth the trouble for just the three of us and not yet anything in production.

Details:: Current table structure was complicated to allow a Repository to link to either a Parent or a Rootd. Now that we no longer need (or have) Rootd, we no longer need this complexity. Django ORM migrations throw up their hands and whimper when asked to make a change this fundamental to a SQL primary index column (I've tried, boy howdy how I have tried), but the pickled migration code doesn't care, because it doesn't need to modify SQL tables in place.
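To illustrate why the pickled route sidesteps the in-place migration problem: once the rows are plain Python dicts, folding one table into another is ordinary dictionary work done before anything is loaded into the new database. The sketch below is purely illustrative; the column names (`id`, `turtle_ptr_id`, `service_uri`, `handle`) and the function name are assumptions, not the real schema or code.

```python
def fold_turtles_into_parents(turtle_rows, parent_rows):
    """Merge each Turtle row into the Parent row that points at it.

    Both arguments are lists of row dicts as they might come out of the
    pickle. The merged row keeps the Turtle's id as its primary key, so
    rows in other tables that referred to the Turtle stay valid.
    """
    turtles = {row["id"]: row for row in turtle_rows}
    merged = []
    for parent in parent_rows:
        turtle = turtles[parent["turtle_ptr_id"]]
        # Copy every Parent column except the pointer we are eliminating.
        row = {k: v for k, v in parent.items() if k != "turtle_ptr_id"}
        row["id"] = turtle["id"]
        row["service_uri"] = turtle["service_uri"]
        merged.append(row)
    return merged
```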

When:: If we're ever going to do this, we should do it now, before anybody else is using this code. Once we have external users, we're stuck with the mess.

Any objections to making this change, before Randy attempts to move ca0.rpki.net?

Trac comment by sra on 2016-04-29T03:20:30Z

sraustein commented 8 years ago

On Fri, Apr 29, 2016 at 03:20:31AM -0000, Trac Ticket System wrote:

> Any objections to making this change, before Randy attempts to move ca0.rpki.net?

Sounds good to me.

Trac comment by melkins on 2016-04-29T16:21:33Z

sraustein commented 8 years ago

sure

Trac comment by randy on 2016-04-29T22:49:22Z

sraustein commented 8 years ago

uh, any progress?

Trac comment by randy on 2016-05-05T07:13:26Z

sraustein commented 8 years ago

One last bug....

Trac comment by sra on 2016-05-05T11:52:16Z

sraustein commented 8 years ago

OK, in theory it's ready for an alpha tester.

This is a two stage process: the first stage runs on the machine you're evacuating, the second runs on the destination machine. This is deliberate, and should allow you to leave the old machine safely idle in case something goes horribly wrong and you need to revert.

In addition to the usual tools, you need two scripts:

You can fetch these using svn if you want to pull the whole source tree, or just fetch the individual scripts with wget, fetch, ....

On the old machine:

On the new machine:

Notes on the long script above:

As to what's really going on here:

Trac comment by sra on 2016-05-06T01:04:34Z

sraustein commented 8 years ago

how long is {{{ ca0.rpki.net:/root# python ca-pickle.py pickled-rpki.xz }}}

expected to run?

it's been maybe 15 minutes. and mysql-server is running

{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2940
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit;
Bye
}}}

Trac comment by randy on 2016-05-07T06:28:01Z

sraustein commented 8 years ago

ignore. it finally finished.

Trac comment by randy on 2016-05-07T06:32:33Z

sraustein commented 8 years ago

Out of curiosity, please post size of the xz file.

I don't think I've seen ca-pickle take more than five or ten seconds, but I was testing it with small data sets on a lightly loaded VM.

Trac comment by sra on 2016-05-07T06:39:04Z

sraustein commented 8 years ago

ca0.rpki.net:/root# l -h pickled-rpki.xz
-rw------- 1 root wheel 7.1M May 7 06:28 pickled-rpki.xz

Trac comment by randy on 2016-05-07T06:40:00Z

sraustein commented 8 years ago

Removed confused instructions which led to #815. That part of the instructions was just plain wrong.

Trac comment by sra on 2016-05-09T05:44:11Z

sraustein commented 8 years ago

Added --root-handle argument to ca-unpickle, so you can do:

{{{ python ca-unpickle.py blarg.xz --rootd --root-handle Root }}}

so that the entity created from the salvaged rootd data will be named "Root" instead of some randomly generated UUID.

If you already have an entity named "Root", this will fail with a SQL constraint violation when it discovers that you're creating a second Tenant with the same handle, but you want it to fail in such a case.
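The failure mode described above is the ordinary behaviour of a uniqueness constraint. As a stand-in demonstration, assuming an in-memory SQLite database and a hypothetical `tenant` table with a unique `handle` column (not the real schema or database engine):

```python
import sqlite3


def make_db():
    """Create an in-memory database with a unique handle per tenant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tenant ("
        " id INTEGER PRIMARY KEY,"
        " handle TEXT NOT NULL UNIQUE)"
    )
    return conn


def create_tenant(conn, handle):
    """Insert a tenant; raises sqlite3.IntegrityError on a duplicate handle."""
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute("INSERT INTO tenant (handle) VALUES (?)", (handle,))


def tenant_count(conn):
    return conn.execute("SELECT COUNT(*) FROM tenant").fetchone()[0]
```

Calling `create_tenant(conn, "Root")` twice on the same connection raises on the second call and leaves the first row untouched, which is the "you want it to fail" outcome.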

Trac comment by sra on 2016-05-09T17:55:55Z

sraustein commented 8 years ago

Noting something I figured out while writing a report for Sandy:

If we need to take this pickled database hack beyond what will easily fit in memory, one relatively simple way of breaking the problem up into chunks would be to use the Python shelve module with gdbm. So, eg, instead of one great big enormous pickle, we could pickle each SQL table in a separate slot of the shelve database; if necessary, we could break things down even smaller, but one shelf per table is an easy target.

Transfer format in this case would be a gdbm database, which we could then ship to another machine in portable format using the gdbm_dump and gdbm_load utilities, possibly compressed with xz for the same reasons we compress the current pickle format.
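A minimal sketch of the one-slot-per-table idea, assuming Python 3's standard `shelve` module; the function names are hypothetical. `shelve.open()` uses whichever dbm backend the Python build provides, so to pin the format to gdbm specifically one would have to open the database through `dbm.gnu` (available only where Python was built with gdbm) and wrap it in `shelve.Shelf`.

```python
import pickle
import shelve


def shelve_tables(path, tables):
    """Store each table's rows under its own key in a new shelve database.

    `tables` maps table name -> list of row dicts. Each value is pickled
    separately, so no single pickle has to hold the whole CA.
    """
    with shelve.open(path, flag="n",
                     protocol=pickle.HIGHEST_PROTOCOL) as shelf:
        for name, rows in tables.items():
            shelf[name] = rows


def iter_tables(path):
    """Yield (table name, rows) pairs, loading one table at a time."""
    with shelve.open(path, flag="r") as shelf:
        for name in shelf:
            yield name, shelf[name]
```

The loader then only ever holds one table in memory, at the cost of a transfer format that is a database file rather than a single stream.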

None of this is worth worrying about until and unless we hit a case which needs it, just making note of the technique while I remember it.

Trac comment by sra on 2016-05-10T18:06:40Z