sraustein commented 8 years ago

is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Trac ticket #807 component rpkid priority major, owner None, created by randy on 2016-04-25T01:02:57Z, last modified 2016-05-10T18:06:40Z

sraustein commented 8 years ago

is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Define "safe".

So long as you don't publish the TAL you use during testing, nobody but us will ever know about it, so at worst you might have to apt-get purge then reinstall.

Trac comment by sra on 2016-04-25T01:25:31Z

sraustein commented 8 years ago

While I had originally expected to do this sort of transition using Django migrations, this is a big enough jump and likely enough to involve multiple machines that I think there's a simpler solution:

On old machine (trunk/ code): read all relevant data (/etc/rpki.conf, MySQL tables, *.{cer,key} files reachable from rpki.conf) into Python memory as some kind of simple object (could be a custom class but probably simpler if it's just a dict/list/int/str thing like one would get from parsing JSON or YAML). Write all of that out as a single file, using Python's native "Pickle" format (Python-specific, but guaranteed to work with any version of Python on any hardware).
Copy the pickled data to wherever it needs to go.
On new machine (tk705/ code): read the pickle to reproduce the in-Python-memory structure, then drop data into Django ORM objects as needed. Might try to do something clever with suggesting /etc/rpki.conf changes based on comparison of what's set on new machine with what's in the pickle, but probably not.

As a refinement, we might run the pickle through some compression program, both for size and, more importantly, for some kind of internal checksum to detect transfer errors while moving the pickle around. Heck, we could use gpg to wrap it, but let's not get carried away.

It turns out that rendering the contents of /etc/rpki.conf, a collection of MySQL databases, and disk files indicated by /etc/rpki.conf as Python dict() objects is not particularly hard. There's some redundancy (particularly if one uses the optional feature in the MySQLdb API that returns each table row as a dict()), but Pickle format is good at identifying common objects, so it's not particularly wasteful except for a bit of CPU time while generating the pickle.

While this could be generalized into some kind of back-up-the-entire-CA mechanism, that would be mission creep. At the moment, I'm focused on a specific nasty transition, which includes the raw-MySQL to Django ORM jump, which is enough of a challenge for one script.

Another thing I like about this besides its (relative) simplicity is that one can save the intermediate format. Assuming we can get keep the part that generates the pickle simple enough, it should be straightforward to reassure ourselves that it has all the data we intended to save. Given that, we can isolate the more complex problem (unpacking the data into the new database) as a separate task, which we can run repeatedly until we get it right if that's what it takes: so long as the pickle is safe, no data has been lost.

Yes, of course we also tell the user to back up every freaking thing possible in addition to generating the pickle, even though we hope and intend that the pickle contains everything we need.

This scheme does assume that everything in a CA instance will fit in memory. That's not a safe assumption in the general case, but I think it's safe for everything we're likely to care about for this particular transition given state of play to date. There are variants on this scheme we could use if this were a problem, but I don't think it is.

Trac comment by sra on 2016-04-27T13:41:41Z

sraustein commented 8 years ago

for transfer check, just sha1 it on both ends

do not care about efficiency. one does not do this daily.

being as tolerant of input issues on the /trunk side may be helpful. i have spared you a lot of horrifying logs, for example

{{{ Apr 26 00:08:40 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache! Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') Apr 26 00:09:07 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache! Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') }}}

not sure we need backup entire CA, as we can back up machine. as you can see from above, audit and fix CA might be useful.

i think the largest dataset that would migrate would be jpnic or cnnic.

Trac comment by randy on 2016-04-27T14:15:07Z

sraustein commented 8 years ago

for transfer check, just sha1 it on both ends

Piping through "xz -C sha256" automates this.

do not care about efficiency. one does not do this daily.

Right.

being as tolerant of input issues on the /trunk side may be helpful.

Input side is just data capture, no analysis.

OperationalError: (2006, 'MySQL server has gone away')

You've been getting that on and off for years, with various causes, most commonly port upgrade turning mysqld off but not turning it back on. The other error messages quoted cascade from that.

not sure we need backup entire CA, as we can back up machine. as you can see from above, audit and fix CA might be useful.

I think we are quibbling about "entire CA". Intent is to capture data that needs to be in place on the new server to continue operation, along with some minor config data which we may never need but which is easiest to capture at the same time (eg, funny settings in rpki.conf).

i think the largest dataset that would migrate would be jpnic or cnnic.

Seems likely.

Trac comment by sra on 2016-04-27T20:48:49Z

sraustein commented 8 years ago

In [changeset:"6395" 6395]: {{{

!CommitTicketReference repository="" revision="6395"

First step of transition mechanism from trunk/ to tk705/: script to encapsulate all (well, we hope) relevant configuration and state from a trunk/ CA in a form we can easily load on another machine, or on the same machine after a software upgrade, or ....

Transfer format is an ad hoc Python dictionary, encoded in Python's native "Pickle" format, compressed by "xz" with SHA-256 integrity checking enabled. See #807. }}}

Trac comment by sra on 2016-04-27T22:20:20Z

sraustein commented 8 years ago

except the mysql server is running. and if i restart mysql-server, the error persists. this is a rabbit hole best left unexplored if possible. focus on migration.

it is not that i have not tried {{{ ca0.rpki.net:/root# mysql -p Enter password: Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 13 Server version: 5.5.49 Source distribution

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> Bye ca0.rpki.net:/root# tail /var/log/messages Apr 27 23:04:49 ca0 last message repeated 2 times Apr 27 23:05:45 ca0 sshd[12136]: Connection closed by 198.180.150.1 [preauth] Apr 27 23:05:45 ca0 sshguard[28474]: 198.180.150.1: should already have been blocked Apr 27 23:06:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache! Apr 27 23:06:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') Apr 27 23:06:49 ca0 rpkid[731]: cron keepalive threshold 2016-04-27T23:06:48Z has expired, breaking lock Apr 27 23:06:49 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache! Apr 27 23:06:49 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') Apr 27 23:08:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache! Apr 27 23:08:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away') }}}

Trac comment by randy on 2016-04-27T23:09:28Z

sraustein commented 8 years ago

I'm considering taking advantage of this pickled migration process to make one schema change which appears to be beyond the capabilities of Django's migration system.

Task:: Fold rpki.irdb.models.Turtle model back into rpki.irdb.models.Parent now that rpki.irdb.models.Rootd is gone.

Users Affected:: Users of current tk705/ branch (me, Randy, Michael) may have to drop databases and rebuild, possibly losing all current data. OK, we could fix that too, but probably not worth the trouble for just the three of us and not yet anything in production.

Details:: Current table structure was complicated to allow a Repository to link to either a Parent or a Rootd. Now that we no longer need (or have) Rootd, we no longer need this complexity. Django ORM migrations throw up their hands and whimper when asked to make a change this fundamental to a SQL primary index column (I've tried, boy howdy how I have tried), but the pickled migration code doesn't care, because it doesn't need to modify SQL tables in place.

When:: If we're ever going to do this, we should do it now, before anybody else is using this code. Once we have external users, we're stuck with the mess.

Any objections to making this change, before Randy attempts to move ca0.rpki.net?

Trac comment by sra on 2016-04-29T03:20:30Z

sraustein commented 8 years ago

On Fri, Apr 29, 2016 at 03:20:31AM -0000, Trac Ticket System wrote:

Any objections to making this change, before Randy attempts to move ca0.rpki.net?

Sounds good to me.

Trac comment by melkins on 2016-04-29T16:21:33Z

sraustein commented 8 years ago

sure

Trac comment by randy on 2016-04-29T22:49:22Z

sraustein commented 8 years ago

uh, any progress?

Trac comment by randy on 2016-05-05T07:13:26Z

sraustein commented 8 years ago

One last bug....

Trac comment by sra on 2016-05-05T11:52:16Z

sraustein commented 8 years ago

OK, in theory it's ready for an alpha tester.

This is a two stage process: the first stage runs on the machine you're evacuating, the second runs on the destination machine. This is deliberate, and should allow you to leave the old machine safely idle in case something goes horribly wrong and you need to revert.

In addition to the usual tools, you need two scripts:

On the old (trunk/) machine, you need https://subvert-rpki.hactrn.net/trunk/potpourri/ca-pickle.py
On the new (tk705/) machine, you need https://subvert-rpki.hactrn.net/branches/tk705/potpourri/ca-unpickle.py

You can fetch these using svn if you want to pull the whole source tree, or just fetch the individual scripts with wget, fetch, ....

On the old machine:

Stop the rpki servers (rpkid, irdbd, pubd).
Run ca-pickle.py. This takes one mandatory argument, the name of the output file. You can call this anything you like, but since it's xz-compressed it'd probably be less confusing to call it something ending in .xz:

{{{ sudo python ca-pickle.py pickled-rpki.xz }}}
Leave the servers shut off, and scp the file written by ca-pickle.py to the new machine.

On the new machine:

Make sure you have the latest tk705/ rpki-rp and rpki-ca packages installed. Given the recent incompatible change (discussed last week) to remove the Turtle model from the irdb, you may need to purge and reinstall to clear an upgrade error:

{{{ sudo apt-get update sudo apt-get purge rpki-ca rpki-rp sudo apt-get install rpki-rp rpki-ca }}}
The upgrade itself needs to take place with the servers disabled, and includes a bit of additional voodoo (notes follow):

{{{ sudo service rpki-ca stop sudo killall -u rpki sudo rm -rf /usr/share/rpki/.{tal,cer} /usr/share/rpki/publication/ /usr/share/rpki/rrdp-publication/* /var/log/rpki/* sudo rpki-sql-setup --postgresql-root-username postgres drop sudo install -d -o rpki -g rpki /var/run/rpki /var/log/rpki /usr/share/rpki/publication /usr/share/rpki/rrdp-publication sudo rpki-sql-setup --postgresql-root-username postgres create sudo sudo -u rpki rpki-manage migrate rpkidb --settings rpki.django_settings.rpkid --no-color sudo sudo -u rpki rpki-manage migrate pubdb --settings rpki.django_settings.pubd --no-color sudo sudo -u rpki rpki-manage migrate irdb --settings rpki.django_settings.irdb --no-color sudo sudo -u rpki rpki-manage migrate --settings rpki.django_settings.gui --no-color sudo sudo -u rpki python ca-unpickle.py --rootd pickled-rpki.xz rpkic update_bpki sudo service rpki-ca restart sleep 30 rpkic update_bpki 2>&1 }}}
If nothing horrible has happened yet, wait five or ten minutes for things to settle down, then you should be in business on the new server.

Notes on the long script above:

If you're running as root, you can omit any sudo which isn't immediately followed by a -u rpki.
The service command should shut down the servers. The killall is paranoia in case some cron job happens to be using the database at exactly the wrong moment -- PostgreSQL won't let you drop the database while any process has it open.
The rm, database drop, install and database create are just wiping the state already present from the install and whatever testing you did, so we can start fresh. The Django migrations are needed to rebuild the database schemas after the drop and create cycle.
ca-unpickle does the real work (more below). The --rootd flag says you want it to attempt to transition the keypair from an old rootd-based configuration. Don't specify this unless you need it, the rootd code is considerably more complicated (and fragile) than the rest of the upgrade.
The first rpkic update_bpki is expected to whine about not being able to push data into the servers, because you still have the servers turned off at this point. This is normal, and is the reason why you run it again after a short wait for the servers to start up.

As to what's really going on here:

The core mechanism is fairly simple: ca-pickle reads /etc/rpki.conf, the contents of the old MySQL databases, and whatever files it can locate from the names it sees in /etc/rpki.conf, loads them all into one big in-memory Python object (top level is a dict()), then runs that object through Python's cPickle module and the xz compressor to dump the whole thing as a portable file which should be readable by Python on any supported platform. A sufficiently big installation would hit memory problems with this approach, but I doubt that any current installation running this code has hit that limit yet.
ca-unpickle does two separate things after uncompressing and unpickling the data structure created by ca-pickle:
1. It translates the old data captured from MySQL into Django ORM objects on the new machine. This is tedious but straightforward, other than a few minor issues like updating machine-local URIs to match changed port numbers and so forth. For the most part, this is exactly the same thing we would have had to do in a Django data migration had we taken that approach, but without the requirement that the old and new databases both be reachable at the same time (or even be installed on the same machine).
2. If --rootd is specified, ca-unpickle also does some rather awful stuff to construct a usable rootd-less root configuration on the new machine. This is basically pushing on a rope, because the one rpkid data structure which absolutely must be preserved for this to work (the one that holds the RPKI root private key) is normally about six removes from direct control by anything in the back end; in order to make this work, we have to duplicate a lot of fiddly logic with parallel structures in the rpkidb and irdb databases. This is fancy nasty with raisins and cinnamon.
The reason you have to let things sit for a few minutes after the transition is that, even with all the awfulness described above, there's still some internal cleanup that the daemons have to perform after they regain control. For example, all of the "resource class" values in the RPKI up-down protocol have changed, because the trunk/ code was still using the awful hack of using SQL row index values as resource class names. Good riddance, but cleaning that up requires running a whole bunch of certificates have to run through a revoke and reissue cycle. This should all happen automatically, but it's not instantaneous.

Trac comment by sra on 2016-05-06T01:04:34Z

sraustein commented 8 years ago

how long is {{{ ca0.rpki.net:/root# python ca-pickle.py pickled-rpki.xz }}}

expected to run?

it's been maybe 15 minutes. and mysql-server is running

{{{ ca0.rpki.net:/root# mysql -p Enter password: Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 2940 Server version: 5.5.49 Source distribution

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit; Bye }}}

Trac comment by randy on 2016-05-07T06:28:01Z

sraustein commented 8 years ago

ignore. it finally finished.

Trac comment by randy on 2016-05-07T06:32:33Z

sraustein commented 8 years ago

Out of curiosity, please post size of the xz file.

I don't think I've seen ca-pickle take more than five or ten seconds, but I was testing it with small data sets on a lightly loaded VM.

Trac comment by sra on 2016-05-07T06:39:04Z

sraustein commented 8 years ago

ca0.rpki.net:/root# l -h pickled-rpki.xz -rw------- 1 root wheel 7.1M May 7 06:28 pickled-rpki.xz

Trac comment by randy on 2016-05-07T06:40:00Z

sraustein commented 8 years ago

Removed confused instructions which led to #815. That part of the instructions was just plain wrong.

Trac comment by sra on 2016-05-09T05:44:11Z

sraustein commented 8 years ago

Added --root-handle argument to ca-unpickle, so you can do:

{{{ python ca-unpickle.py blarg.xz --rootd --root-handle Root }}}

so that the name of the entity created from the salvaged rootd data will be named "Root" instead of some randomly generated UUID.

If you already have an entity named "Root", this will fail with a SQL constraint violation when it discovers that you're creating a second Tenant with the same handle, but you want it to fail in such a case.

Trac comment by sra on 2016-05-09T17:55:55Z

sraustein commented 8 years ago

Noting something I figured out while writing a report for Sandy:

If we need to take this pickled database hack beyond what will easily fit in memory, one relatively simple way of breaking the problem up into chunks would be to use the Python shelve module with gdbm. So, eg, instead of one great big enormous pickle, we could pickle each SQL table in a separate slot of the shelve database; if necessary, we could break things down even smaller, but one shelf per table is an easy target.

Transfer format in this case would be a gdbm database, which we could then ship to another machine in portable format using the gdbm_dump and gdbm_load utilities, possibly compressed with xz for the same reasons we compress the current pickle format.

None of this is worth worrying about until and unless we hit a case which needs it, just making note of the technique while I remember it.

Trac comment by sra on 2016-05-10T18:06:40Z

dragonresearch / rpki.net

Ticket to track Migration of a Root CA #813

!CommitTicketReference repository="" revision="6395"