is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?
Define "safe".
So long as you don't publish the TAL you use during testing, nobody but us will ever know about it, so at worst you might have to apt-get purge then reinstall.
Trac comment by sra on 2016-04-25T01:25:31Z
While I had originally expected to do this sort of transition using Django migrations, this is a big enough jump and likely enough to involve multiple machines that I think there's a simpler solution:
* Old (trunk/ code): read all relevant data (/etc/rpki.conf, MySQL tables, *.{cer,key} files reachable from rpki.conf) into Python memory as some kind of simple object (could be a custom class but probably simpler if it's just a dict/list/int/str thing like one would get from parsing JSON or YAML). Write all of that out as a single file, using Python's native "Pickle" format (Python-specific, but guaranteed to work with any version of Python on any hardware).
* New (tk705/ code): read the pickle to reproduce the in-Python-memory structure, then drop data into Django ORM objects as needed. Might try to do something clever with suggesting /etc/rpki.conf changes based on comparison of what's set on the new machine with what's in the pickle, but probably not.

As a refinement, we might run the pickle through some compression program, both for size and, more importantly, for some kind of internal checksum to detect transfer errors while moving the pickle around. Heck, we could use gpg to wrap it, but let's not get carried away.
It turns out that rendering the contents of /etc/rpki.conf, a collection of MySQL databases, and disk files indicated by /etc/rpki.conf as Python dict() objects is not particularly hard. There's some redundancy (particularly if one uses the optional feature in the MySQLdb API that returns each table row as a dict()), but Pickle format is good at identifying common objects, so it's not particularly wasteful except for a bit of CPU time while generating the pickle.
While this could be generalized into some kind of back-up-the-entire-CA mechanism, that would be mission creep. At the moment, I'm focused on a specific nasty transition, which includes the raw-MySQL to Django ORM jump, which is enough of a challenge for one script.
Another thing I like about this besides its (relative) simplicity is that one can save the intermediate format. Assuming we can keep the part that generates the pickle simple enough, it should be straightforward to reassure ourselves that it has all the data we intended to save. Given that, we can isolate the more complex problem (unpacking the data into the new database) as a separate task, which we can run repeatedly until we get it right if that's what it takes: so long as the pickle is safe, no data has been lost.
Yes, of course we also tell the user to back up every freaking thing possible in addition to generating the pickle, even though we hope and intend that the pickle contains everything we need.
This scheme does assume that everything in a CA instance will fit in memory. That's not a safe assumption in the general case, but I think it's safe for everything we're likely to care about for this particular transition given state of play to date. There are variants on this scheme we could use if this were a problem, but I don't think it is.
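To make the shape of this concrete, here is a minimal sketch of the capture side (Python 2, matching the code base; the database name, credentials, and output filename are made up, and the real script also captures the disk files named in rpki.conf):

{{{
# Sketch only: capture /etc/rpki.conf and every MySQL table into one
# dict, then pickle it. The real script is potpourri/ca-pickle.py.
import cPickle
import MySQLdb, MySQLdb.cursors

world = {"rpki.conf": open("/etc/rpki.conf").read(), "sql": {}}

db = MySQLdb.connect(db="rpkid", user="rpki", passwd="fnord")  # made-up credentials
cur = db.cursor()
cur.execute("SHOW TABLES")
tables = [row[0] for row in cur.fetchall()]

# DictCursor returns each row as a dict: redundant, but Pickle notices
# shared objects, so the main cost is CPU time rather than space.
cur = db.cursor(MySQLdb.cursors.DictCursor)
for table in tables:
    cur.execute("SELECT * FROM " + table)
    world["sql"][table] = cur.fetchall()

with open("pickled-rpki", "wb") as f:
    cPickle.dump(world, f, cPickle.HIGHEST_PROTOCOL)

# Refinement discussed above: compress (and checksum) the result, eg,
#   xz -C sha256 pickled-rpki
}}}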
Trac comment by sra on 2016-04-27T13:41:41Z
for transfer check, just sha1 it on both ends
do not care about efficiency. one does not do this daily.
being tolerant of input issues on the /trunk side may be helpful. i have spared you a lot of horrifying logs, for example
{{{
Apr 26 00:08:40 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}
not sure we need backup entire CA, as we can back up machine. as you can see from above, audit and fix CA might be useful.
i think the largest dataset that would migrate would be jpnic or cnnic.
Trac comment by randy on 2016-04-27T14:15:07Z
> for transfer check, just sha1 it on both ends
Piping through "xz -C sha256" automates this.
> do not care about efficiency. one does not do this daily.
Right.
> being tolerant of input issues on the /trunk side may be helpful.
Input side is just data capture, no analysis.
> OperationalError: (2006, 'MySQL server has gone away')
You've been getting that on and off for years, with various causes, most commonly port upgrade turning mysqld off but not turning it back on. The other error messages quoted cascade from that.
> not sure we need backup entire CA, as we can back up machine. as you can see from above, audit and fix CA might be useful.
I think we are quibbling about "entire CA". Intent is to capture data that needs to be in place on the new server to continue operation, along with some minor config data which we may never need but which is easiest to capture at the same time (eg, funny settings in rpki.conf).
> i think the largest dataset that would migrate would be jpnic or cnnic.
Seems likely.
Trac comment by sra on 2016-04-27T20:48:49Z
In [changeset:"6395" 6395]:
{{{
First step of transition mechanism from trunk/ to tk705/: script to encapsulate all (well, we hope) relevant configuration and state from a trunk/ CA in a form we can easily load on another machine, or on the same machine after a software upgrade, or ....

Transfer format is an ad hoc Python dictionary, encoded in Python's native "Pickle" format, compressed by "xz" with SHA-256 integrity checking enabled. See #807.
}}}
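Given that format, the receiving end can sanity-check a transfer file with a few lines of Python 2. This sketch assumes only what the changeset message says (xz-compressed pickle with a dict at top level) and borrows the example filename used later in this ticket; xz verifies the embedded SHA-256 as it decompresses:

{{{
# Sketch: decompress, unpickle, and peek at the top-level keys.
import cPickle
import subprocess

xz = subprocess.Popen(["xz", "--decompress", "--stdout", "pickled-rpki.xz"],
                      stdout=subprocess.PIPE)
world = cPickle.load(xz.stdout)
xz.wait()
print sorted(world)   # top-level dict keys
}}}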
Trac comment by sra on 2016-04-27T22:20:20Z
except the mysql server is running. and if i restart mysql-server, the error persists. this is a rabbit hole best left unexplored if possible. focus on migration.
it is not that i have not tried
{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
Bye
ca0.rpki.net:/root# tail /var/log/messages
Apr 27 23:04:49 ca0 last message repeated 2 times
Apr 27 23:05:45 ca0 sshd[12136]: Connection closed by 198.180.150.1 [preauth]
Apr 27 23:05:45 ca0 sshguard[28474]: 198.180.150.1: should already have been blocked
Apr 27 23:06:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:06:49 ca0 rpkid[731]: cron keepalive threshold 2016-04-27T23:06:48Z has expired, breaking lock
Apr 27 23:06:49 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:49 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:08:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:08:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}
Trac comment by randy on 2016-04-27T23:09:28Z
I'm considering taking advantage of this pickled migration process to make one schema change which appears to be beyond the capabilities of Django's migration system.
Task::
Fold the rpki.irdb.models.Turtle model back into rpki.irdb.models.Parent, now that rpki.irdb.models.Rootd is gone.
Users Affected:: Users of current tk705/ branch (me, Randy, Michael) may have to drop databases and rebuild, possibly losing all current data. OK, we could fix that too, but probably not worth the trouble for just the three of us and not yet anything in production.
Details::
Current table structure was complicated to allow a Repository to link to either a Parent or a Rootd. Now that we no longer need (or have) Rootd, we no longer need this complexity. Django ORM migrations throw up their hands and whimper when asked to make a change this fundamental to a SQL primary index column (I've tried, boy howdy how I have tried), but the pickled migration code doesn't care, because it doesn't need to modify SQL tables in place. (A sketch of the change follows below.)
When:: If we're ever going to do this, we should do it now, before anybody else is using this code. Once we have external users, we're stuck with the mess.
Any objections to making this change, before Randy attempts to move ca0.rpki.net?
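For anyone not staring at the models file, a rough sketch of the shape of the change; the field names here are hypothetical, not the actual rpki.irdb models:

{{{
from django.db import models

# Old shape (commented out): Repository had to be able to point at
# either a Parent or a Rootd, so both inherited from a shared Turtle
# model and Repository linked to Turtle:
#
#   class Turtle(models.Model): ...
#   class Parent(Turtle): ...
#   class Rootd(Turtle): ...
#   class Repository(models.Model):
#       turtle = models.OneToOneField(Turtle, on_delete=models.CASCADE)

# With Rootd gone, Turtle folds back into Parent and Repository can
# link straight to Parent:

class Parent(models.Model):
    handle = models.SlugField(max_length=120)   # hypothetical field

class Repository(models.Model):
    parent = models.OneToOneField(Parent, on_delete=models.CASCADE)
}}}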
Trac comment by sra on 2016-04-29T03:20:30Z
On Fri, Apr 29, 2016 at 03:20:31AM -0000, Trac Ticket System wrote:
> Any objections to making this change, before Randy attempts to move ca0.rpki.net?
Sounds good to me.
Trac comment by melkins on 2016-04-29T16:21:33Z
sure
Trac comment by randy on 2016-04-29T22:49:22Z
uh, any progress?
Trac comment by randy on 2016-05-05T07:13:26Z
One last bug....
Trac comment by sra on 2016-05-05T11:52:16Z
OK, in theory it's ready for an alpha tester.
This is a two-stage process: the first stage runs on the machine you're evacuating, the second runs on the destination machine. This is deliberate, and should allow you to leave the old machine safely idle in case something goes horribly wrong and you need to revert.
In addition to the usual tools, you need two scripts:
* On the old (trunk/) machine, you need https://subvert-rpki.hactrn.net/trunk/potpourri/ca-pickle.py
* On the new (tk705/) machine, you need https://subvert-rpki.hactrn.net/branches/tk705/potpourri/ca-unpickle.py

You can fetch these using svn if you want to pull the whole source tree, or just fetch the individual scripts with wget, fetch, ....
On the old machine:

1. Run ca-pickle.py. This takes one mandatory argument, the name of the output file. You can call this anything you like, but since it's xz-compressed it'd probably be less confusing to call it something ending in .xz:
{{{
sudo python ca-pickle.py pickled-rpki.xz
}}}
2. Copy the file generated by ca-pickle.py to the new machine.

On the new machine:
1. Make sure you have the latest tk705/ rpki-rp and rpki-ca packages installed. Given the recent incompatible change (discussed last week) to remove the Turtle model from the irdb, you may need to purge and reinstall to clear an upgrade error:
{{{
sudo apt-get update
sudo apt-get purge rpki-ca rpki-rp
sudo apt-get install rpki-rp rpki-ca
}}}
2. The upgrade itself needs to take place with the servers disabled, and includes a bit of additional voodoo (notes follow):
{{{
sudo service rpki-ca stop
sudo killall -u rpki
sudo rm -rf /usr/share/rpki/*.{tal,cer} /usr/share/rpki/publication/* /usr/share/rpki/rrdp-publication/* /var/log/rpki/*
sudo rpki-sql-setup --postgresql-root-username postgres drop
sudo install -d -o rpki -g rpki /var/run/rpki /var/log/rpki /usr/share/rpki/publication /usr/share/rpki/rrdp-publication
sudo rpki-sql-setup --postgresql-root-username postgres create
sudo sudo -u rpki rpki-manage migrate rpkidb --settings rpki.django_settings.rpkid --no-color
sudo sudo -u rpki rpki-manage migrate pubdb --settings rpki.django_settings.pubd --no-color
sudo sudo -u rpki rpki-manage migrate irdb --settings rpki.django_settings.irdb --no-color
sudo sudo -u rpki rpki-manage migrate --settings rpki.django_settings.gui --no-color
sudo sudo -u rpki python ca-unpickle.py --rootd pickled-rpki.xz
rpkic update_bpki
sudo service rpki-ca restart
sleep 30
rpkic update_bpki 2>&1
}}}
Notes on the long script above:

* Mind which commands run as root: any sudo which isn't immediately followed by a -u rpki runs as root rather than as the rpki user.
* ca-unpickle does the real work (more below). The --rootd flag says you want it to attempt to transition the keypair from an old rootd-based configuration. Don't specify this unless you need it; the rootd code is considerably more complicated (and fragile) than the rest of the upgrade.
* The first rpkic update_bpki is expected to whine about not being able to push data into the servers, because you still have the servers turned off at this point. This is normal, and is the reason why you run it again after a short wait for the servers to start up.

As to what's really going on here:
* ca-pickle reads /etc/rpki.conf, the contents of the old MySQL databases, and whatever files it can locate from the names it sees in /etc/rpki.conf, loads them all into one big in-memory Python object (top level is a dict()), then runs that object through Python's cPickle module and the xz compressor to dump the whole thing as a portable file which should be readable by Python on any supported platform. A sufficiently big installation would hit memory problems with this approach, but I doubt that any current installation running this code has hit that limit yet.
* ca-unpickle does two separate things after uncompressing and unpickling the data structure created by ca-pickle:
  * It loads the captured configuration and database contents into the new Django ORM databases.
  * If --rootd is specified, ca-unpickle also does some rather awful stuff to construct a usable rootd-less root configuration on the new machine. This is basically pushing on a rope, because the one rpkid data structure which absolutely must be preserved for this to work (the one that holds the RPKI root private key) is normally about six removes from direct control by anything in the back end; in order to make this work, we have to duplicate a lot of fiddly logic with parallel structures in the rpkidb and irdb databases. This is fancy nasty with raisins and cinnamon.
* The old trunk/ code was still using the awful hack of using SQL row index values as resource class names. Good riddance, but cleaning that up requires running a whole bunch of certificates through a revoke and reissue cycle. This should all happen automatically, but it's not instantaneous.

Trac comment by sra on 2016-05-06T01:04:34Z
how long is
{{{
ca0.rpki.net:/root# python ca-pickle.py pickled-rpki.xz
}}}
expected to run?
it's been maybe 15 minutes. and mysql-server is running
{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2940
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit;
Bye
}}}
Trac comment by randy on 2016-05-07T06:28:01Z
ignore. it finally finished.
Trac comment by randy on 2016-05-07T06:32:33Z
Out of curiosity, please post size of the xz file.
I don't think I've seen ca-pickle take more than five or ten seconds, but I was testing it with small data sets on a lightly loaded VM.
Trac comment by sra on 2016-05-07T06:39:04Z
{{{
ca0.rpki.net:/root# l -h pickled-rpki.xz
-rw-------  1 root  wheel   7.1M May  7 06:28 pickled-rpki.xz
}}}
Trac comment by randy on 2016-05-07T06:40:00Z
Removed confused instructions which led to #815. That part of the instructions was just plain wrong.
Trac comment by sra on 2016-05-09T05:44:11Z
Added a --root-handle argument to ca-unpickle, so you can do:
{{{
python ca-unpickle.py blarg.xz --rootd --root-handle Root
}}}
so that the entity created from the salvaged rootd data will be named "Root" instead of getting some randomly generated UUID as its handle.
If you already have an entity named "Root", this will fail with a SQL constraint violation when it discovers that you're creating a second Tenant with the same handle, but you'd want it to fail in that case.
Trac comment by sra on 2016-05-09T17:55:55Z
Noting something I figured out while writing a report for Sandy:
If we need to take this pickled database hack beyond what will easily fit in memory, one relatively simple way of breaking the problem up into chunks would be to use the Python shelve module with gdbm. So, eg, instead of one great big enormous pickle, we could pickle each SQL table in a separate slot of the shelve database; if necessary, we could break things down even smaller, but one slot per table is an easy target.
Transfer format in this case would be a gdbm database, which we could then ship to another machine in portable format using the gdbm_dump and gdbm_load utilities, possibly compressed with xz for the same reasons we compress the current pickle format.
None of this is worth worrying about until and unless we hit a case which needs it, just making note of the technique while I remember it.
Trac comment by sra on 2016-05-10T18:06:40Z
Trac ticket #807 component rpkid priority major, owner None, created by randy on 2016-04-25T01:02:57Z, last modified 2016-05-10T18:06:40Z