dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit
53 stars 30 forks source link

ca0.rpki.mnet (old /trunk): show_received_resources times out #822

Closed sraustein closed 8 years ago

sraustein commented 8 years ago

{{{ rpkic> select_identity rgnet rpkic> show_received_resources 2016-05-08 10:25:12 rpkic[69957]: Timeout while in state request-sent 2016-05-08 10:25:12 rpkic[69957]: Error on HTTP client connection ca0.rpki.net:4404 <class 'rpki.exceptions.HTTPTimeout'> 2016-05-08 10:25:12 rpkic[69957]: Closing due to error: HTTPTimeout 2016-05-08 10:25:12 rpkic[69957]: Returning exception HTTPTimeout() to caller: Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/rpki/cli.py", line 70, in onecmd return cmd.Cmd.onecmd(self, line) File "/usr/local/lib/python2.7/cmd.py", line 221, in onecmd return func(arg) File "/usr/local/lib/python2.7/site-packages/rpki/cli.py", line 259, in wrapped return func(self, parser.parse_args(shlex.split(arg))) File "/usr/local/lib/python2.7/site-packages/rpki/rpkic.py", line 612, in do_show_received_resources rpki.left_right.list_received_resources_elt.make_pdu(self_handle = self.zoo.handle)): File "/usr/local/lib/python2.7/site-packages/rpki/irdb/zookeeper.py", line 1074, in call_rpkid return call_rpkid(*pdus) File "/usr/local/lib/python2.7/site-packages/rpki/http.py", line 808, in handle_timeout raise rpki.exceptions.HTTPTimeout HTTPTimeout rpkic> show_roa_requests ASN: 3927 IPv4: 198.180.150.0/24 ASN: 3970 IPv4: 192.169.0.0/23 ASN: 3970 IPv4: 216.21.0.0/24 ASN: 3130 IPv4: 198.180.153.0/24 ASN: 3130 IPv4: 147.28.0.0/16 ASN: 3130 IPv4: 192.83.230.0/24 ASN: 3130 IPv4: 198.133.206.0/24 ASN: 4128 IPv4: 147.28.224.0/19 ASN: 4128 IPv4: 198.180.152.0/24 rpkic> }}}

Trac ticket #816 component rpkid priority blocker, owner None, created by randy on 2016-05-08T14:24:19Z, last modified 2016-06-08T22:21:57Z

sraustein commented 8 years ago

What machine is this on, and was rpkid in fact running on this machine at the time you ran this rpkic command?

Trac comment by sra on 2016-05-08T16:43:05Z

sraustein commented 8 years ago

it is running on ca0.rpki.net as it almost says in the subject, aka altCA.

as i have been warning for a couple of months, the logs from that machine have been scary. which is why i put some urgency on moving off of it.

yes, mysql and rpkid were/are running. notice i showed you that it successfully did

{{{ rpkic> show_roa_requests ASN: 3927 IPv4: 198.180.150.0/24 ASN: 3970 IPv4: 192.169.0.0/23 ASN: 3970 IPv4: 216.21.0.0/24 ASN: 3130 IPv4: 198.180.153.0/24 ASN: 3130 IPv4: 147.28.0.0/16 ASN: 3130 IPv4: 192.83.230.0/24 ASN: 3130 IPv4: 198.133.206.0/24 ASN: 4128 IPv4: 147.28.224.0/19 ASN: 4128 IPv4: 198.180.152.0/24 rpkic> }}}

Trac comment by randy on 2016-05-08T21:32:36Z

sraustein commented 8 years ago

rpkic> show_roa_requests

That doesn't involve talking to servers, just looking in the IRDB.

ps and lsof show that rpkid is running on ca0.rpki.net, but its log shows it unable to talk to mysqld, or likely to do much of anything else until restarted.

In any case, none of this has any obvious link to stage of the GUI in the new instance.

Trac comment by sra on 2016-05-08T22:43:58Z

sraustein commented 8 years ago

it has been restarted many times.

i suspect the problem is that mysql is not recognizing the rpki user

i wondered if the user/pass part of the mysql database was farbled, so i reset it and tested and it seems ok

{{{ ca0.rpki.net:/root# mysql -u rpki -p rpkid Enter password: Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A ... }}}

i restarted servers just for the heck of it

{{{ ca0.rpki.net:/root# service rpki-ca restart ca0.rpki.net:/root# psg rpki.conf root 3673 3.2 4.1 169804 41544 - Ss 10:58PM 0:00.37 /usr/local/bin/python2.7 /usr/local/libexec/irdbd --config /usr/local/etc/rpki.conf --log-level warning --log-syslog daemon root 3675 1.6 2.4 139544 24720 - Ss 10:58PM 0:00.06 /usr/local/bin/python2.7 /usr/local/libexec/rpkid --config /usr/local/etc/rpki.conf --log-level warning --log-syslog daemon root 3677 1.0 2.3 138516 23548 - Ss 10:58PM 0:00.01 /usr/local/bin/python2.7 /usr/local/libexec/pubd --config /usr/local/etc/rpki.conf --log-level warning --log-syslog daemon root 3679 1.0 2.3 139024 23868 - Ss 10:58PM 0:00.00 /usr/local/bin/python2.7 /usr/local/libexec/rootd --config /usr/local/etc/rpki.conf --log-level warning --log-syslog daemon }}}

the symptom reproduces

{{{ ca0.rpki.net:/root# rpkic rpkic> select_identity rgnet rpkic> show_received_resources 2016-05-08 23:04:52 rpkic[3685]: Timeout while in state request-sent 2016-05-08 23:04:52 rpkic[3685]: Error on HTTP client connection ca0.rpki.net:4404 <class 'rpki.exceptions.HTTPTimeout'> 2016-05-08 23:04:52 rpkic[3685]: Closing due to error: HTTPTimeout 2016-05-08 23:04:52 rpkic[3685]: Returning exception HTTPTimeout() to caller: Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/rpki/cli.py", line 70, in onecmd return cmd.Cmd.onecmd(self, line) File "/usr/local/lib/python2.7/cmd.py", line 221, in onecmd return func(arg) File "/usr/local/lib/python2.7/site-packages/rpki/cli.py", line 259, in wrapped return func(self, parser.parse_args(shlex.split(arg))) File "/usr/local/lib/python2.7/site-packages/rpki/rpkic.py", line 612, in do_show_received_resources rpki.left_right.list_received_resources_elt.make_pdu(self_handle = self.zoo.handle)): File "/usr/local/lib/python2.7/site-packages/rpki/irdb/zookeeper.py", line 1074, in call_rpkid return call_rpkid(*pdus) File "/usr/local/lib/python2.7/site-packages/rpki/http.py", line 808, in handle_timeout raise rpki.exceptions.HTTPTimeout HTTPTimeout }}}

Trac comment by randy on 2016-05-08T23:07:57Z

sraustein commented 8 years ago

Running the MySQL login sequence in the Python DB API by hand works, so it's not (just) that the Python MySQLdb module is broken.

Logging reconfigured to use files in /var/log/rpki so we can enable more verbose logging, backtraces, etc, and maybe get more details.

Trac comment by sra on 2016-05-08T23:35:47Z

sraustein commented 8 years ago

from /var/log/rpkid

{{{ 2016-05-08 23:40:58 rpkid[5604]: Starting cron run 2016-05-08 23:40:58 rpkid[5604]: Error pulling self_elts from SQL, maybe SQL server is down? Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/rpki/rpkid.py", line 373, in cron selves = rpki.left_right.self_elt.sql_fetch_all(self) File "/usr/local/lib/python2.7/site-packages/rpki/sql.py", line 248, in sql_fetch_all return cls.sql_fetch_where(gctx, None) File "/usr/local/lib/python2.7/site-packages/rpki/sql.py", line 259, in sql_fetch_where gctx.sql.execute(cls.sql_template.select) File "/usr/local/lib/python2.7/site-packages/rpki/sql.py", line 115, in execute return self._wrap_execute(self.cur.execute, query, args) File "/usr/local/lib/python2.7/site-packages/rpki/sql.py", line 105, in _wrap_execute if self.ping_threshold is None or now > self.timestamp + self.ping_threshold: AttributeError: 'session' object has no attribute 'ping_threshold' }}}

Trac comment by randy on 2016-05-08T23:42:10Z

sraustein commented 8 years ago

MySQLdb API is still raising

{{{ OperationalError: (2006, 'MySQL server has gone away') }}}

This is coming directly from MySQLdb, not anything we wrote.

As a test, I have it now running the Connection.ping() command immediately before every Cursor.execute(). The ping is succeeding, and the execute is failing a few milliseconds later.

So this looks like a bug in mysqld or MySQLdb. Dunno how or why. Not obvious what to do about it either.

Will check back after dinner but at the moment I'm stumped.

Not obvious that we should be putting serious work into debugging this, given that we want both this instance and this code base to go away.

Trac comment by sra on 2016-05-08T23:46:46Z

sraustein commented 8 years ago

{{{ AttributeError: 'session' object has no attribute 'ping_threshold' }}}

Ignore, that one was just a typo during a test.

Trac comment by sra on 2016-05-08T23:48:15Z

sraustein commented 8 years ago

agree that debugging may be wasted. but we need those data. though i suppose i could type a lot.

Trac comment by randy on 2016-05-08T23:48:24Z

sraustein commented 8 years ago

agree that debugging may be wasted. but we need those data.

I am pretty sure we already have those data on newca, or at least in the pickle file.

If it's not in the pickle file, we probably don't have it at all.

Perhaps it would make more sense to focus on newca?

Trac comment by sra on 2016-05-08T23:49:47Z

sraustein commented 8 years ago

if i can not migrate, others may have problems. my first victim was gonna be mark or chris.

and i doubt i recorded the user/pass for the children. but i guess i could wait for the whining. except 50% will just walk away.

go have dinner.

Trac comment by randy on 2016-05-08T23:51:57Z

sraustein commented 8 years ago

I doubt that the current problem on the old CA has anything to do with migration per se. Your broken machine does not necessarily translate to migration problems for anybody else.

One possible culprit for your broken machine might be SQL rights settings. It occurred to me that we're not seeing problems on read operations, only on write operations. Not quite a smoking gun but worth checking.

Trac comment by sra on 2016-05-09T00:04:32Z

sraustein commented 8 years ago

It occurred to me that we're not seeing problems on read operations, only on write operations.

So much for that theory. See /root/test-sql.py, which opens rpkid's database, makes then undoes a trivial change, without errors.

So I still don't know what MySQLdb is on about here. The only thing left that I can think of is that there's something about the particular SQL commands in question, perhaps some new permutation of the "I'm sorry Sir but your ASN.1 DER BLOBs are not valid UTF-8" idiocy, or something like that.

Trac comment by sra on 2016-05-09T01:29:05Z

sraustein commented 8 years ago

i.e. it might affect chris, mark, ...

randy

Trac comment by randy on 2016-05-09T03:32:20Z

sraustein commented 8 years ago

i.e. it might affect chris, mark, ...

WHAT might affect Chris, Mark, ... ?

We have no evidence that there's any connection whatsoever between an inexplicable bug that's been going on for months in one instance of the old code and hypothetical problems with the transition script that will let you and others move to the new code?

I really don't get this.

Trac comment by sra on 2016-05-09T03:50:53Z

sraustein commented 8 years ago

and how am i supposed to get the root cert and tal over from ca0.rpki.net to the newca0?

Trac comment by randy on 2016-05-09T09:02:19Z

sraustein commented 8 years ago

and how am i supposed to get the root cert and tal over from ca0.rpki.net to the newca0?

They should already be in the database on newca0, you just need to extract them. You already know the commands for that: rpkic's extract_root_tal and extract_root_certificate commands. So:

{{{ rpkic list_tenant_handles }}}

Pick the one you didn't create (it'll look like a UUID), call that $handle.

{{{ rpkic -i $handle extract_root_tal rpkic -i $handle extract_root_certificate }}}

and install the resulting files in /usr/share/rpki/ and /usr/share/rpki/rrdp-publication/, respectively.

Trac comment by sra on 2016-05-09T11:17:14Z

sraustein commented 8 years ago

Sorry, should have said:

Installing the .tal file in /usr/share/rpki/ is just a convention, it serve any direct purpose.

Installing the root certificate into /usr/share/rpki/rrdp-publication/ puts it where the generated TAL expects to find it when using HTTPS: in the top-level directory of the https://host.name/rrdp/ tree.

The reason why I like putting the .tal into /usr/share/rpki/ is that it's relatively easy to remember and gives me a simple way to do a quick test with rcynic:

{{{ rcynic -c /dev/null -t /usr/share/rpki -x rcynic.xml --log-level debug }}}

Since the default SQL configuration for rcynic in the absence of a configuration file is SQLite3 with a database of ./rcynic.db, this is a quick way of saying "run rcynic using the local TAL as a starting point with output to an SQLite3 database and an XML file in the current directory". Not something one wants to run in production, but useful for quick tests to see whether the local CA is publishing something that looks halfway sane.

Trac comment by sra on 2016-05-09T11:54:49Z

sraustein commented 8 years ago

and how am i supposed to get the root cert and tal over from ca0.rpki.net to the newca0? They should already be in the database on newca0

i started over from apt-get purge. otherwise i had a really half-assed system, roas for resources i did not have etc.

can't i just extract only them from the pickle or some hack like that?

Trac comment by randy on 2016-05-09T13:47:36Z

sraustein commented 8 years ago

so do i o go through a 'normal' rooted creation and then also import? o or import and then go through a root creation? o or just import?

and do i delete all the imported dangling roas etc. and then allocate resources from scratch, node-below-root down?

and do i detele the imported children and recreate them?

Trac comment by randy on 2016-05-09T14:20:50Z

sraustein commented 8 years ago

to be clear, why i want clear instructions

ca0 is in deep crap with a broken database. this is painful. importing it into a clean new ca, newca0, sounded like a plan until it didn't. and having half a load of crap from a known broken database feels like a weak recipe for a nice new start.

but operationally i need the root cert to be the same.

randy

Trac comment by randy on 2016-05-09T14:37:18Z

sraustein commented 8 years ago

i started over from apt-get purge. otherwise i had a really half-assed system, roas for resources i did not have etc.

No, we don't know what you had, just what the GUI told you. You never got as far as examining the CA's real output.

In general, rpkid will not tolerate resource inconsistencies, so if the translation script had somehow created the kind of mess you're describing, rpkid probably would have whacked the result into a consistent (albeit probably sad) state within a few cron cycles.

can't i just extract only them from the pickle or some hack like that?

Not sure which "only them" you're talking about, but the answer is "probably not.

ca0 is in deep crap with a broken database.

Note that the failure mode we're seeing at the moment on ca0 is a database we can't update, ie, if this is the original problem, the contents are stale but probably otherwise internally consistent.

Of course there might be other problems, but we don't know that.

but operationally i need the root cert to be the same.

ca-unpickle is the tool we have for this.

Trac comment by sra on 2016-05-09T17:16:29Z

sraustein commented 8 years ago

chris wants to rebuild bpki relationship with a new server on his side. ca0.rpki.net is sorely broken. newca0.rpki.net does not have the same root cert. see comments 20 & 21.

Trac comment by randy on 2016-05-19T03:18:43Z

sraustein commented 8 years ago

I already wrote a script that attempts the migration you claim you want. If I understand correctly, you refuse to use it, on the grounds that the result the one time you tried it "looked half-assed", so you blew away the result before we had time to figure out what the problem was, how close it had gotten to doing the right thing, whether the weirdness you saw in the GUI indicated a fundamental problem or was just something needing manual refresh, etc.

Please also note that you're trying a only-slightly-untested migration process on just about the hardest possible case: the CA from which you are migrating is a root with a widely distributed TAL, and the CA in question is so badly broken that you don't trust its data even when we do try to salvage it.

I would respectfully suggest that expecting "clear instructions" on how to handle a mess like this is unreasonable.

Trac comment by sra on 2016-05-19T03:36:14Z

sraustein commented 8 years ago

Attempting to be more constructive: what's the minimum you need done in order to solve your problem? How much of the old data do you really need, and how much were you just going to recreate anyway?

Trac comment by sra on 2016-05-19T03:45:44Z

sraustein commented 8 years ago

Attempting to be more constructive: what's the minimum you need done in order to solve your problem? How much of the old data do you really need, and how much were you just going to recreate anyway?

the root CA so the credentials stay the same, TA

Trac comment by randy on 2016-05-19T07:20:33Z

sraustein commented 8 years ago

give the root ca, i can enter the rest by hand. and the only child, google wants to re-associate anyway.

Trac comment by randy on 2016-05-19T07:24:25Z

sraustein commented 8 years ago

Please also note that you're trying a only-slightly-untested migration process on just about the hardest possible case: the CA from which you are migrating is a root with a widely distributed TAL, and the CA in question is so badly broken that you don't trust its data even when we do try to salvage it.

i fully agree

I would respectfully suggest that expecting "clear instructions" on how to handle a mess like this is unreasonable.

i actually just need the TA[L]. and i have the root key etc. can we use a hack of the old generate-root-certificate script?

Trac comment by randy on 2016-05-19T07:27:37Z

sraustein commented 8 years ago

an we use a hack of the old generate-root-certificate script?

and just wind the prng back in time to when the TA was generated :)

Trac comment by randy on 2016-05-19T08:12:23Z

sraustein commented 8 years ago

OK, that's certainly a more constrained problem space. Still requires some custom code, but a lot less.

I think the general approach there would be to do a clean install of the new (tk705) code, create a root parent object (rpkic configure_root), then shut down the daemons and run a magic script which whacks a few fields in the newly-created CADetail to stuff in the salvaged root certificate and root keypair and arrange everything will look right the next time this CADetail issues something. Then cross fingers, restart servers, and maybe issue a few rpkic commands intended to flush the changes through without doing anything too drastic.

Note that this is still surgery on the internals of rpkid's database. Laparoscopic surgery, but still surgery.

The critical inputs to this from the old system are just the RPKI root certificate and the RPKI root key, both of which are plain files in the rootd universe.

Not sure I'm going to get to this today, and if I don't, I sure won't get to it tomorrow (offline most of the day), but soon-ish.

can we use a hack of the old generate-root-certificate script?

Only if you want to stick with rootd. Didn't think so.

Trac comment by sra on 2016-05-19T12:31:03Z

sraustein commented 8 years ago

this sounds an excellent plan. i will start the rebuild now.

Trac comment by randy on 2016-05-19T14:00:49Z

sraustein commented 8 years ago

what identity and publication dance do i perform?

{{{ rpkic create_identity altCA }}}

and then skip to {{{ rpkic configure_root }}}

and then {{{ rpkic configure_parent ... rpkic configure_publication_client ... rpkic configure_repository ... }}}

then run the receive side of your forthcoming hack

then {{{ rpkic extract_root_certificate rpkic extract_root_tal }}}

and bob's your uncle?

Trac comment by randy on 2016-05-19T14:33:36Z

sraustein commented 8 years ago

what identity and publication dance do i perform?

{{{ rpkic create_identity altCA }}}

and then skip to {{{ rpkic configure_root }}}

and then {{{ rpkic configure_parent ... rpkic configure_publication_client ... rpkic configure_repository ... }}}

then run the receive side of your forthcoming hack

then {{{ rpkic extract_root_certificate rpkic extract_root_tal }}}

and bob's your uncle?

Something like that. Not sure yet exactly where in this sequence will want to run the magic script, might be where you put it, might be just after configure_parent, might not care which of those locations.

Trac comment by sra on 2016-05-20T02:38:11Z

sraustein commented 8 years ago

In [changeset:"6438" 6438]: {{{

!CommitTicketReference repository="" revision="6438"

New script to install an RPKI root key salvaged from an old rootd installation on top of the active CADetail of a new-style rootd-less rpkid configuration.

This has been tested, but only lightly, and only in the lab.

This script is dangerous. Do not run it unless you really know what you are doing, and even then you probably do not want to run it on anything but a brand new installation with no existing RPKI data.

See #816. }}}

Trac comment by sra on 2016-05-25T03:16:50Z

sraustein commented 8 years ago

New hack is https:/subvert-rpki.hactrn.net/branches/tk705/potpourri/whack-rpki-root-key.py

It takes the RPKI root key you want to install as an argument, and insists on an extra argument to confirm that you really want to do this. Use --help for details.

Note that this is taking the root //key//, not the old root //certificate//. This is deliberate. For purposes of keeping an old TAL alive, the only things you care about are the public key and the URI of the root certificate. The later (URI) you may have to hack via rsyncd.conf and DNS games, but it's beyond the scope of anything this script can touch in any case. So we just preserve the root key, and regenerate everything else in place.

By any normal PKI standards this is demented: we reuse the certificate serial numbers of the root and manifest certificates, we reuse the CRL and manifest serial numbers, we reuse all the timestamps. The intent is that //only// the key and things derived directly from the key (mostly g(SKI)-derived URIs) change. For the intended usage, we can get away with all this because nobody is using anything the CA publishes yet, so nobody will care if we stomp on those data.

Aside from keeping the script minimal for its own sake, this approach has one other important benefit: at the SQL level, we're updating exactly one row in exactly one table. This simplifies things considerably, and, since the update happens at the very end of the script, in theory we never get as far as the update unless everything else appears to have gone correctly.

The script assumes that there will be exactly one Parent object with root attributes (created by rpkic configure_root); in theory, the SQL lookup will blow out with an error if this assumption is wrong, but this has not been tested.

The script has an optional backup feature to let you save a copy of the old key it's about to stomp, before overwriting that key with the one you supply. Use it or not as amuses you.

Assuming the script runs to completion without obvious error, the next step is to restart the daemons, then use rpkic force_publication to tell rpkid that it should resync with pubd. This should clean up all traces of the former key, publish the new stuff, and all should be well. It appeared to work when I tried it.

Once you've done that, you should be OK to extract the new root certificate and TAL, then proceed normally with adding children.

Trac comment by sra on 2016-05-25T03:56:56Z

sraustein commented 8 years ago

will do this today, after coffee. thanks!

Trac comment by randy on 2016-05-25T05:05:10Z

sraustein commented 8 years ago

{{{ oldca0.rpki.net:/root# ./whack-rpki-root-key.py --help Traceback (most recent call last): File "./whack-rpki-root-key.py", line 47, in cfg = rpki.config.argparser(doc = doc) AttributeError: 'module' object has no attribute 'argparser' }}}

Trac comment by randy on 2016-05-27T07:23:54Z

sraustein commented 8 years ago

works on ca0 (xenial) not oldca0 (freebsd)

Trac comment by randy on 2016-05-27T07:54:04Z

sraustein commented 8 years ago

Not expected to work on oldca. What would it do there?

Sent from my phone. Please excuse brevity and typos.

Trac comment by sra on 2016-05-27T10:57:00Z

sraustein commented 8 years ago

I think this is now fixed, for a sufficiently lenient (or at least loosely scoped) definition of "fixed".

Closing, reopen if there's work still to do on this.

Trac comment by sra on 2016-06-08T22:21:57Z

sraustein commented 8 years ago

Closed with resolution fixed