dragonresearch / rpki.net

Dragon Research Labs rpki.net RPKI toolkit
54 stars 26 forks source link

labuser05 blows chunks #693

Closed sraustein closed 8 years ago

sraustein commented 10 years ago

during workshop, labuser05 had no resources. if i tried to add, server error. so i deleted labuer05 but there seemed to be no way to create a user from workshop account

Trac ticket #681 component gui priority blocker, owner melkins, created by randy on 2014-02-20T09:33:56Z, last modified 2016-07-18T10:45:01Z

sraustein commented 10 years ago

error.log.gz apache error log Trac attachment by randy on 2016-07-18T10:45:01Z

sraustein commented 10 years ago

{{{ sra@thrintun> host vmini.rpki.net vmini.rpki.net has address 10.10.0.242 }}}

Trac comment by sra on 2014-02-20T13:46:09Z

sraustein commented 10 years ago

yes, we are natted here. sorry.

Trac comment by randy on 2014-02-20T14:11:20Z

sraustein commented 10 years ago

yes, we are natted here. sorry.

Understood, no blame. Just: If nobody can get in to help you debug, you're on your own. Sorry.

Most useful thing I can think of in terms of preserving evidence for later analysis would be a dump of all the SQL data. In theory, running:

{{{ ca0.vmini.rpki.net# rpki-sql-backup >/root/wtf.sql }}}

should do it, haven't tested recently, if that fails there's always running mysqldump manually.

Secondary evidence you might want to preserve if you can are the publication tree from ca0.vmini (/usr/share/rpki/publication on Ubuntu, IIRC) and the /var/rcynic tree from cache0.vmini. In theory, everything relevant in those is already in the SQL data, but....

Trac comment by sra on 2014-02-20T14:22:35Z

sraustein commented 10 years ago

daytime tomorrow when i can get to the box

we should do a reverse tunnel hack for this use

Trac comment by randy on 2014-02-20T14:25:30Z

sraustein commented 10 years ago

we should do a reverse tunnel hack for this use

Probably can piggy-back on existing BGP tunnel hack.

Trac comment by sra on 2014-02-20T14:45:28Z

sraustein commented 10 years ago

On 02/20/2014 01:33 AM, Trac Ticket System wrote:

681: labuser05 blows chunks

--------------------+------------------ Reporter: randy | Owner: Type: defect | Status: new Priority: blocker | Component: gui Keywords: | Blocked By: Blocking: | --------------------+------------------ during workshop, labuser05 had no resources. if i tried to add, server error. so i deleted labuer05 but there seemed to be no way to create a user from workshop account

You should be able to recreate labuser05 by doing the following (as a superuser logged into the GUI):

  1. click on 'resource holders' on sidebar
  2. click on 'create' button
  3. enter 'labuser05' in handle field, select 'rgnet' in 'parent' drop-down box
  4. click the 'save' button
  5. click on 'web users' on sidebar
  6. click the 'create' button
  7. enter 'labuser05' in the 'username' field
  8. enter your favorite password in the 'password' field (this is what the user will log in with, not superuser password)
  9. re-enter the password
  10. select 'labuser05' from the listbox labelled 'resource holders'
  11. click the save button.

Now the rgnet resource holder can delegate some resources to labuser05.

Trac comment by melkins on 2014-02-20T16:56:20Z

sraustein commented 10 years ago

ca0.vmini.rpki.net# rpki-sql-backup >/root/wtf.sql

http://archive.psg.com/wtf.sql.gz

Secondary evidence you might want to preserve if you can are the publication tree from ca0.vmini

it publishes on ca0.rpki.net

randy

Trac comment by randy on 2014-02-21T02:43:41Z

sraustein commented 10 years ago

{{{ [Thu Feb 20 09:32:17 2014] [error] ERROR 2014-02-20 09:32:17,862 django.request Internal Server Error: /rpki/refresh [Thu Feb 20 09:32:17 2014] [error] Traceback (most recent call last): [Thu Feb 20 09:32:17 2014] [error] File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 115, in get_response [Thu Feb 20 09:32:17 2014] [error] response = callback(request, _callback_args, _callback_kwargs) [Thu Feb 20 09:32:17 2014] [error] File "/usr/lib/python2.7/dist-packages/django/contrib/auth/decorators.py", line 25, in _wrapped_view [Thu Feb 20 09:32:17 2014] [error] return view_func(request, _args, _kwargs) [Thu Feb 20 09:32:17 2014] [error] File "/usr/lib/python2.7/dist-packages/rpki/gui/decorators.py", line 30, in _tls_required [Thu Feb 20 09:32:17 2014] [error] return f(request, _args, _kwargs) [Thu Feb 20 09:32:17 2014] [error] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/views.py", line 94, in wrapped_fn [Thu Feb 20 09:32:17 2014] [error] return f(request, _args, _kwargs) [Thu Feb 20 09:32:17 2014] [error] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/views.py", line 934, in refresh [Thu Feb 20 09:32:17 2014] [error] request.session['handle']) [Thu Feb 20 09:32:17 2014] [error] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/glue.py", line 83, in list_received_resources [Thu Feb 20 09:32:17 2014] [error] for pdu in pdus: [Thu Feb 20 09:32:17 2014] [error] TypeError: 'NoneType' object is not iterable }}}

In rpki.gui.app.glue.list_received_resources, the call to Zookeeper.call_rpkid() is apparently returning None, rather than a list of result pdus: {{{ z = Zookeeper(handle=conf.handle) pdus = z.call_rpkid(list_received_resources_elt.make_pdu(self_handle=conf.handle)) }}}

Zookeeper.call_rpkid() itself just a wrapper: {{{ call_rpkid = rpki.async.sync_wrapper(rpki.http.caller( proto = rpki.left_right, client_key = irbe.private_key, client_cert = irbe.certificate, server_ta = self.server_ca.certificate, server_cert = rpkid.certificate,
url = url, debug = self.show_xml))

return call_rpkid(*pdus)

}}}

rpki.async.sync_wrapper will potentially raise an exception on error, otherwise it returns the result of the callback, which in this case must really be None: {{{ def call(self, _args, *_kwargs):

def thunk():
  try:
    self.func(self.cb, self.eb, *args, **kwargs)
  except ExitNow:
    raise
  except Exception, e:
    self.eb(e)

event_defer(thunk)
event_loop()
if self.err is None:
  return self.res
elif isinstance(self.err, tuple):
  raise self.err[0], self.err[1], self.err[2]
else:
  raise self.err

}}}

I'm a little stumped as to why this is happening. Currently there is no error check in my code prior to removing the existing resources, which is definitely a bug: {{{ z = Zookeeper(handle=conf.handle) pdus = z.call_rpkid(list_received_resources_elt.make_pdu(self_handle=conf.handle))

models.ResourceCert.objects.filter(conf=conf).delete()

}}}

I need to ensure that pdus is not None and that there are no error pdus present.

Trac comment by melkins on 2014-03-10T17:09:47Z

sraustein commented 10 years ago

An alternative would be to decorate rpki.gui.app.glue.list_received_resources() with @commit_on_success, so that any changes would be rolled back if an exception is raised. This may be the cleaner solution.

Trac comment by melkins on 2014-03-10T17:20:14Z

sraustein commented 10 years ago

In [changeset:"5700"]: {{{

!CommitTicketReference repository="" revision="5700"

Rollback deletion of cached resource certs when there is an error communicating with rpkid. see #681 }}}

Trac comment by melkins on 2014-03-10T17:55:00Z

sraustein commented 10 years ago

will not be able to test soon, still in transit. but thanks for working this one.

Trac comment by randy on 2014-03-10T19:15:12Z

sraustein commented 10 years ago

rpki.async.sync_wrapper will potentially raise an exception on error, otherwise it returns the result of the callback, which in this case must really be None: ... I'm a little stumped as to why this is happening.

Likewise. Empty sequence would be plausible, but None is unexpected.

Trac comment by sra on 2014-03-10T19:30:23Z

sraustein commented 10 years ago

ok, vmini is on net on bainbridge

Trac comment by randy on 2014-03-15T01:13:54Z

sraustein commented 10 years ago

On Sat, Mar 15, 2014 at 01:13:54AM -0000, Trac Ticket System wrote:

ok, vmini is on net on bainbridge

I can ssh to ca0.vmini, but the https service does not seem to be forwarded to that virtual machine?

Trac comment by melkins on 2014-03-18T17:19:05Z

sraustein commented 10 years ago

I can ssh to ca0.vmini, but the https service does not seem to be forwarded to that virtual machine?

have you tried https://vmini.rg.net?

Trac comment by randy on 2014-03-18T17:27:02Z

sraustein commented 10 years ago

On Tue, Mar 18, 2014 at 05:27:02PM -0000, Trac Ticket System wrote:

have you tried https://vmini.rg.net?

Firefox can't find the server at vmini.rg.net.

Trac comment by melkins on 2014-03-18T17:31:56Z

sraustein commented 10 years ago

On Tue, Mar 18, 2014 at 05:31:56PM -0000, Trac Ticket System wrote:

On Tue, Mar 18, 2014 at 05:27:02PM -0000, Trac Ticket System wrote:

have you tried https://vmini.rg.net?

Firefox can't find the server at vmini.rg.net.

$ host vmini.rpki.net vmini.rpki.net has address 50.135.235.34 $ host vmini.rg.net Host vmini.rg.net not found: 3(NXDOMAIN) $ host ca0.vmini.rpki.net ca0.vmini.rpki.net has address 50.135.235.34

Trac comment by melkins on 2014-03-18T17:35:59Z

sraustein commented 10 years ago

Firefox can't find the server at vmini.rg.net.

can't find server, as in ip addy? or can't find 443 service?

xinitd on vmini would seem to redirect properly

service ca0-https { flags = NORETRY type = UNLISTED socket_type = stream protocol = tcp port = 443 wait = no user = root redirect = 10.0.0.2 443 }

i am stuck 'inside' so can not properly test. and rob is dealing with far more serious matters.

lemme see if i can proxy, tunnel, or something.

Trac comment by randy on 2014-03-18T17:36:09Z

sraustein commented 10 years ago

oh sigh. so try https://vmini.rpki.net/

Trac comment by randy on 2014-03-18T17:37:22Z

sraustein commented 10 years ago

On Tue, Mar 18, 2014 at 05:36:10PM -0000, Trac Ticket System wrote:

Comment (by randy):

Firefox can't find the server at vmini.rg.net.

can't find server, as in ip addy? or can't find 443 service?

vmini.rg.net gets a NXDOMAIN. vmini.rpki.net and ca0.vmini.rpki.net get connection refused:

$ telnet ca0.vmini.rpki.net 443 Trying 50.135.235.34... telnet: Unable to connect to remote host: Connection refused

i am stuck 'inside' so can not properly test. and rob is dealing with far more serious matters.

lemme see if i can proxy, tunnel, or something.

OK, not urgent.

Trac comment by melkins on 2014-03-18T17:38:14Z

sraustein commented 10 years ago

and rob is dealing with far more serious matters.

Putting the rear wheel back on my mother's bicycle, yeah.

Trac comment by sra on 2014-03-18T18:22:46Z

sraustein commented 10 years ago

tcptraceroute from ashburn says that the https port is blocked. At a guess, by your edge router in bainbridge or whatever.

{{{ Tracing the path to ca0.vmini.rpki.net (50.135.235.34) on TCP port 443, 30 hops max 1 r0.iad.rg.net (198.180.150.120) 9.825 ms 9.484 ms 9.979 ms 2 r1.iad.rg.net (198.180.150.121) 9.966 ms 9.611 ms 9.980 ms 3 144.232.18.81 (144.232.18.81) 9.983 ms 9.682 ms 9.981 ms 4 sl-crs1-dc-0-6-0-2.sprintlink.net (144.232.20.135) 9.985 ms 9.621 ms 9.982 ms 5 sl-st30-ash-0-10-0-0.sprintlink.net (144.232.25.224) 10.007 ms 9.126 ms 9.979 ms 6 144.232.11.92 (144.232.11.92) 9.986 ms 9.334 ms 9.983 ms 7 144.232.6.98 (144.232.6.98) 9.984 ms 9.685 ms 9.981 ms 8 be-26-cr01.ashburn.va.ibone.comcast.net (68.86.82.169) 9.984 ms 9.969 ms 9.973 ms 9 he-4-6-0-0-cr01.56marietta.ga.ibone.comcast.net (68.86.89.162) 19.403 ms 19.476 ms 19.947 ms 10 68.86.88.238 (68.86.88.238) 39.585 ms 39.584 ms 39.957 ms 11 he-0-13-0-0-cr01.losangeles.ca.ibone.comcast.net (68.86.85.145) 69.946 ms 71.555 ms 72.071 ms 12 he-2-12-0-0-cr01.sanjose.ca.ibone.comcast.net (68.86.89.217) 81.804 ms 83.513 ms 83.937 ms 13 68.86.91.78 (68.86.91.78) 78.684 ms 78.734 ms 78.689 ms 14 * * * 15 te-1-0-0-ten03.bremerton.wa.seattle.comcast.net (68.86.98.122) 80.206 ms 80.276 ms 80.191 ms 16 c-50-135-235-34.hsd1.wa.comcast.net (50.135.235.34) [closed] 91.662 ms 91.734 ms 93.920 ms }}}

If you ssh to vmini.rpki.net, w3m https://ca0.vmini.rpki.net/ works, netstat --inet -a shows that we're listening to the correct port, and w3m https://localhost/ suggests that the xinetd connection forwarding is active, although the hostname mismatch will leave things hopelessly confused in the last case.

Trac comment by sra on 2014-03-18T18:31:23Z

sraustein commented 10 years ago

please try again

Trac comment by randy on 2014-03-18T23:28:14Z

sraustein commented 10 years ago

On Tue, Mar 18, 2014 at 11:28:14PM -0000, Trac Ticket System wrote:

please try again

Now port 443 is open, but I get a HTTP 301 redirect back to port 80, which gets me connection refused.

Trac comment by melkins on 2014-03-19T18:07:04Z

sraustein commented 10 years ago

sigh. again, please

Trac comment by randy on 2014-03-19T18:09:41Z

sraustein commented 10 years ago

sigh. again, please

no dice:

$ telnet cache0.vmini.rpki.net 80 Trying 50.135.235.34... telnet: Unable to connect to remote host: Connection refused

Trac comment by melkins on 2014-03-19T18:10:56Z

sraustein commented 10 years ago

Now port 443 is open, but I get a HTTP 301 redirect back to port 80, which gets me connection refused.

I suspect wrong DNS name.

https://cache0.vmini.rpki.net/ redirects to http://cache0.vmini.rpki.net/

http://ca0.vmini.rpki.net/ redirects to https://ca0.vmini.rpki.net/

This is all much more complex than any of us would have liked, but the root cause is the need to make this all work on vmini's one and only external IP address when it's in a lab environment.

Trac comment by sra on 2014-03-19T18:45:50Z

sraustein commented 10 years ago

On 03/19/2014 11:45 AM, Trac Ticket System wrote:

Now port 443 is open, but I get a HTTP 301 redirect back to port 80, which gets me connection refused.

I suspect wrong DNS name.

https://cache0.vmini.rpki.net/ redirects to http://cache0.vmini.rpki.net/

http://ca0.vmini.rpki.net/ redirects to https://ca0.vmini.rpki.net/

Good catch, I should have been hitting ca0.vmini and the redirection for that site does work.

Trac comment by melkins on 2014-03-19T18:47:52Z

sraustein commented 10 years ago

vmini is soon to pack up to go from bainbridge to tokyo. back on your friday or so.

Trac comment by randy on 2014-03-20T00:12:45Z

sraustein commented 10 years ago

vmini is up back home in tokyo

Trac comment by randy on 2014-03-22T12:13:05Z

sraustein commented 10 years ago

i just realized that melkins may not know the hack to get into *.vmini when it is behind my home gateway

in ~/.ssh/conf {{{ Host vmini* Hostname vmini.rpki.net Port 42022 ControlMaster no

Host ca0.vmini* Hostname vmini.rpki.net Port 42023 ControlMaster no

Host cache0.vmini* Hostname vmini.rpki.net Port 42024 ControlMaster no }}}

Trac comment by randy on 2014-04-08T23:43:08Z

sraustein commented 10 years ago

Replying to [comment:9 melkins]:

In rpki.gui.app.glue.list_received_resources, the call to Zookeeper.call_rpkid() is apparently returning None, rather than a list of result pdus: ... I'm a little stumped as to why this is happening.

Likewise. The wrapper code looks pretty tight, although of course I could be missing something.

The only path I could see that might cause this would be if your process were to receive a SIGHUP or SIGTERM while waiting for the answer from rpkid.

Guess I could add something to the sync_wrapper class to record whether either of the exit callback methods was ever invoked.

Trac comment by sra on 2014-07-11T04:29:31Z

sraustein commented 10 years ago

In [changeset:"5890"]: {{{

!CommitTicketReference repository="" revision="5890"

Add check to log a warning if sync_wrapper returns without callback. See #681. }}}

Trac comment by sra on 2014-07-11T16:59:28Z

sraustein commented 9 years ago

THIS PROBLEM PERSISTS AND BLEW A WORKSHOP LAB

labuser05 had no resources. tried to export workshop's prefixes, manually add labuser05, and re-import. internal server error.

Trac comment by randy on 2014-11-19T10:35:07Z

sraustein commented 9 years ago

Replying to [comment:35 randy]:

THIS PROBLEM PERSISTS AND BLEW A WORKSHOP LAB

labuser05 had no resources. tried to export workshop's prefixes, manually add labuser05, and re-import. internal server error.

This looks like a combination of the bug for this ticket and other bugs (see #734 #735 #736).

These log messages indicate that the problem I reported earlier in this ticket are still present. Namely, Zookeeper.call_rpkid() with a list_received_resources PDU sometimes returns None unexpectedly, rather than the list of resources. This no long causes an exception to be thrown, but it does mean the GUI has no idea what resources a holder actually has, and falls back on data cached in the django DB. {{{ [Wed Nov 19 09:11:31.130046 2014] [:error] [pid 12838:tid 2962180928] [client 10.0.0.1:34363] error: call_rpkid() returned None for handle labuser08 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 09:12:04.399980 2014] [:error] [pid 12838:tid 2886646592] [client 10.0.0.1:34382] error: call_rpkid() returned None for handle labuser05 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 09:15:15.965334 2014] [:error] [pid 12837:tid 3046107968] [client 10.0.0.1:34490] error: call_rpkid() returned None for handle labuser03 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 09:16:10.760989 2014] [:error] [pid 12837:tid 2903432000] [client 10.0.0.1:34521] error: call_rpkid() returned None for handle labuser06 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 09:16:47.372733 2014] [:error] [pid 12837:tid 2970573632] [client 10.0.0.1:34540] error: call_rpkid() returned None for handle labuser11 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:22:07.714532 2014] [:error] [pid 12837:tid 3020929856] [client 10.0.0.1:34675] error: call_rpkid() returned None for handle labuser05 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:24:45.936055 2014] [:error] [pid 12837:tid 3012537152] [client 10.0.0.1:34691] error: call_rpkid() returned None for handle labuser13 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:31:43.798281 2014] [:error] [pid 12837:tid 2895039296] [client 10.0.0.1:34743] error: call_rpkid() returned None for handle labuser16 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:31:47.432379 2014] [:error] [pid 12837:tid 2895039296] [client 10.0.0.1:34743] error: call_rpkid() returned None for handle labuser16 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:47:31.325763 2014] [:error] [pid 12837:tid 2895039296] [client 10.0.0.1:34905] error: call_rpkid() returned None for handle labuser08 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:47:35.243230 2014] [:error] [pid 12837:tid 2895039296] [client 10.0.0.1:34905] error: call_rpkid() returned None for handle labuser08 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:47:37.189831 2014] [:error] [pid 12837:tid 2895039296] [client 10.0.0.1:34905] error: call_rpkid() returned None for handle labuser08 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:51:07.512430 2014] [:error] [pid 12837:tid 2945395520] [client 10.0.0.1:34941] error: call_rpkid() returned None for handle labuser04 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:52:24.503128 2014] [:error] [pid 12837:tid 2886646592] [client 10.0.0.1:34958] error: call_rpkid() returned None for handle labuser04 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 10:52:26.458218 2014] [:error] [pid 12837:tid 2886646592] [client 10.0.0.1:34958] error: call_rpkid() returned None for handle labuser04 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ [Wed Nov 19 11:06:00.472249 2014] [:error] [pid 12837:tid 3020929856] [client 10.0.0.1:34984] error: call_rpkid() returned None for handle labuser10 when fetching received resources, referer: https://ca0.vmini.rpki.net/rpki/ }}}

I am unable to trigger the problem with irbe_client, as it seems to always return data.

Looking at rpki.async.sync_wrapper, I should probably enable logging for rpki.async to see if rpki.async.sync_wrapper.cb() is ever invoked. Maybe that will give me some additional clue.

The only other error messages I'm seeing in the apache log are mod_wsgi reporting that it is ignoring rpki.async's request to trip some UNIX signals:

{{{ [Wed Nov 19 05:29:01.290630 2014] [:warn] [pid 12837:tid 3020929856] mod_wsgi (pid=12837): Callback registration for signal 2 ignored. [Wed Nov 19 05:29:01.313821 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/core/handlers/wsgi.py", line 206, in call [Wed Nov 19 05:29:01.313861 2014] [:warn] [pid 12837:tid 3020929856] response = self.get_response(request) [Wed Nov 19 05:29:01.313880 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response [Wed Nov 19 05:29:01.313895 2014] [:warn] [pid 12837:tid 3020929856] response = wrapped_callback(request, _callback_args, _callback_kwargs) [Wed Nov 19 05:29:01.313914 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/contrib/auth/decorators.py", line 22, in _wrapped_view [Wed Nov 19 05:29:01.313930 2014] [:warn] [pid 12837:tid 3020929856] return view_func(request, _args, _kwargs) [Wed Nov 19 05:29:01.313948 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/decorators.py", line 30, in _tls_required [Wed Nov 19 05:29:01.313964 2014] [:warn] [pid 12837:tid 3020929856] return f(request, _args, _kwargs) [Wed Nov 19 05:29:01.313981 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/views.py", line 104, in wrapped_fn [Wed Nov 19 05:29:01.313996 2014] [:warn] [pid 12837:tid 3020929856] return f(request, _args, _kwargs) [Wed Nov 19 05:29:01.314013 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/views.py", line 930, in refresh [Wed Nov 19 05:29:01.314029 2014] [:warn] [pid 12837:tid 3020929856] glue.list_received_resources(request.META['wsgi.errors'], conf) [Wed Nov 19 05:29:01.314045 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/db/transaction.py", line 399, in inner [Wed Nov 19 05:29:01.314060 2014] [:warn] [pid 12837:tid 3020929856] return func(_args, _kwargs) [Wed Nov 19 05:29:01.314077 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/glue.py", line 93, in list_received_resources [Wed Nov 19 05:29:01.314093 2014] [:warn] [pid 12837:tid 3020929856] pdus = z.call_rpkid(list_received_resources_elt.make_pdu(self_handle=conf.handle)) [Wed Nov 19 05:29:01.314110 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/irdb/zookeeper.py", line 1071, in call_rpkid [Wed Nov 19 05:29:01.314125 2014] [:warn] [pid 12837:tid 3020929856] return call_rpkid(pdus) [Wed Nov 19 05:29:01.314142 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/async.py", line 377, in call [Wed Nov 19 05:29:01.314157 2014] [:warn] [pid 12837:tid 3020929856] event_loop() [Wed Nov 19 05:29:01.314174 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/async.py", line 290, in event_loop [Wed Nov 19 05:29:01.314189 2014] [:warn] [pid 12837:tid 3020929856] old = signal.signal(sig, _raiseExitNow) [Wed Nov 19 05:29:01.314217 2014] [:warn] [pid 12837:tid 3020929856] mod_wsgi (pid=12837): Callback registration for signal 15 ignored. [Wed Nov 19 05:29:01.314487 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/core/handlers/wsgi.py", line 206, in call [Wed Nov 19 05:29:01.314543 2014] [:warn] [pid 12837:tid 3020929856] response = self.get_response(request) [Wed Nov 19 05:29:01.314560 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response [Wed Nov 19 05:29:01.314575 2014] [:warn] [pid 12837:tid 3020929856] response = wrapped_callback(request, _callback_args, _callback_kwargs) [Wed Nov 19 05:29:01.314590 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/contrib/auth/decorators.py", line 22, in _wrapped_view [Wed Nov 19 05:29:01.314605 2014] [:warn] [pid 12837:tid 3020929856] return view_func(request, _args, _kwargs) [Wed Nov 19 05:29:01.314621 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/decorators.py", line 30, in _tls_required [Wed Nov 19 05:29:01.314634 2014] [:warn] [pid 12837:tid 3020929856] return f(request, _args, _kwargs) [Wed Nov 19 05:29:01.314651 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/views.py", line 104, in wrapped_fn [Wed Nov 19 05:29:01.314666 2014] [:warn] [pid 12837:tid 3020929856] return f(request, _args, _kwargs) [Wed Nov 19 05:29:01.314684 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/views.py", line 930, in refresh [Wed Nov 19 05:29:01.314699 2014] [:warn] [pid 12837:tid 3020929856] glue.list_received_resources(request.META['wsgi.errors'], conf) [Wed Nov 19 05:29:01.314713 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/django/db/transaction.py", line 399, in inner [Wed Nov 19 05:29:01.314725 2014] [:warn] [pid 12837:tid 3020929856] return func(_args, _kwargs) [Wed Nov 19 05:29:01.314738 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/gui/app/glue.py", line 93, in list_received_resources [Wed Nov 19 05:29:01.314753 2014] [:warn] [pid 12837:tid 3020929856] pdus = z.call_rpkid(list_received_resources_elt.make_pdu(self_handle=conf.handle)) [Wed Nov 19 05:29:01.314770 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/irdb/zookeeper.py", line 1071, in call_rpkid [Wed Nov 19 05:29:01.314783 2014] [:warn] [pid 12837:tid 3020929856] return call_rpkid(pdus) [Wed Nov 19 05:29:01.314799 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/async.py", line 377, in call [Wed Nov 19 05:29:01.314813 2014] [:warn] [pid 12837:tid 3020929856] event_loop() [Wed Nov 19 05:29:01.314828 2014] [:warn] [pid 12837:tid 3020929856] File "/usr/lib/python2.7/dist-packages/rpki/async.py", line 290, in event_loop [Wed Nov 19 05:29:01.314843 2014] [:warn] [pid 12837:tid 3020929856] old = signal.signal(sig, _raiseExitNow) }}}

but I don't see any obvious sign that this is causing the problem reported here.

Trac comment by melkins on 2014-11-19T18:28:22Z

sraustein commented 9 years ago

In [changeset:"6026"]: {{{

!CommitTicketReference repository="" revision="6026"

enable logging rpki.async to the apache log file. see #681 }}}

Trac comment by melkins on 2014-11-19T18:36:53Z

sraustein commented 9 years ago

The signal.signal() calls are setting up signal handlers for SIGTERM etc so that signals sent to this process will cause a clean exit from the asynchronous I/O system. In theory nothing bad should happen if they aren't set and nobody sends those signals.

Looking at the context in which this is happening (rpki.async.event_loop()), it looks like the exception you're seeing while trying to set the signal handler will cause event_loop() to exit without ever running the event loop itself, which may be the real problem.

Things this theory does not explain:

If we think that mod_wsgi really just does not want us to set signal handlers, it would be easy enough to add an option somewhere in the rpki.async.sync_wrapper() class which passes an empty tuple of signal numbers to event_loop().

Does the mod_wsgi doc have any comment on signal handlers?

Trac comment by sra on 2014-11-19T19:02:32Z

sraustein commented 9 years ago

On 11/19/2014 11:02 AM, Trac Ticket System wrote:

The signal.signal() calls are setting up signal handlers for SIGTERM etc so that signals sent to this process will cause a clean exit from the asynchronous I/O system. In theory nothing bad should happen if they aren't set and nobody sends those signals.

Looking at the context in which this is happening (rpki.async.event_loop()), it looks like the exception you're seeing while trying to set the signal handler will cause event_loop() to exit without ever running the event loop itself, which may be the real problem.

Things this theory does not explain:

  • Why does this only happen sometimes rather than for all GUI instances?
  • Why is there nothing in the log from the catch-all logger.exception() call towards the end of this function?

I've only selectively enabled logging in rpki.gui.default_settings.py. Just pushed a change to turn it on for rpki.async. Are there any other modules that should be enabled here as well?

https://trac.rpki.net/changeset/6026

If we think that mod_wsgi really just does not want us to set signal handlers, it would be easy enough to add an option somewhere in the rpki.async.sync_wrapper() class which passes an empty tuple of signal numbers to event_loop().

Does the mod_wsgi doc have any comment on signal handlers?

https://code.google.com/p/modwsgi/wiki/ApplicationIssues#Registration_Of_Signal_Handlers

Trac comment by melkins on 2014-11-19T19:18:27Z

sraustein commented 9 years ago

In [changeset:"6027"]: {{{

!CommitTicketReference repository="" revision="6027"

Add disable_signal_handlers option to rpki.async.sync_wrapper and rpki.irdb.Zookeeper() to avoid problems when running in environments like mod_wsgi which have strong opinions about what code is and is not allowed to handle signals. This may result in some noise in log files, but there's not much we can do about that if we're not allowed to clean up when terminated, so be it. See #681. }}}

Trac comment by sra on 2014-11-19T19:52:55Z

sraustein commented 9 years ago

Just committed [6027], which adds a disable_signal_handlers option to Zookeeper.__init__() and sync_wrapper.__init__(); the latter does the real work, the former is just a pass-through, since nothing but zookeeper methods ever use sync_wrapper directly anymore.

The default behavior is not to disable signal handlers, that is, to behave as currently, because this is correct for everything but the GUI.

In theory, all you should need to do to use this in the GUI is set that option when creating your Zookeeper instance, after that it should be automatic.

Be warned that disabling these signal handlers may generate some noise in the log file. I don't //think// it will do anything else bad, since the purpose of these signal handlers is mostly just to translate one kind of termination signal into another one. I don't recall whether the asyncore I/O loop will exit (cleanly or at all) without these signal handlers, but since in the GUI case the entire Apache environment is shutting down if you receive one of these, it probably doesn't matter very much, it's not like anything is going to hang about waiting for a clean shutdown.

Trac comment by sra on 2014-11-19T19:59:47Z

sraustein commented 9 years ago

I've only selectively enabled logging in rpki.gui.default_settings.py. Just pushed a change to turn it on for rpki.async. Are there any other modules that should be enabled here as well?

The code kind of assumes that anything logged at severity warning or worse will be logged, regardless of module. Certainly anything logging at severity error should be logged.

See rpki.log for how logging is configured these days in all the non-GUI code.

Trac comment by sra on 2014-11-19T20:13:46Z

sraustein commented 9 years ago

In [changeset:"6029"]: {{{

!CommitTicketReference repository="" revision="6029"

configure root logger to route WARNING or above to the appropriate handlers. see #681 }}}

Trac comment by melkins on 2014-11-19T23:11:28Z

sraustein commented 9 years ago

bottom line: o vmini is not where you can reach it now. will be monday o should i update now and try again?

Trac comment by randy on 2014-11-20T00:55:14Z

sraustein commented 9 years ago

bottom line: o vmini is not where you can reach it now. will be monday o should i update now and try again?

Not quite yet.

I added a new initialization option to the Zookeeper class to humor mod_wsgi, but that option is still off by default, because we don't want it enabled for anything except the GUI. So Michael (or somebody) has to go through the GUI code enabling that option in the cases were we need it.

Once that's done, Michael (or somebody) has had time to test the result, and the build robot has time to build new packages, it will be time for you to upgrade.

Trac comment by sra on 2014-11-20T01:05:07Z

sraustein commented 9 years ago

ok. fwiw, vmini will be shut down in about 24 hours and be down for three days.

Trac comment by randy on 2014-11-20T03:00:18Z

sraustein commented 9 years ago

As of now, the GUI will now:

The new code won't be available in the .deb packages until after tonight's build.

Trac comment by melkins on 2014-11-20T18:08:50Z

sraustein commented 9 years ago

The new code won't be available in the .deb packages until after tonight's build.

Er, there is no "tonight's build".

The build robot checks for updates every ten minutes. It does take a little while to run through all the builds, after which it uploads them all to the server at once. As I type this it's about halfway through building all the 0.6032 packages.

Trac comment by sra on 2014-11-20T18:34:57Z

sraustein commented 9 years ago

vmini is back up in tokyo. may take ten mins for dns to propagate

Trac comment by randy on 2014-11-23T10:18:14Z

sraustein commented 9 years ago

vmini is currently at rpki-ca 0.6030. It should be upgraded to the current (0.6035).

I'm trying to figure out how to appropriately test this bug. We only see it during workshops, so I'm pondering if there is some concurrency issue. My test script (~root/tk681.ssh) iterates over all the labuser accounts and doesn't seem to tickle the bug when done serially.

Trac comment by melkins on 2014-11-25T20:34:10Z