StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.04k stars 746 forks source link

Stackstorm fails to start 4 components if there are multiple PTR records associated with server #4161

Open emptywee opened 6 years ago

emptywee commented 6 years ago
ISSUE TYPE
STACKSTORM VERSION
st2 2.7.2, on Python 2.7
OS / ENVIRONMENT / INSTALL METHOD

CentOS 6, custom install

SUMMARY

If the server where you run st2 has multiple PTR records associated with, few components fail to start: st2notifier st2rulesengine st2garbagecollector (forgot which one was the 4th)

Reason is socket.fqdn() returns different name each time invoked (round robin, probably). And Stackstorm is registering options twice on start. That creates a DuplicateOptError exception.

STEPS TO REPRODUCE
1. Add one or more PTR records for the IP address of your server. So it looks like this:
$ host 10.142.18.192
192.18.142.10.in-addr.arpa domain name pointer server1.example.com.
192.18.142.10.in-addr.arpa domain name pointer www.example.com.
192.18.142.10.in-addr.arpa domain name pointer ftp.example.com.

2. Try to start /opt/stackstorm/st2/bin/st2notifier --help
EXPECTED RESULTS

It should print help with no errors and exceptions.

ACTUAL RESULTS
$ /opt/stackstorm/st2/bin/st2notifier --help
Traceback (most recent call last):
  File "./st2notifier", line 18, in <module>
    from st2actions.cmd import st2notifier
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/cmd/st2notifier.py", line 13, in <module>
    from st2actions.notifier import config
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 60, in <module>
    register_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 31, in register_opts
    _register_common_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 40, in _register_common_opts
    common_config.register_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/config.py", line 117, in register_opts
    do_register_opts(webui_opts, 'webui', ignore_errors)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/config.py", line 30, in do_register_opts
    cfg.CONF.register_opts(opts, group=group)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/oslo_config/cfg.py", line 1820, in __inner
    result = f(self, *args, **kwargs)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/oslo_config/cfg.py", line 1977, in register_opts
    self.register_opt(opt, group, clear_cache=False)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/oslo_config/cfg.py", line 1824, in __inner
    return f(self, *args, **kwargs)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/oslo_config/cfg.py", line 1961, in register_opt
    return group._register_opt(opt, cli)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/oslo_config/cfg.py", line 1344, in _register_opt
    if _is_opt_registered(self._opts, opt):
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/oslo_config/cfg.py", line 574, in _is_opt_registered
    raise DuplicateOptError(opt.name)
oslo_config.cfg.DuplicateOptError: duplicate option: webui_base_url
Kami commented 6 years ago

Thanks for reporting that.

That's an interesting error we haven't encountered yet before. I wonder if it's some how related to eventlet since recent versions started bundling dnspython library which might have a bug or similar.

As far as the actual error goes - I need to dig in, I see no reason why options would be registered twice, at least not because of DNS related setup (could be something else going on).

Kami commented 6 years ago

@emptywee I was unable to replicate this issue.

I wonder if it's some how related to this issue - https://github.com/StackStorm/st2/pull/4111 (e.g. there are multiple config files laying around and it tries to load and parse multiple ones, in addition to the default location of /etc/st2/st2.conf. That's unlikely because the values should be overwritten / merged, but still).

Do you also get that error if you explicitly pass --config-file /etc/st2/st2.conf argument to the service? Would also be good to see your st2.conf file with secret / private items masked. And please also post output of running the service with --debug flag.

emptywee commented 6 years ago

@Kami I don't think it's related to #4111 .

So, here's the output with --debug flag:

$ sudo /opt/stackstorm/st2/bin/st2notifier --config-file=/etc/st2/st2.conf --debug
2018-06-04 09:03:54,105 DEBUG [-] Using config files: /etc/st2/st2.conf
2018-06-04 09:03:54,105 DEBUG [-] Using logging config: /etc/st2/logging.notifier.conf
2018-06-04 09:03:54,109 INFO [-] Connecting to database "st2" @ "127.0.0.1:27017" as user "None".
2018-06-04 09:03:54,112 DEBUG [-] Ensuring database indexes...
2018-06-04 09:03:54,198 DEBUG [-] Skipping index cleanup for blacklisted model "PermissionGrantDB"...
2018-06-04 09:03:54,233 DEBUG [-] Indexes are ensured for models: ActionAliasDB, ActionAliasDB, ActionDB, ActionExecutionDB, ActionExecutionDB, ActionExecutionOutputDB, ActionExecutionStateDB, ActionExecutionStateDB, ApiKeyDB, ConfigDB, ConfigSchemaDB, GroupToRoleMappingDB, KeyValuePairDB, LiveActionDB, LiveActionDB, PackDB, PermissionGrantDB, PolicyDB, PolicyTypeDB, RoleDB, RuleDB, RuleEnforcementDB, RunnerTypeDB, RunnerTypeDB, SensorTypeDB, TokenDB, TraceDB, TriggerDB, TriggerInstanceDB, TriggerTypeDB, UserDB, UserRoleAssignmentDB
2018-06-04 09:03:54,234 DEBUG [-] Registering exchanges...
2018-06-04 09:03:54,247 DEBUG [-] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit@lvcops101.example.com', 'platform': 'Erlang/OTP', 'version': '3.6.9'}, mechanisms: ['AMQPLAIN', 'PLAIN'], locales: [u'en_US']
2018-06-04 09:03:54,248 DEBUG [-] using channel_id: 1
2018-06-04 09:03:54,250 DEBUG [-] Channel open
2018-06-04 09:03:54,252 DEBUG [-] Registered exchange st2.actionexecutionstate ({'nowait': False, 'exchange': 'st2.actionexecutionstate', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,252 DEBUG [-] Registered exchange st2.announcement ({'nowait': False, 'exchange': 'st2.announcement', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,253 DEBUG [-] Registered exchange st2.execution ({'nowait': False, 'exchange': 'st2.execution', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,253 DEBUG [-] Registered exchange st2.liveaction ({'nowait': False, 'exchange': 'st2.liveaction', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,254 DEBUG [-] Registered exchange st2.liveaction.status ({'nowait': False, 'exchange': 'st2.liveaction.status', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,255 DEBUG [-] Registered exchange st2.trigger ({'nowait': False, 'exchange': 'st2.trigger', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,255 DEBUG [-] Registered exchange st2.trigger_instances_dispatch ({'nowait': False, 'exchange': 'st2.trigger_instances_dispatch', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,256 DEBUG [-] Registered exchange st2.sensor ({'nowait': False, 'exchange': 'st2.sensor', 'durable': True, 'passive': False, 'arguments': None, 'type': 'topic', 'auto_delete': False}).
2018-06-04 09:03:54,256 DEBUG [-] Closed channel #1
2018-06-04 09:03:54,257 DEBUG [-] using channel_id: 1
2018-06-04 09:03:54,258 DEBUG [-] Channel open
2018-06-04 09:03:54,258 DEBUG [-] Predeclaring queue for exchange "st2.liveaction.status"
2018-06-04 09:03:54,260 DEBUG [-] Predeclared queue for exchange "st2.liveaction.status"
2018-06-04 09:03:54,261 DEBUG [-] Predeclaring queue for exchange "st2.liveaction.status"
2018-06-04 09:03:54,262 DEBUG [-] Predeclared queue for exchange "st2.liveaction.status"
2018-06-04 09:03:54,263 DEBUG [-] Predeclaring queue for exchange "st2.liveaction.status"
2018-06-04 09:03:54,264 DEBUG [-] Predeclared queue for exchange "st2.liveaction.status"
2018-06-04 09:03:54,264 DEBUG [-] Predeclaring queue for exchange "st2.execution"
2018-06-04 09:03:54,265 DEBUG [-] Predeclared queue for exchange "st2.execution"
2018-06-04 09:03:54,266 DEBUG [-] Predeclaring queue for exchange "st2.actionexecutionstate"
2018-06-04 09:03:54,267 DEBUG [-] Predeclared queue for exchange "st2.actionexecutionstate"
2018-06-04 09:03:54,267 DEBUG [-] Predeclaring queue for exchange "st2.trigger_instances_dispatch"
2018-06-04 09:03:54,269 DEBUG [-] Predeclared queue for exchange "st2.trigger_instances_dispatch"
2018-06-04 09:03:54,269 DEBUG [-] Predeclaring queue for exchange "st2.announcement"
2018-06-04 09:03:54,272 DEBUG [-] Predeclared queue for exchange "st2.announcement"
2018-06-04 09:03:54,272 DEBUG [-] Predeclaring queue for exchange "st2.execution"
2018-06-04 09:03:54,275 DEBUG [-] Predeclared queue for exchange "st2.execution"
2018-06-04 09:03:54,276 DEBUG [-] Predeclaring queue for exchange "st2.liveaction"
2018-06-04 09:03:54,278 DEBUG [-] Predeclared queue for exchange "st2.liveaction"
2018-06-04 09:03:54,279 DEBUG [-] Predeclaring queue for exchange "st2.execution.output"
2018-06-04 09:03:54,282 DEBUG [-] Predeclared queue for exchange "st2.execution.output"
2018-06-04 09:03:54,282 DEBUG [-] Predeclaring queue for exchange "st2.trigger"
2018-06-04 09:03:54,284 DEBUG [-] Predeclared queue for exchange "st2.trigger"
2018-06-04 09:03:54,284 DEBUG [-] Predeclaring queue for exchange "st2.sensor"
2018-06-04 09:03:54,286 DEBUG [-] Predeclared queue for exchange "st2.sensor"
2018-06-04 09:03:54,286 DEBUG [-] Closed channel #1
2018-06-04 09:03:54,302 DEBUG [-] Conflict while trying to save in DB: Could not save document (E11000 duplicate key error index: st2.role_d_b.$name_1 dup key: { : "admin" }).
2018-06-04 09:03:54,304 DEBUG [-] Conflict while trying to save in DB: Could not save document (E11000 duplicate key error index: st2.role_d_b.$name_1 dup key: { : "observer" }).
2018-06-04 09:03:54,305 DEBUG [-] Conflict while trying to save in DB: Could not save document (E11000 duplicate key error index: st2.role_d_b.$name_1 dup key: { : "system_admin" }).
2018-06-04 09:03:54,305 INFO [-] (PID=30344) Actions notifier started.
2018-06-04 09:03:54,324 INFO [-] Adding job tentatively -- it will be properly scheduled when the scheduler starts
2018-06-04 09:03:54,325 INFO [-] Scheduler is enabled.
2018-06-04 09:03:54,325 INFO [-] Starting Notifier...
2018-06-04 09:03:54,326 INFO [-] Added job "recover_delayed_executions" to job store "default"
2018-06-04 09:03:54,326 INFO [-] Scheduler started
2018-06-04 09:03:54,326 DEBUG [-] Looking for jobs to run
2018-06-04 09:03:54,334 INFO [-] Running job "recover_delayed_executions (trigger: interval[0:05:00], next run at: 2018-06-04 16:03:54 UTC)" (scheduled at 2018-06-04 16:03:54.324199+00:00)
2018-06-04 09:03:54,334 WARNING [-] Coordination backend is not configured. Code paths which use coordination service will use best effort approach and race conditions are possible.
2018-06-04 09:03:54,337 DEBUG [-] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit@lvcops101.example.com', 'platform': 'Erlang/OTP', 'version': '3.6.9'}, mechanisms: ['AMQPLAIN', 'PLAIN'], locales: [u'en_US']
2018-06-04 09:03:54,338 DEBUG [-] Next wakeup is due at 2018-06-04 09:08:54.324199-07:00 (in 299.997316 seconds)
2018-06-04 09:03:54,338 INFO [-] Job "recover_delayed_executions (trigger: interval[0:05:00], next run at: 2018-06-04 16:08:54 UTC)" executed successfully
2018-06-04 09:03:54,339 INFO [-] Connected to amqp://guest:**@127.0.0.1:5672//
2018-06-04 09:03:54,340 DEBUG [-] using channel_id: 1
2018-06-04 09:03:54,341 DEBUG [-] Channel open
^C2018-06-04 09:04:14,548 INFO [-] (PID=30344) Actions notifier stopped.
2018-06-04 09:04:14,550 INFO [-] Scheduler has been shut down
2018-06-04 09:04:14,551 INFO [-] Shutting down Notifier...
2018-06-04 09:04:14,579 DEBUG [-] Looking for jobs to run
2018-06-04 09:04:14,579 DEBUG [-] No jobs; waiting until a job is added

Nothing out of ordinary, as this box has one PTR record. But, if I add the following lines to /opt/stackstorm/st2/lib/python2.7/site-packages/st2common/config.py:

def register_opts(ignore_errors=False):
    print "register_opts called"
    import sys, traceback
    traceback.print_stack()
...

I see stack trace printed twice:

$ sudo /opt/stackstorm/st2/bin/st2notifier --config-file=/etc/st2/st2.conf --debug
register_opts called
  File "/opt/stackstorm/st2/bin/st2notifier", line 18, in <module>
    from st2actions.cmd import st2notifier
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/cmd/st2notifier.py", line 13, in <module>
    from st2actions.notifier import config
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 21, in <module>
    common_config.register_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/config.py", line 55, in register_opts
    traceback.print_stack()
register_opts called
  File "/opt/stackstorm/st2/bin/st2notifier", line 18, in <module>
    from st2actions.cmd import st2notifier
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/cmd/st2notifier.py", line 13, in <module>
    from st2actions.notifier import config
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 60, in <module>
    register_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 31, in register_opts
    _register_common_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/notifier/config.py", line 40, in _register_common_opts
    common_config.register_opts()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/config.py", line 55, in register_opts
    traceback.print_stack()
2018-06-04 09:09:15,702 DEBUG [-] Using config files: /etc/st2/st2.conf
2018-06-04 09:09:15,702 DEBUG [-] Using logging config: /etc/st2/logging.notifier.conf
2018-06-04 09:09:15,705 INFO [-] Connecting to database "st2" @ "127.0.0.1:27017" as user "None".
2018-06-04 09:09:15,708 DEBUG [-] Ensuring database indexes...
2018-06-04 09:09:15,791 DEBUG [-] Skipping index cleanup for blacklisted model "PermissionGrantDB"...
...

Are you saying that you cannot see the same? Does it mean my installation is somehow got kinda broken? And the same happening on the other box where I found this issue in the first play?

I'll post my st2.conf if you confirm that you still cannot reproduce it, because this is all very strange.

emptywee commented 6 years ago

And yeah, since all those opts registering happen before regular debug log is printed, it doesn't really matter whether I add --debug or not, it crashes before it comes to printing anything.

arm4b commented 3 years ago

@emptywee Wondering if you could find a fix or workaround for this?

emptywee commented 3 years ago

Mmm, seems pretty old to me and I don't remember all deets. Prolly a workaround would be having a single PTR record for the server st2 is running on :) Need to reproduce it first to refresh my memory on this. Also, I don't have 3.3.0 running here yet, still on 3.2.0.