VOLTTRON / volttron

VOLTTRON Distributed Control System Platform
https://volttron.readthedocs.io/
Other
457 stars 216 forks source link

[develop] Unable to read config and SQL database #2016

Closed tnesztler closed 5 years ago

tnesztler commented 5 years ago

Description of Issue

When HistorianAgent cannot publish data, it tries to send an alert to the health subsystem. This doesn't work due to the BaseHistorianAgent being unable to read the config file. Moreover, it tries to save data to the backup.sqlite database but it can't open it.

Affected Version

develop

Expected

The historian should publish an alert and save data in backup.sqlite database while publishing capability is being resolved.

Actual

The agent crashes and no data is being saved. No notification of it failing is being sent either.

Steps to Reproduce

Prevent the agent from publishing data. It should crash.

Additional Details

NOTE: Running Volttron as a systemctl service (the same way as other instances on master branch without issues)

volttron.service

[Unit]
Description=VOLTTRON Platform Service
After=network.target

[Service]
Type=simple
User=ecorithm
Group=ecorithm
WorkingDirectory=/home/ecorithm/volttron
#Uncomment and change this to specify a different VOLTTRON_HOME
Environment="VOLTTRON_HOME=/home/ecorithm/.volttron"
ExecStart=/home/ecorithm/volttron/env/bin/volttron -L /home/ecorithm/volttron/examples/rotatinglog.py
ExecStop=/home/ecorithm/volttron/env/bin/volttron-ctl shutdown --platform

[Install]
WantedBy=multi-user.target

Stack trace of the crash after it cannot send data.

2019-05-23 10:31:07,785 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: Traceback (most recent call last):
2019-05-23 10:31:07,787 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/async.py", line 190, in _run_call
2019-05-23 10:31:07,790 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/base_historian.py", line 960, in _send_alert_callback
2019-05-23 10:31:07,790 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:     self.vip.health.send_alert(key, alert_status)
2019-05-23 10:31:07,791 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/vip/agent/subsystems/health.py", line 90, in send_alert
2019-05-23 10:31:07,792 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/utils.py", line 219, in get_fq_identity
2019-05-23 10:31:07,793 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/utils.py", line 181, in get_platform_instance_name
2019-05-23 10:31:07,794 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/utils.py", line 174, in load_platform_config
2019-05-23 10:31:07,795 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/usr/lib/python2.7/ConfigParser.py", line 279, in options
2019-05-23 10:31:07,795 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: NoSectionError: No section: 'volttron'
2019-05-23 10:31:07,797 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: <bound method FactsService._send_alert_callback of <facts_service.agent.FactsService object at 0x7f60b41d78d0>> failed with NoSectionError
2019-05-23 10:31:07,797 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:
2019-05-23 10:35:00,291 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: Exception in thread Thread-2:
2019-05-23 10:35:00,291 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: Traceback (most recent call last):
2019-05-23 10:35:00,292 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
2019-05-23 10:35:00,293 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/usr/lib/python2.7/threading.py", line 754, in run
2019-05-23 10:35:00,294 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/base_historian.py", line 978, in _process_loop
2019-05-23 10:35:00,294 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:     self._do_process_loop()
2019-05-23 10:35:00,295 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/base_historian.py", line 1037, in _do_process_loop
2019-05-23 10:35:00,296 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:     cache_full = backupdb.backup_new_data((x for x in new_to_publish if x is not None))
2019-05-23 10:35:00,297 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/base_historian.py", line 1386, in backup_new_data
2019-05-23 10:35:00,298 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:     (timestamp, source, topic_id, dumps(value), dumps(headers)))
2019-05-23 10:35:00,299 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: OperationalError: unable to open database file
2019-05-23 10:35:00,299 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:
2019-05-23 10:35:00,299 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: Traceback (most recent call last):
2019-05-23 10:35:00,300 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/async.py", line 190, in _run_call
2019-05-23 10:35:00,301 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/base_historian.py", line 960, in _send_alert_callback
2019-05-23 10:35:00,302 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:     self.vip.health.send_alert(key, alert_status)
2019-05-23 10:35:00,303 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/vip/agent/subsystems/health.py", line 90, in send_alert
2019-05-23 10:35:00,304 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/utils.py", line 219, in get_fq_identity
2019-05-23 10:35:00,305 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/utils.py", line 181, in get_platform_instance_name
2019-05-23 10:35:00,306 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/volttron/platform/agent/utils.py", line 174, in load_platform_config
2019-05-23 10:35:00,307 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/usr/lib/python2.7/ConfigParser.py", line 279, in options
2019-05-23 10:35:00,307 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: NoSectionError: No section: 'volttron'
2019-05-23 10:35:00,309 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: <bound method FactsService._send_alert_callback of <facts_service.agent.FactsService object at 0x7f60b41d78d0>> failed with NoSectionError

Config File

[volttron]
vip-address = tcp://127.0.0.1:22916
message-bus = zmq
instance-name = ecorithm-N3160
craig8 commented 5 years ago

@tnesztler to "prevent from publishing data" you put it into cache-only mode or how are you preventing it from publishing?

tnesztler commented 5 years ago

Looking into the logs more closely, the Base Historian Agent failed due a gevent issue which seems to be the root problem.

2019-05-23 10:31:07,766 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: Traceback (most recent call last):
2019-05-23 10:31:07,769 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:   File "/home/ecorithm/volttron/env/local/lib/python2.7/site-packages/gevent/threadpool.py", line 207, in _worker
2019-05-23 10:31:07,770 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: error: [Errno 16] Device or resource busy
2019-05-23 10:31:07,771 (facts_serviceagent-1.3.2 21494) <stderr> ERROR: (<ThreadPool at 0x7f60943fdc90 0/1/10>, <built-in function getaddrinfo>) failed with error
2019-05-23 10:31:07,771 (facts_serviceagent-1.3.2 21494) <stderr> ERROR:
2019-05-23 10:31:07,764 (facts_serviceagent-1.3.2 21494) facts_service.agent ERROR: Error when attempting to publish to target: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='ZZZZZZZZZZZZZ', port=443): Max retries exceeded with url: XXXXXXXXX (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f60943fd050>: Failed to establish a new connection: [Errno 16] Device or resource busy',))",),)

I have tried to manually make the agent faulty w/o it resulting in the same error (it handles publishing errors w/o issues).

craig8 commented 5 years ago

@tnesztler the factservice is making a request using https. One of the things that we have done with adding https to volttron is to set a REQUEST_CA_BUNDLE environmental variable for connecting to the local https machine. Could this possibly be an issue for an external connection?

tnesztler commented 5 years ago

In the latest updates that you've pushed to develop, it seems to have been fixed, intentionally or not since even if it is crashing on something else, the agent restarts itself and no loss of data is happening.

I'll look at the REQUESTS_CA_BUNDLE env var if another issue like that arise. So far it pushes data to our API w/o issues.

That's the current error I have related to gevent (happening every day but not at the same time!) but I think we can close this specific issue for the time being.

2019-06-04 04:45:05,977 (facts_serviceagent-1.3.2 8353) <stderr> ERROR: SystemError: (libev) error creating signal/async pipe: Too many open files
2019-06-04 04:45:05,978 (facts_serviceagent-1.3.2 8353) <stderr> ERROR: Traceback (most recent call last):
2019-06-04 04:45:05,979 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:   File "gevent/corecext.pyx", line 2067, in gevent.corecext._syserr_cb (gevent/gevent.corecext.c:35964)
2019-06-04 04:45:05,980 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:   File "gevent/corecext.pyx", line 352, in gevent.corecext.loop._handle_syserr (gevent/gevent.corecext.c:6221)
2019-06-04 04:45:05,982 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:   File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6397)
2019-06-04 04:45:05,983 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:   File "/home/ecorithm/volttron/env/local/lib/python2.7/site-packages/gevent/hub.py", line 565, in handle_error
2019-06-04 04:45:05,984 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:     self.handle_system_error(type, value)
2019-06-04 04:45:05,985 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:   File "/home/ecorithm/volttron/env/local/lib/python2.7/site-packages/gevent/hub.py", line 580, in handle_system_error
2019-06-04 04:45:05,986 (facts_serviceagent-1.3.2 8353) <stderr> ERROR:     self.parent.throw(type, value)
2019-06-04 04:45:05,987 (facts_serviceagent-1.3.2 8353) <stderr> ERROR: error: cannot switch to a different thread
2019-06-04 04:45:05,987 (facts_serviceagent-1.3.2 8353) <stderr> ERROR: (libev) error creating signal/async pipe: Too many open files
2019-06-04 04:51:16,878 () volttron.platform.aip INFO: starting agent /home/ecorithm/.volttron/agents/ab5d9dba-43e8-43b1-b98f-9341e1f8e443/facts_serviceagent-1.3.2
2019-06-04 04:51:16,900 () volttron.platform.aip INFO: agent /home/ecorithm/.volttron/agents/ab5d9dba-43e8-43b1-b98f-9341e1f8e443/facts_serviceagent-1.3.2 has PID 2462
2019-06-04 04:51:16,901 () volttron.platform.control INFO: Successfully restarted agent facts_serviceagent-1.3.2
2019-06-04 04:51:18,006 (facts_serviceagent-1.3.2 2462) <stderr> ERROR: Generating LALR tables