VOLTTRON / volttron

VOLTTRON Distributed Control System Platform
https://volttron.readthedocs.io/

PlatformHistorianAgent stops responding after some number of publishes or queries #99

Closed: timothykang closed this issue 9 years ago

timothykang commented 9 years ago

Steps to reproduce:

  1. Empty data table in PlatformHistorianAgent DB.
  2. Start platform with PlatformAgent and PlatformHistorianAgent.
  3. Let platform run for 5 minutes.
  4. Open data table in DB.
  5. See only 36 rows of data instead of the expected 80 rows (5 min * 60 sec / 15 sec interval * 4 topics = 80).

At the same time the historian stops logging data, it also stops responding to queries.

Adjusting the value passed to @Core.periodic in platformagent/agent.py has no effect on the number of rows inserted. I tried intervals of 1, 3, 5, and 15 seconds, and the number of rows inserted is always 36.

Alternatively, if I disable publishing by the PlatformAgent altogether (i.e. comment out @Core.periodic), making 10 to 20 or so queries to the historian will also cause it to stop responding.
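
For context, a minimal sketch of what adjusting that interval looks like; ExampleAgent and publish_status are placeholder names standing in for the real code in platformagent/agent.py:

# Hedged sketch: the argument to @Core.periodic is the callback interval in
# seconds. ExampleAgent/publish_status are placeholders, not the PlatformAgent.
from volttron.platform.vip.agent import Agent, Core

class ExampleAgent(Agent):

    @Core.periodic(15)   # the intervals tried above (1, 3, 5, 15) go here
    def publish_status(self):
        # stand-in for the PlatformAgent's periodic publish
        pass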

timothykang commented 9 years ago

Upon further investigation, it looks like this issue was introduced by ffde099. Reverting it fixes this issue, though I don't know if doing so introduces others. Another effect I've noticed is that agent logs are no longer printed.

CCing @hashstat @craig8

hashstat commented 9 years ago

@timothykang: Have you cleaned up your repository since pulling in that change? There were a couple of commits before ffde099 that could cause problems. In particular, bff0b28 split the vip module into multiple modules inside a package. Try removing all the .pyc files within the project's volttron directories and ensure that the vipagent directory no longer exists. From within the project root, run find volttron/platform -name '*.pyc' -delete and rm -rf volttron/platform/vipagent.

Please see if the above solves the issue. If not, I will look into it further. Thanks.

timothykang commented 9 years ago

I performed the cleanup you described and still encountered the issue, and then tried a fresh clone with the same result.

craig8 commented 9 years ago

@hashstat let me look into this a little. I will give a report at the standup today.

hashstat commented 9 years ago

Sounds like an issue where a non-gevent-aware blocking operation is being performed in a gevent context.
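
To illustrate that failure mode (this is not VOLTTRON code; heartbeat and blocking_task are made-up names for the sketch), a call that is not gevent-aware holds the hub and starves every other greenlet, which would look exactly like periodic callbacks and log output going quiet:

# Illustrative sketch: a blocking call that is not gevent-aware stalls the hub.
import time

import gevent

def heartbeat():
    # Cooperative loop: gevent.sleep yields back to the hub each iteration.
    for i in range(5):
        print('heartbeat', i)
        gevent.sleep(0.1)

def blocking_task():
    # time.sleep is not gevent-aware, so no other greenlet runs for a second;
    # replacing it with gevent.sleep(1) keeps the heartbeats flowing.
    time.sleep(1)
    print('blocking task done')

gevent.joinall([gevent.spawn(heartbeat), gevent.spawn(blocking_task)])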

craig8 commented 9 years ago

@timothykang please verify that this is fixed in the master branch.

craig8 commented 9 years ago

It seems that this is somehow tied to the Mac; further research is ongoing.

timothykang commented 9 years ago

I somehow discovered that commenting out all of the _log.debug calls "fixed" the issue, so I believe the issue is tied to logging. I was able to create the following agent, which causes the platform to stop responding after roughly 140 calls of its periodic log method:

import logging
import sys
import os

from volttron.platform.vip.agent import Agent, Core
from volttron.platform.agent import utils

utils.setup_logging()
_log = logging.getLogger(__name__)

def trouble_agent(config_path):
    class TroubleAgent(Agent):
        counter = 0

        @Core.periodic(0.1)
        def log(self):
            # ping to show activity, since part of the
            # issue is log entries not being displayed
            self.vip.ping('', str(self.counter))
            _log.debug(str(self.counter))
            self.counter += 1

    identity = os.environ.get('AGENT_UUID')
    return TroubleAgent(identity)

def main(argv=sys.argv):
    try:
        utils.default_main(trouble_agent, description='',
                           no_pub_sub_socket=True, argv=argv)
    except Exception:
        _log.exception('unhandled exception')


if __name__ == '__main__':
    # Standard script entry point so the agent can be launched by the platform.
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        pass

I haven't found another OS X machine to try it on, so I'm unable to say whether this issue is specific to my machine or to OS X in general.

hashstat commented 9 years ago

@timothykang and I sat down together to explore this issue. We discovered that it only affects agents enabled to autostart, and I was able to reproduce it on both OS X and Linux. Luckily, we found a fix, which is to move autostart into the services thread. I also expect there are some reads from and writes to file objects lurking in the logging that could potentially cause problems. I will explore that some more.
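
A general pattern for that second concern, offered only as a sketch and not as the actual VOLTTRON patch: Python 3's stdlib QueueHandler/QueueListener hands each record to a native thread, so the code that calls _log.debug never blocks on file I/O. The handler setup and the agent.log path below are assumptions for illustration.

# Sketch only, not the VOLTTRON fix: log records are enqueued by the caller
# and written to disk by a QueueListener running on a native thread.
import logging
import logging.handlers
import queue

log_queue = queue.Queue()
file_handler = logging.FileHandler('agent.log')     # blocking I/O happens here
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(logging.handlers.QueueHandler(log_queue))  # non-blocking enqueue

root.debug('records are queued; the listener thread performs the writes')
listener.stop()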