auto-pi-lot / autopilot

Distributed behavioral experiments
https://docs.auto-pi-lot.com
Mozilla Public License 2.0
93 stars 24 forks source link

Cannot cleanly restart a Child task without restarting autopilot #168

Open Rodgers-PAC-Lab opened 2 years ago

Rodgers-PAC-Lab commented 2 years ago

For a while I've been using a Parent Pi connected to multiple Child Pi (as discussed here: https://github.com/wehr-lab/autopilot/issues/101). This works, but with the annoying problem that the autopilot process on each Child has to be manually quit and restarted between every session, or otherwise we get a ZMQError. I think the Child is not stopping correctly.

Full code for a minimal example (but see also relevant excerpts below if that is easier): The Parent task: https://github.com/Rodgers-PAC-Lab/autopilot/blob/paft2022/autopilot/tasks/paft_parent_child.py The Child task: https://github.com/Rodgers-PAC-Lab/autopilot/blob/b2105803b7bd4e36e39ac1a6f662ec93f0c85c96/autopilot/tasks/children.py#L35

Here is an excerpt of the code in init for the Parent task. The first Net_Node tells the Child to start, and the second Net_Node is used for bidirectional communication with the Child.

        self.node = Net_Node(id="T_{}".format(prefs.get('NAME')),
            upstream=prefs.get('NAME'),
            port=prefs.get('MSGPORT'),
            listens={},
            instance=False,
            )
        value = {
            'child': {
                'parent': prefs.get('NAME'), 'subject': subject},
            'task_type': 'PAFT_Child',
            'subject': subject,
            'reward': reward,
        }

        # send to the station object with a 'CHILD' key
        self.node.send(to=prefs.get('NAME'), key='CHILD', value=value)

        ## Create a second node to communicate with the child
        # We (parent) are the "router"/server
        # The children are the "dealer"/clients
        # Many dealers, one router        
        self.node2 = Net_Node(
            id='parent_pi',
            upstream='',
            port=5000,
            router_port=5001,
            listens={
                'HELLO': self.recv_hello,
                'POKE': self.recv_poke,
                },
            instance=False,
            )

Here is the code in the init for the Child task. This Net_Node is used to communicate with the Net_Node of the same name in the Parent.

        self.node2 = Net_Node(
            id=self.name,
            upstream='parent_pi',
            port=5001,
            upstream_ip=prefs.get('PARENTIP'),
            listens={
                'HELLO': self.recv_hello,
                'END': self.recv_end,
                },
            instance=False,
            )        

        # Send
        self.node2.send(
            'parent_pi', 'HELLO', {'from': self.name})

All of the above actually works! However, if we stop the session and start a new one, then the Parent generates an exception when we try to create self.node2 again: zmq.error.ZMQError: Address already in use

I've come up with two semi-fixes, neither of which really fixes the whole problem. First, to fix ZMQError, I have this line self.node2.router.close() in the end() function of the Parent task. After this fix, the ZMQError stops happening. Note that it's not sufficient to have self.node2.release(), so perhaps closing the router needs to be added as a step in releasing a Net_Node.

Second, I explicitly tell the Child class to end, by adding self.node.send(to=prefs.get('NAME'), key='CHILD', value={'KEY': 'STOP'}) in the end function of the Parent task. Does this look right? I think this is probably a good thing to do, though I don't see it in the example GoNoGo/Wheel_Child code.

Nonetheless, even after these fixes, I haven't yet succeeded in starting a Child task again, without rebooting the autopilot process. Any tips would be much appreciated!!

PS - I think we could also test/debug this in the GoNoGo/Wheel_Child standard task, but that one is not running for me, not sure if I am missing some hardware or what. It says something about a KeyError 'F' in initting some gpio pins.

Rodgers-PAC-Lab commented 2 years ago

success?? I think that adding self.node.sock.close() in both the Parent and Child end() functions might fix this. https://github.com/Rodgers-PAC-Lab/autopilot/commit/ec78bc95f9751013cb8a0e171e41c8170fb1db5b

along with the aforementioned self.node.router.close(), which is only needed in the Parent, for the special Net_Node that was initialized with a router_port. https://github.com/Rodgers-PAC-Lab/autopilot/commit/a10c0bdc0590debe037984407156624633444273

It seems perhaps there are some dangling sockets left open, and this is causing problems on subsequent session starts. I'll continue testing and if this works I would probably suggest adding self.node.sock.close() and if self.node.router: self.node.router.close() within Net_Node.release here: https://github.com/wehr-lab/autopilot/blob/90956187d4222f16f67ab8b39b8359da954d5dcc/autopilot/networking/node.py#L642

sneakers-the-rat commented 2 years ago

Awesome! yes i have been super sloppy with closing stuff, agree this is super annoying, will revisit after this current release

sneakers-the-rat commented 2 years ago

Figured out a reliable way to kill the IOLoop objects that keep the networking objects open. It's sort of a careful dance of trying to hold threads and processes open with a blocking IOLoop.start() call but then interact with them through callbacks launched by the on_recv callback of a ZMQSteam which are in the same process but different thread. Most of the operations of the IOLoop aren't thread safe, in fact the only one that is is the add_callback method. So by adding a callback that does this

self.loop.add_callback(lambda:IOLoop.instance().stop())

you can reliably kill them. You're right, also need to close the sockets. I'll work on cleaning this up in v0.5.0, which i'm preparing now.

cxrodgers commented 2 years ago

a few other things I discovered that seem to be important for cleanly shutting down, at least over here. I haven't extensively tested them yet, but I'm just mentioning it now, in case it rings a bell. Now I can cleanly shut down the terminal, and I never have to restart the autopilot process on the Pi between sessions, which wasn't the case for me before in our setup. I can test these out and make PRs after 0.5.0 too.

sneakers-the-rat commented 2 years ago

yes i just arrived at the same changes ;). working my way through some more of the eternal technical debt, thanks for the pointers as well, very helpful.