auto-pi-lot / autopilot

Distributed behavioral experiments
https://docs.auto-pi-lot.com
Mozilla Public License 2.0
89 stars 23 forks source link

PING sometimes not received #181

Open cxrodgers opened 1 year ago

cxrodgers commented 1 year ago

I have two Parent pis (rpi_parent01 and rpi_parent02) connected to a Terminal. On occasion, when starting the Terminal GUI, one or both of the Pis is not recognized. The symptom is that the green "START" button remains gray on the Terminal GUI. Clicking the gray button or restarting the Terminal often fixes it. I think the issue is that the Pis aren't reliably receiving the PING message.

I'm trying to get to the bottom of why this happens, because even though it's not a big deal in the case I'm describing, I just don't understand why it isn't more reliable. I also occasionally get sessions that crash in the middle, potentially when there's heavy network traffic. Although that doesn't happen often enough for me to replicate or debug, I'm thinking that if I can debug this PING issue, it might be a clue to the more rare crashes.

Here are the Terminal log files in networking.station_02.log for a successful start where both Pis are recognized.

2022-07-26 19:05:20,508 - networking.station_02 - INFO : parent, module-level logger created: networking.station_02
2022-07-26 19:05:20,510 - networking.station_02.Terminal_Station.T - INFO : Logger created: networking.station_02.Terminal_Station.T
2022-07-26 19:05:20,511 - networking.station_02.Terminal_Station.T - INFO : Starting IOLoop
2022-07-26 19:05:20,652 - networking.station_02.Terminal_Station.T - DEBUG : RECEIVED: ID: _T_0; TO: T; SENDER: _T; KEY: INIT; FLAGS: {'NOREPEAT': True}; VALUE: None
2022-07-26 19:05:20,653 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_0; TO: rpi_parent01; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None
2022-07-26 19:05:20,655 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_1; TO: rpi_parent02; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None
2022-07-26 19:05:20,678 - networking.station_02.Terminal_Station.T - DEBUG : RECEIVED: ID: rpi_parent02_5; TO: T; SENDER: rpi_parent02; KEY: STATE; FLAGS: {'NOLOG': True}; VALUE: IDLE
2022-07-26 19:05:20,679 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_3; TO: rpi_parent02; SENDER: T; KEY: CONFIRM; FLAGS: {}; VALUE: rpi_parent02_5
2022-07-26 19:05:20,681 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_2; TO: _T; SENDER: T; KEY: STATE; FLAGS: {}; VALUE: {'state': 'IDLE', 'pilot': 'rpi_parent02'}
2022-07-26 19:05:20,683 - networking.station_02.Terminal_Station.T - DEBUG : RECEIVED: ID: rpi_parent01_62; TO: T; SENDER: rpi_parent01; KEY: STATE; FLAGS: {'NOLOG': True}; VALUE: IDLE
2022-07-26 19:05:20,683 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_4; TO: P_rpi_parent02; SENDER: T; KEY: STATE; FLAGS: {}; VALUE: IDLE
2022-07-26 19:05:20,685 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_5; TO: _T; SENDER: T; KEY: STATE; FLAGS: {}; VALUE: {'state': 'IDLE', 'pilot': 'rpi_parent01'}
2022-07-26 19:05:20,687 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_6; TO: rpi_parent01; SENDER: T; KEY: CONFIRM; FLAGS: {}; VALUE: rpi_parent01_62
2022-07-26 19:05:20,688 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_7; TO: P_rpi_parent01; SENDER: T; KEY: STATE; FLAGS: {}; VALUE: IDLE

Here is the same log file, just 4 seconds later, when I restarted the Terminal but only one Pi was recognized and not the other. The PING is sent but it doesn't seem to be received by rpi_parent01.

2022-07-26 19:05:24,845 - networking.station_02 - INFO : parent, module-level logger created: networking.station_02
2022-07-26 19:05:24,847 - networking.station_02.Terminal_Station.T - INFO : Logger created: networking.station_02.Terminal_Station.T
2022-07-26 19:05:24,848 - networking.station_02.Terminal_Station.T - INFO : Starting IOLoop
2022-07-26 19:05:24,989 - networking.station_02.Terminal_Station.T - DEBUG : RECEIVED: ID: _T_0; TO: T; SENDER: _T; KEY: INIT; FLAGS: {'NOREPEAT': True}; VALUE: None
2022-07-26 19:05:24,990 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_0; TO: rpi_parent01; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None
2022-07-26 19:05:24,992 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_1; TO: rpi_parent02; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None
2022-07-26 19:05:25,014 - networking.station_02.Terminal_Station.T - DEBUG : RECEIVED: ID: rpi_parent02_7; TO: T; SENDER: rpi_parent02; KEY: STATE; FLAGS: {'NOLOG': True}; VALUE: IDLE
2022-07-26 19:05:25,016 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_3; TO: rpi_parent02; SENDER: T; KEY: CONFIRM; FLAGS: {}; VALUE: rpi_parent02_7
2022-07-26 19:05:25,017 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_2; TO: _T; SENDER: T; KEY: STATE; FLAGS: {}; VALUE: {'state': 'IDLE', 'pilot': 'rpi_parent02'}
2022-07-26 19:05:25,018 - networking.station_02.Terminal_Station.T - DEBUG : MESSAGE SENT - ID: T_4; TO: P_rpi_parent02; SENDER: T; KEY: STATE; FLAGS: {}; VALUE: IDLE

If I look in the same logfile on the Pis, I can see the PINGs when they are received, but nothing at all is logged when the PING isn't received. Here's the logfile from the Pi that started up correctly twice:

2022-07-26 19:05:20,489 - networking.station_02.Pilot_Station.rpi_parent02 - DEBUG : RECEIVED: ID: T_1; TO: rpi_parent02; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None
2022-07-26 19:05:24,827 - networking.station_02.Pilot_Station.rpi_parent02 - DEBUG : RECEIVED: ID: T_1; TO: rpi_parent02; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None

And the same logfile from the Pi that started the first time, but not the second time:

2022-07-26 19:05:20,540 - networking.station_02.Pilot_Station.rpi_parent01 - DEBUG : RECEIVED: ID: T_0; TO: rpi_parent01; SENDER: T; KEY: PING; FLAGS: {'NOLOG': True}; VALUE: None

This is with autopilot 0.4.4, I haven't recently upgraded, and I also have my own changes in this fork that might be causing this problem :grimacing: Anyone encountered this or have ideas about what to check?

sneakers-the-rat commented 1 year ago

This is a recurring issue that has to do with the asymmetric connections between terminal and Pi: the terminal's router socket binds the port, and the pilot connects to it. So when the pilot is maintaining a connection, but the terminal dies for some reason, then there's no robust guarantee that the pilot will re-establish the connection.

The ping should arguably come from the pilot rather than the terminal, but I had hesitated to add a persistent thread to the station classes or the pilot, but I think that would be the most straightforward approach to a patch - the dealer socket that connects to the bound router socket should check that it's alive periodically and attempt to refresh the connection if not.

Related to the overdue work of simplifying the network topology into a more p2p-like network with a generic error-handling mechanism for messages with delivery failures. https://github.com/auto-pi-lot/autopilot/issues/48

week and a half until dissertation and then I really want to try reworking the networking stuff because it's sort of a prerequisite and a stepping stone to some other p2p work i want to do next. This, the agent structure, and the GUI are the last three major points that need to be rebuilt from their initial form, but fixing this makes all the others easier. just need to find the time.

cxrodgers commented 1 year ago

thanks for the explanation! Because I don't know much about ZMQ and networking, I'm unclear on why the pilot will sometimes but not always re-establish the connection. Does clicking the gray box to ping again cause the connection to be re-established, or does it just reveal that the connection was re-established already? Interesting that it might just need a bit more time ... Maybe we can add a second Ping attempt ~5 seconds after starting the Terminal GUI.

It seems a perfectly reasonable workaround is restarting autopilot on the pis, and/or just waiting and clicking the gray box, so therefore this is a low-priority issue.

In any case it doesn't sound like this has anything to do with other stability issues I'm experiencing. I occasionally get SegmentationFault on the Terminal GUI for no apparent reason. I suspect it's because I'm now sending massive network traffic from Child pis to Parent pis, in order to log every individual sound time from many speakers and every individual poke, and writing it all to the HDF5, and maybe overloading something. I've also changed a bunch of stuff in my fork. I'll keep looking.

sneakers-the-rat commented 1 year ago

I am also unclear on why it works sometimes and not others.

The ping button will send a message to the PI requesting it to send back its state, and receiving a state update is what will change the button to 'green' - if it works, it's likely the case that the connection is established, but the terminal didnt' receive a 'state' message for some reason (it was already sent, etc.).

There is a 'heartbeat' method in the terminal that is currently not called on init because it's been historically troublesome - https://github.com/auto-pi-lot/autopilot/blob/04b5968ba02c8a1413a27eb6a138b6a186b130f1/autopilot/agents/terminal.py#L502 - but that just reflects the sort of messiness of the agent classes (hence needing to clean them up). The notion of what agents should do is clearer to me now, and i think that should be something that the network nodes do, rather than something the terminal does specifically, but again in this case for a short-term patch it might be one way of making it less annoying to use in the meantime.

I'd like to see the segfault - in the past I have mostly received those from trying to do multithreading within Qt stuff, which really doesn't like using python threads. So that might come from plotting or otherwise invoking some GUI event when the message is received. The data handling method naively sends any data to both the terminal for storage (via the subject class) as well as to the relevant plot, even when there is no plot configuration associated with a particular kind of data, which should be fixed and is one place to look. v0.5.0 also does a lot of work on the data storage system, with better locks and control over access to the hdf5 file, so that's another possibility. It should be possible to log all that data, because i've had no problem being able to write eg. continuous streams from sensors or cameras which are presumably more data than the discrete poke/sound metadata, so yes! let's diagnose a segfault

cxrodgers commented 1 year ago

When I have observed the segfault, there is no seemingly useful debugging information. Just the usual string of messages in the CLI, and then "Segmentation Fault", and the Terminal GUI disappears. I've asked people to screenshot the window when this happens again, because I am not sure that error messages always make it into the logfiles.

It's definitely not as much data as you get with a sensor/camera, but it might be more separate messages (each relatively small) all arriving at once. I wonder if somehow it is trying to write data from two messages at once, or possibly plot results from two messages at once.

I'm doing some custom plotting, but nothing more complex than the standard plots (no extra threads), so I hope that is not the issue. I did already suspect the HDF5 writes, because I've had segfaults with HDF5 in other, non-Autopilot contexts. Upgrading to v0.5 might fix this!

Now that I look at the new version of https://github.com/auto-pi-lot/autopilot/blob/690cca77e31bf58c94c41f27f87af87bc5a66531/autopilot/data/subject.py#L839, I see there's been a lot of very nice looking updates to the _data_thread function, which I've also been messing around in. Looking forward to updating!

sneakers-the-rat commented 1 year ago

Anything being fed through the usual terminal data methods should use save_data which just drops the data into a queue that's iterated over by _data_thread, so it shouldn't be trying to do simultaneous writes (and since the h5f file is only ever accessed within a context manager that by default locks access, unless multiple versions of a Subject object are created that should be impossible. which reminds me that the Subject object should forbid that.)

Multiple calls to plot at once is plausible, as there isn't any locking or queueing there, and since the GUI crashes that seems likely to me. The Segfault wouldn't be caught in the log since it is thrown by the underlying Qt process, but should be present in stderr (or if you're running as a systemd daemon accessible with journalctl -u autopilot ). I have some notes about debugging Qt around here somewhere...

Yes! did a lot of work on how data is handled and specified generally within autopilot, notes here: https://docs.auto-pi-lot.com/en/latest/changelog/v0.5.0.html . This is a transitional upgrade towards making these kinds of data models universal across autopilot to start eliminating all the implicit logic that makes it hard to use, but with some additional work we should have a much more robust and flexible data management system on our hands (got some prototypes for automatic export to datajoint and NWB, but they're very skeletal!)

Please do backup before upgrading to v0.5.0a0, I did not have the time to write full tests because dissertation, and there is some stuff in there for upgrading old subject files to the new format that is written to be nondestructive but I would hate for you to lose anything.