OpenAgricultureFoundation / openag_brain

ROS package for controlling an OpenAg food computer
GNU General Public License v3.0

event loop crashes when pi loses sync with arduino and port changes #166

Closed jakerye closed 7 years ago

jakerye commented 7 years ago

Summary

When running the recipe, the arduino will periodically lose sync with the raspberry pi. When this happens, the pi restarts the arduino. When the arduino restarts, sometimes the port will change from ttyACM0 to ttyACM1. This breaks the event loop on the raspi.

Docker logs

[INFO] [WallTime: 1492639648.890978] Setup publisher on /sensors/water_level_sensor_high_1/is_on/raw [std_msgs/Bool]
[INFO] [WallTime: 1492639648.912157] Setup publisher on /sensors/atlas_ph_1/water_potential_hydrogen/raw [std_msgs/Float32]
[INFO] [WallTime: 1492639648.932244] Setup publisher on /sensors/atlas_ec_1/water_electrical_conductivity/raw [std_msgs/Float32]
Traceback (most recent call last):
  File "/home/pi/catkin_ws/src/rosserial/rosserial_python/nodes/serial_node.py", line 85, in <module>
    client.run()
  File "/home/pi/catkin_ws/src/rosserial/rosserial_python/src/rosserial_python/SerialClient.py", line 503, in run
    self.requestTopics()
  File "/home/pi/catkin_ws/src/rosserial/rosserial_python/src/rosserial_python/SerialClient.py", line 389, in requestTopics
    self.port.flushInput()
  File "/usr/lib/python2.7/dist-packages/serial/serialposix.py", line 500, in flushInput
    termios.tcflush(self.fd, TERMIOS.TCIFLUSH)
termios.error: (5, 'Input/output error')

Similar issues: rosserial-running-please

Source

Make sure that a new device file is created when a new device is connected. Check which port the Arduino is connected to:

ls -la /dev/ttyACM*

Output:

crw-rw---- 1 root dialout 166, 1 Apr 19 16:40 /dev/ttyACM1

Hypothesis:

Test:

Results

Next steps:

gordonbrander commented 7 years ago

We should probably be trying several device locations:

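import serial
import time

# Candidate serial device paths to probe, in order.
locations = ['/dev/ttyUSB0', '/dev/ttyUSB1', '/dev/ttyUSB2', '/dev/ttyUSB3',
             '/dev/ttyS0', '/dev/ttyS1', '/dev/ttyS2', '/dev/ttyS3']

arduino = None
for device in locations:
    try:
        print("Trying... " + device)
        arduino = serial.Serial(device, 9600)
        break
    except serial.SerialException:
        print("Failed to connect on " + device)

# Only talk to the board if one of the locations actually opened.
if arduino is not None:
    try:
        arduino.write(b'Y')
        time.sleep(1)
        print(arduino.readline())
    except serial.SerialException:
        print("Failed to send!")

gordonbrander commented 7 years ago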

@jakerye in the event we keep the Arduino, rather than connecting directly to the Pi... what do you think about ditching ROSSerial and writing a bespoke message protocol? We could send binary data with sentinel bits describing the environmental variable.

jakerye commented 7 years ago

Test Several Device Locations

I like the idea of trying several device locations. We could also query the connected USB devices:

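import re
import subprocess

# Match one line of `lsusb` output and capture bus, device number, ID and tag.
device_re = re.compile(r"Bus\s+(?P<bus>\d+)\s+Device\s+(?P<device>\d+).+ID\s(?P<id>\w+:\w+)\s(?P<tag>.+)$", re.I)
# universal_newlines=True returns text rather than bytes.
output = subprocess.check_output(["lsusb"], universal_newlines=True)
devices = []
for line in output.split('\n'):
    info = device_re.match(line)
    if info:
        dinfo = info.groupdict()
        # Replace the bus/device numbers with the matching /dev/bus/usb path.
        dinfo['device'] = '/dev/bus/usb/%s/%s' % (dinfo.pop('bus'), dinfo.pop('device'))
        devices.append(dinfo)
print(devices)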

Output:

[
{'device': '/dev/bus/usb/001/009', 'tag': 'Apple, Inc. Optical USB Mouse [Mitsumi]', 'id': '05ac:0304'},
{'device': '/dev/bus/usb/001/001', 'tag': 'Linux Foundation 2.0 root hub', 'id': '1d6b:0002'},
{'device': '/dev/bus/usb/001/002', 'tag': 'Intel Corp. Integrated Rate Matching Hub', 'id': '8087:0020'},
{'device': '/dev/bus/usb/001/004', 'tag': 'Microdia ', 'id': '0c45:641d'}
]

Then use this information to verify device tag or id.
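A minimal sketch of that verification step, assuming the `devices` list has the shape printed above. The helper name `find_devices_by_vendor` and the last sample entry are illustrative only. `2341` is the USB vendor ID registered to Arduino, but clone boards often report a different vendor ID, so check `lsusb` on your own hardware before relying on it.

```python
def find_devices_by_vendor(devices, vendor_id):
    """Return the lsusb entries whose 'id' field starts with vendor_id."""
    wanted = vendor_id.lower() + ":"
    return [d for d in devices if d["id"].lower().startswith(wanted)]


# Sample input in the same shape as the output above. The last entry is a
# made-up example of what an Arduino might look like, not real output.
devices = [
    {"device": "/dev/bus/usb/001/009", "tag": "Apple, Inc. Optical USB Mouse [Mitsumi]", "id": "05ac:0304"},
    {"device": "/dev/bus/usb/001/001", "tag": "Linux Foundation 2.0 root hub", "id": "1d6b:0002"},
    {"device": "/dev/bus/usb/001/005", "tag": "Arduino SA Mega 2560 (example)", "id": "2341:0042"},
]

arduinos = find_devices_by_vendor(devices, "2341")
print(arduinos)
```

Note that this only confirms that a matching board is attached to the USB bus. It does not say which /dev/ttyACM* node the board is mapped to.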

Custom Protocol

I agree on having to write our own protocol. Sentinels make sense for env vars as long as we don't make them absolute (e.g. sentinel 1 could change from air_temp to air_humid by changing some config). We could probably start with something relatively simple just to get up and running, but after that we could also put in some confirmation message so we don't have to constantly stream the actuator commands.
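To make the configurable-sentinel idea concrete, here is a rough sketch, not a proposed wire format. It assumes a frame of one sentinel byte followed by one little-endian float32, and the `SENTINELS` table, `encode` and `decode` are hypothetical names. A real protocol would also need frame delimiting, a checksum, and the confirmation message mentioned above, none of which is shown here.

```python
import struct

# Hypothetical config: sentinel byte -> variable name. Changing this table
# remaps a sentinel without touching the framing code.
SENTINELS = {1: "air_temperature", 2: "air_humidity"}

# One frame = 1 unsigned byte (sentinel) + 1 little-endian float32 (value).
FRAME = struct.Struct("<Bf")


def encode(name, value, sentinels=SENTINELS):
    """Pack a named reading into a 5-byte frame."""
    by_name = {v: k for k, v in sentinels.items()}
    return FRAME.pack(by_name[name], value)


def decode(frame, sentinels=SENTINELS):
    """Unpack a 5-byte frame into (variable name, value)."""
    sentinel, value = FRAME.unpack(frame)
    return sentinels[sentinel], value


frame = encode("air_temperature", 23.5)
print(len(frame), decode(frame))
```

Passing a different table to `decode` changes what sentinel 1 means, which is the "not absolute" property described above.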

rwdavis513 commented 7 years ago

The Food Computers in the Utah office are losing sync but are not reconnecting at any other location, so it doesn't look like simply trying other devices will work as a temporary fix for all the issues.

rwdavis513 commented 7 years ago

What if we tried decreasing the baud rate? Perhaps that would help to keep the serial bus from overloading? http://answers.ros.org/question/11022/rosserial_arduino-trouble-with-too-much-messaging/

Is there a way we could determine whether it's the messages coming from the Arduino to the Raspberry Pi or from the Raspberry Pi to the Arduino that are causing the overload?
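One rough way to sanity-check the overload theory is to compare the bytes per second the topics demand with what the link can carry. The sketch below is back-of-envelope arithmetic only. The function name, the topic list, and the per-message overhead are all assumptions, and it assumes 8N1 framing, which puts 10 bits on the wire for every byte. Running it separately on the Arduino-to-Pi topics and the Pi-to-Arduino topics would give a first estimate of which direction carries more traffic.

```python
def link_utilization(baud, topics, overhead_bytes):
    """Fraction of serial capacity used by the given topics.

    topics: list of (payload_bytes, messages_per_second) tuples.
    overhead_bytes: assumed framing overhead added to every message.
    Assumes 8N1 framing, i.e. 10 bits on the wire per byte.
    """
    capacity = baud / 10.0  # bytes per second
    demand = sum((payload + overhead_bytes) * rate for payload, rate in topics)
    return demand / capacity


# Hypothetical load: 20 Float32 topics at 10 Hz, 8 bytes of assumed overhead.
topics = [(4, 10)] * 20
for baud in (115200, 57600, 9600):
    print(baud, round(link_utilization(baud, topics, 8), 2))
```

By this arithmetic alone, a lower baud rate leaves less headroom for the same traffic, so the estimate is mainly useful for judging whether message volume is a plausible cause. It says nothing about signal integrity, where a slower rate may still help.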

ghost commented 7 years ago

This is a wild guess based on experiments I did a couple months ago, but there's a halfway decent chance one or more I2C firmware modules are getting locked up in blocking I/O. The AM2315 or MHZ16 code would be my top bets.

My sense from reading library code used by the firmware modules is that it's not uniformly designed to be robust against misbehaving sensors. One way you might be able to check for this sort of thing is to put LEDs with large current limiting resistors on the SDA and SCL pins and watch the flickering. To be a little fancier about it, you could use a dual channel scope or logic analyzer with high impedance probes. My guess is that you will see regular flickering on the I2C pins, then it will stop (deadlock in firmware module code), then ROS will get mad and reset the arduino.

Another way you could approach this would be to use spare Arduino pins to light a different LED for each firmware module. Turn the LED on when the module's main function starts, then turn it back off when it returns. The module whose LED gets stuck before a reset is probably your culprit.

[edit: To extend the spare pin idea a little further, you could use a second Arduino to trace and profile the PFC2 Arduino's module activity indicator pins. The sketch could be as simple as reading an input port (8 pins at a time), printing that to the serial monitor as hex, then using a python script to analyze the pins and timing.]
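The analysis script for that trace could be sketched roughly as follows. It assumes the tracing Arduino prints one hex byte per sample, that the host attaches a timestamp to each line, and that bit i corresponds to the activity pin of module i. The function name `stuck_pins` and the sample data are made up for illustration.

```python
def stuck_pins(samples, n_pins=8):
    """Return {pin: seconds_high} for pins still high at the last sample.

    samples: list of (timestamp_seconds, hex_string) in time order, where
    bit i of the value is the activity pin for module i (assumed layout).
    """
    if not samples:
        return {}
    high_since = {}
    for t, hex_value in samples:
        value = int(hex_value, 16)
        for pin in range(n_pins):
            if value & (1 << pin):
                high_since.setdefault(pin, t)  # remember when it went high
            else:
                high_since.pop(pin, None)      # pin returned low
    last_t = samples[-1][0]
    return {pin: last_t - start for pin, start in high_since.items()}


# Made-up trace: pin 0 pulses normally, pin 2 goes high and never returns.
samples = [(0.0, "00"), (0.25, "01"), (0.5, "00"),
           (0.75, "04"), (1.0, "04"), (3.0, "04")]
print(stuck_pins(samples))
```

A pin that has been high for much longer than its module normally runs, right before a reset, points at the module that is blocking.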

rwdavis513 commented 7 years ago

Thanks for your input @wsnook! I'll look into giving one of those ideas a try. To start, I was thinking of just decreasing the number of sensors/actuators that are started in the openag_brain launch file to see if it stops occurring.

I was able to determine that the rosserial_python will still work after openag_brain loses connection, so it appears the arduino will still work correctly once the bus is cleared without having to reflash it.

ghost commented 7 years ago

If you want to turn something off, I'd start with the MHZ16.

novemberalpha commented 7 years ago

I'd offer that going to bespoke is good as long as our approach is more resilient and we still facilitate the ability to add custom sensors in a simpler manner. I'm still getting my head around the current method and it's a bit tricky.

There are some issues with apps fighting for the Arduino. If a flash is triggered while ROS is monitoring the Arduino the flash will give timeout errors. The unplug/replug solution breaks the connection with ROS and allows (temporarily) the flash to burn the Arduino.

This is all theory btw. I have not traced the code.

To get an arduino to hop off its TTY connection it needs to overload and reboot while a task is connected to it. Having gone through a couple of cheap Mega clones, I wonder if testing an original Mega2560 would make sense. I guess technically the processor is the same, but I've had clones that delivered the wrong voltage on pins and one that wouldn't burn fresh out of the box. They are cheap junk.

We'll need to rig up a monitoring system. I'd suggest adding a logger to the arduino and dumping to it. https://learn.sparkfun.com/tutorials/openlog-hookup-guide

Hopefully the logs can help us find the culprit.

Finally, I'd like to vote that we keep the Arduino vs using GPIO on the Pi. My first driver is compatibility (hi PFC1 guy!). My second is that the Pi is a lousy MCU: too much software is running on the CPU to guarantee that simple switch functions fire in a timely manner. Maybe that's not a big deal. Here's an article describing the differences.

Cheers!

ghost commented 7 years ago

If re-engineering the firmware and maybe changing wiring are on the table, here's a handful of suggestions for how you could make things more transparent, robust, and easy to debug:

  1. Consider moving the Atlas Scientific sensor boards from I2C to serial. The Mega 2560 has extra serial pins, and the sensor boards come in serial mode by default. This would simplify your isolation circuitry and remove a difficult-to-explain setup step: switching the boards to I2C mode. The big bonus is that you'd never have to worry about serial signals messing up your I2C bus if somebody wires up an un-configured sensor board. (@rwdavis513, are you sure your Atlas boards are all in I2C mode?)

  2. Write a sketch that includes drivers for all the sensors in the food computer ecosystem, and use I2C address detection and/or other methods to figure out which sensors are present. The Mega 2560 has tons of pins, so for non-I2C sensors, you could dedicate certain pins for stuff like 1-wire, water level switch, etc. For stuff like normally open SPST switches that would be hard to detect electrically, use configuration jumpers or ifdefs in the sketch.

  3. Make the Arduino work like a REPL on the serial port. Don't use binary. If you make the firmware work like a human readable shell, it will be trivial to test your real firmware manually or with a simple test harness. Imagine ssh-ing from your phone to the Pi, doing a screen /dev/ttyACM0, and changing the lights with something like L255,144,49\n (red=255, blue=144, white=49). For sensor output, you could use JSON, CSV, or whatever. You could even skip all hardware detection of sensor presence and just provide a different REPL command for reading each sensor.

  4. Use a python program on the Pi to translate from the Arduino's dirt-simple plaintext protocol to whatever ROS needs. Right now ROS knows too much about the Arduino firmware implementation, and the lack of separation is causing a debugging problem--you can't rely on the usual Arduino IDE and serial monitor because the firmware is auto-generated and auto-flashed by configuration higher in the stack. If you build it so there's an option to plug a laptop into the Arduino, your hardware debugging workflow will have the potential to be awesome.
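Suggestions 3 and 4 could look something like the sketch below on the Pi side. The `L255,144,49\n` command format comes from the example above. Everything else is an assumption made for illustration: the function names, the 0 to 255 range check, and the `name,value` CSV format for sensor lines.

```python
def light_command(red, blue, white):
    """Format a light command in the 'L<red>,<blue>,<white>' style above."""
    for level in (red, blue, white):
        if not 0 <= level <= 255:  # assumed valid range
            raise ValueError("light levels must be in 0..255")
    return "L%d,%d,%d\n" % (red, blue, white)


def parse_reading(line):
    """Parse a hypothetical 'name,value' CSV sensor line into (name, float)."""
    name, raw = line.strip().split(",", 1)
    return name, float(raw)


print(repr(light_command(255, 144, 49)))
print(parse_reading("air_temperature,23.5\n"))
```

A translator node would write the strings from `light_command` to the serial port and publish the tuples from `parse_reading` on the matching ROS topics. Because both sides are plain text, the same lines can be typed by hand into a serial monitor.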

rwdavis513 commented 7 years ago

Thanks for all the input and suggestions! It's a big help.

I was able to narrow my issue down to the water pump and the pH up and pH down pumps. Publishing to any of them just once causes the Arduino and Raspberry Pi to lose sync on some of the machines. The next step is to take a closer look at the firmware for those pumps to see what could be causing the issue.

ghost commented 7 years ago

Interesting. Do you have access to an oscilloscope there? Since your problem involves inductive loads switched by relays, it wouldn't hurt to check for transients on the power supply rails and signal lines for the motors, relay board, Arduino, and Raspberry Pi.

gordonbrander commented 7 years ago

If we go with https://github.com/OpenAgInitiative/openag_brain/issues/247, we should make sure that our new arduino handler handles this case gracefully and also scans several serial ports to find the one with the Arduino.
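As a sketch of what "handles this case gracefully" might mean, the helper below re-scans the candidate ports on every attempt, so a board that re-enumerates under a new name is picked up. `connect_with_retry`, its parameters, and the stand-in `fake_open` are hypothetical. In real use `open_port` might wrap `serial.Serial`, whose `SerialException` is an `IOError` subclass. The `termios.error` in the traceback above is a different exception type and would need its own handling.

```python
import time


def connect_with_retry(list_ports, open_port, attempts=5, delay=1.0):
    """Try every candidate port, re-scanning between attempts.

    list_ports: callable returning candidate device paths; it is called
        again on each attempt so a renamed device can be found.
    open_port: callable that opens a path and raises IOError on failure.
    """
    for attempt in range(attempts):
        for path in list_ports():
            try:
                return path, open_port(path)
            except (IOError, OSError):
                continue  # try the next candidate
        time.sleep(delay)
    raise IOError("no serial device found after %d attempts" % attempts)


# Demo with stand-in callables so the sketch runs without hardware: the
# device has "moved" from ttyACM0 to ttyACM1, as described in this issue.
def fake_open(path):
    if path != "/dev/ttyACM1":
        raise IOError("cannot open " + path)
    return "connection to " + path


print(connect_with_retry(lambda: ["/dev/ttyACM0", "/dev/ttyACM1"], fake_open, delay=0))
```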

ghost commented 7 years ago

If you play your cards carefully, the scan-several-serial-ports approach might also give you another alternative to cross compilation like is being discussed over at https://github.com/OpenAgInitiative/openag_brain/issues/235.

For example, I've written my controller to work on macOS by looking for microcontrollers on /dev/cu.usbmodem* and /dev/ttyACM*. If you write your python code to list /dev and look for those device patterns, or whatever else your Arduinos might show up as on Raspberry Pi or your development machines, then maybe you can just run ROS on a laptop for developing.
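Listing /dev by pattern can be done with the standard `glob` module. This is a minimal sketch using the two patterns named above. The `PATTERNS` list and `candidate_ports` are illustrative names, and the list would need extending for whatever names your boards appear under.

```python
import glob

# Patterns mentioned above; extend for your own platform.
PATTERNS = ["/dev/ttyACM*", "/dev/cu.usbmodem*"]


def candidate_ports(patterns=PATTERNS):
    """Return sorted device paths matching any of the glob patterns."""
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    return sorted(found)


print(candidate_ports())
```

On a machine with no board attached this prints an empty list, which the caller can treat as "nothing to connect to yet".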

sp4ghet commented 7 years ago

@jakerye is a god.

https://askubuntu.com/questions/510681/how-to-make-my-udev-rule-work

In the approved answer section, they mention that there are static symlinks under /dev/serial/by-id so we just need to extract that name.
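Extracting that name could be sketched as below. The helper names are hypothetical, and searching for the substring "arduino" is an assumption: the by-id name is built from what the board reports about itself, so clones may show up under a different name. The directory is typically absent when no serial device is attached, which the sketch treats as an empty result.

```python
import os


def serial_by_id(directory="/dev/serial/by-id"):
    """Map each stable by-id name to the device node it currently points at."""
    if not os.path.isdir(directory):
        return {}
    return {
        name: os.path.realpath(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
    }


def find_by_substring(mapping, needle):
    """Return device paths whose by-id name contains needle (case-insensitive)."""
    return [path for name, path in mapping.items() if needle.lower() in name.lower()]


print(find_by_substring(serial_by_id(), "arduino"))
```

Since the symlink name stays the same when the board moves from ttyACM0 to ttyACM1, opening the by-id path directly may avoid the need to scan at all.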