OpenAgricultureFoundation / openag_brain

ROS package for controlling an OpenAg food computer
GNU General Public License v3.0
221 stars 68 forks source link

rosserial: pi loses sync with the arduino very frequently #167

Closed jakerye closed 7 years ago

jakerye commented 7 years ago

Summary

The pi loses sync with the arduino very frequently. It seems to break when a lot of setpoints get commanded at the same time. Maybe rosserial or something in our handling of it breaks when there are a lot of messages? Maybe a serial buffer overflow? Possible fix could be to move the relay board to usb with:

Docker logs

If you look at the time stamps below they happen very frequently.

[INFO] [WallTime: 1492730209.500290] Setup subscriber on /actuators/pump_2_nutrient_b_1/cmd [std_msgs/Float32]
[INFO] [WallTime: 1492730209.526321] Setup subscriber on /actuators/chiller_compressor_1/cmd [std_msgs/Bool]
[INFO] [WallTime: 1492730209.548251] Setup subscriber on /actuators/air_flush_1/cmd [std_msgs/Float32]
[INFO] [WallTime: 1492730209.566007] Setup subscriber on /actuators/chamber_fan_1/cmd [std_msgs/Bool]
[INFO] [WallTime: 1492730209.594364] Setup subscriber on /actuators/pump_4_ph_down_1/cmd [std_msgs/Bool]
[INFO] [WallTime: 1492730261.221862] Setup publisher on /internal_diagnostics [openag_brain/DiagnosticArray]
[INFO] [WallTime: 1492730261.242477] Setup publisher on /sensors/ds18b20_1/temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492730261.262350] Setup publisher on /sensors/mhz16_1/air_carbon_dioxide/raw [std_msgs/Int32]
[INFO] [WallTime: 1492730261.279576] Setup publisher on /sensors/am2315_1/air_humidity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492730261.299621] Setup publisher on /sensors/am2315_1/air_temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492730261.315564] Setup publisher on /sensors/water_level_sensor_high_1/is_on/raw [std_msgs/Bool]
[INFO] [WallTime: 1492730261.336859] Setup publisher on /sensors/atlas_ph_1/water_potential_hydrogen/raw [std_msgs/Float32]
[INFO] [WallTime: 1492730261.360886] Setup publisher on /sensors/atlas_ec_1/water_electrical_conductivity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492737565.377940] Setup publisher on /internal_diagnostics [openag_brain/DiagnosticArray]
[ERROR] [WallTime: 1492744662.738160] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744677.742697] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744692.745558] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744707.748252] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744722.751666] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744737.753845] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744752.758321] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744767.760632] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744782.764312] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744797.766584] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744812.769104] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744827.771741] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744842.774092] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744857.777418] Lost sync with device, restarting...
[ERROR] [WallTime: 1492744872.779524] Lost sync with device, restarting...
[ERROR] [WallTime: 1492751862.259719] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759060.403469] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759075.406117] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759090.408374] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759105.411778] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759120.414319] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759135.417050] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759150.421139] Lost sync with device, restarting...
[ERROR] [WallTime: 1492759165.424264] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766262.802228] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766277.805069] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766292.808092] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766307.810252] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766322.813159] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766337.816054] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766352.819220] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766367.822204] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766382.825883] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766397.828340] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766412.831135] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766427.833176] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766442.836600] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766457.841149] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766472.844139] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766487.846746] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766502.849327] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766517.852130] Lost sync with device, restarting...
[ERROR] [WallTime: 1492766532.855367] Lost sync with device, restarting...
[INFO] [WallTime: 1492737565.394214] Setup publisher on /sensors/ds18b20_1/temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492737565.415554] Setup publisher on /sensors/mhz16_1/air_carbon_dioxide/raw [std_msgs/Int32]
[INFO] [WallTime: 1492737565.431446] Setup publisher on /sensors/am2315_1/air_humidity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492737565.452618] Setup publisher on /sensors/am2315_1/air_temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492737565.468636] Setup publisher on /sensors/water_level_sensor_high_1/is_on/raw [std_msgs/Bool]
[INFO] [WallTime: 1492737565.488649] Setup publisher on /sensors/atlas_ph_1/water_potential_hydrogen/raw [std_msgs/Float32]
[INFO] [WallTime: 1492737565.513985] Setup publisher on /sensors/atlas_ec_1/water_electrical_conductivity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492744872.816235] Setup publisher on /internal_diagnostics [openag_brain/DiagnosticArray]
[INFO] [WallTime: 1492744872.836352] Setup publisher on /sensors/ds18b20_1/temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492744872.852310] Setup publisher on /sensors/mhz16_1/air_carbon_dioxide/raw [std_msgs/Int32]
[INFO] [WallTime: 1492744872.873573] Setup publisher on /sensors/am2315_1/air_humidity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492744872.889541] Setup publisher on /sensors/am2315_1/air_temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492744872.909584] Setup publisher on /sensors/water_level_sensor_high_1/is_on/raw [std_msgs/Bool]
[INFO] [WallTime: 1492744872.930810] Setup publisher on /sensors/atlas_ph_1/water_potential_hydrogen/raw [std_msgs/Float32]
[INFO] [WallTime: 1492744872.954822] Setup publisher on /sensors/atlas_ec_1/water_electrical_conductivity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492751862.286815] Setup publisher on /internal_diagnostics [openag_brain/DiagnosticArray]
[INFO] [WallTime: 1492751862.307030] Setup publisher on /sensors/ds18b20_1/temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492751862.324399] Setup publisher on /sensors/mhz16_1/air_carbon_dioxide/raw [std_msgs/Int32]
[INFO] [WallTime: 1492751862.344308] Setup publisher on /sensors/am2315_1/air_humidity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492751862.360605] Setup publisher on /sensors/am2315_1/air_temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492751862.381692] Setup publisher on /sensors/water_level_sensor_high_1/is_on/raw [std_msgs/Bool]
[INFO] [WallTime: 1492751862.401636] Setup publisher on /sensors/atlas_ph_1/water_potential_hydrogen/raw [std_msgs/Float32]
[INFO] [WallTime: 1492751862.422881] Setup publisher on /sensors/atlas_ec_1/water_electrical_conductivity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492759165.461404] Setup publisher on /internal_diagnostics [openag_brain/DiagnosticArray]
[INFO] [WallTime: 1492759165.477664] Setup publisher on /sensors/ds18b20_1/temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492759165.497736] Setup publisher on /sensors/mhz16_1/air_carbon_dioxide/raw [std_msgs/Int32]
[INFO] [WallTime: 1492759165.514893] Setup publisher on /sensors/am2315_1/air_humidity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492759165.534905] Setup publisher on /sensors/am2315_1/air_temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492759165.550909] Setup publisher on /sensors/water_level_sensor_high_1/is_on/raw [std_msgs/Bool]
[INFO] [WallTime: 1492759165.576206] Setup publisher on /sensors/atlas_ph_1/water_potential_hydrogen/raw [std_msgs/Float32]
[INFO] [WallTime: 1492759165.596207] Setup publisher on /sensors/atlas_ec_1/water_electrical_conductivity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492766532.884668] Setup publisher on /internal_diagnostics [openag_brain/DiagnosticArray]
[INFO] [WallTime: 1492766532.901335] Setup publisher on /sensors/ds18b20_1/temperature/raw [std_msgs/Float32]
[INFO] [WallTime: 1492766532.921829] Setup publisher on /sensors/mhz16_1/air_carbon_dioxide/raw [std_msgs/Int32]
[INFO] [WallTime: 1492766532.937812] Setup publisher on /sensors/am2315_1/air_humidity/raw [std_msgs/Float32]
[INFO] [WallTime: 1492766532.959107] Setup publisher on /sensors/am2315_1/air_temperature/raw [std_msgs/Float32]
[ERROR] [WallTime: 1492773462.367955] Lost sync with device, restarting...
[ERROR] [WallTime: 1492780660.506308] Lost sync with device, restarting...
[ERROR] [WallTime: 1492780675.508669] Lost sync with device, restarting...
[ERROR] [WallTime: 1492780690.511489] Lost sync with device, restarting...
[ERROR] [WallTime: 1492780705.514986] Lost sync with device, restarting...
[ERROR] [WallTime: 1492780720.517253] Lost sync with device, restarting...
[ERROR] [WallTime: 1492780735.519357] Lost sync with device, restarting...
gordonbrander commented 7 years ago

Perhaps this is related... from @rwdavis513:

After much debugging this weekend, the pfcs with the new software wouldn't be able to access the serial port anymore after 30-45minutes. Perhaps there was something that needs to change in the cmake or it needs to be run again. Here's a post with the error: http://answers.ros.org/question/233355/rosserial-running-please/

The error:

Traceback (most recent call last):
 File "/home/pi/catkin_ws/install/lib/rosserial_python/serial_node.py", line 80, in <module>
   client = SerialClient(port_name, baud)
 File "/home/pi/catkin_ws/install/lib/python2.7/dist-packages/rosserial_python/SerialClient.py", line 382, in __init__
   self.requestTopics()
 File "/home/pi/catkin_ws/install/lib/python2.7/dist-packages/rosserial_python/SerialClient.py", line 389, in requestTopics
   self.port.flushInput()
 File "/usr/lib/python2.7/dist-packages/serial/serialposix.py", line 500, in flushInput
   termios.tcflush(self.fd, TERMIOS.TCIFLUSH)
termios.error: (5, 'Input/output error')
rwdavis513 commented 7 years ago

This appears to be a convoluted issue and should probably be separated out into several sub issues?

gordonbrander commented 7 years ago

ph pump command has caused loss of sync before. This can also be caused by a number of factors:

Basically for nearly any problem with rosserial you get the same opaque error.

gordonbrander commented 7 years ago

Duplicate has some other possibly useful info https://github.com/OpenAgInitiative/openag_brain/issues/240

ghost commented 7 years ago

@gordonbrander As requested, some debugging suggestions...

My bet is that you've got a combination of root causes from the physical layer on up to rosserial. Or, at least if you want to get to the bottom of the rosserial errors, you'll need to assume that you might have all of these problems and work through the possibilities to rule stuff out.

Ideas for relatively non-invasive diagnostic stuff:

  1. Do the math to get a ballpark estimate how much of the mega's processor time ought to be spent on serial I/O. If you know your baud rate (???), your processor speed (16MHz), and about how many bytes the protocol should be sending per second (???), you should be able to get at least a lower bound (e.g. assume receiving or transmitting 1 serial byte needs 1 AVR instruction cycle). For example, if you found that serial I/O would need a minimum of 95% of your AVR instruction cycles, then that's your problem, and you'll need to revise the design so it isn't so chatty on the serial port.

  2. Use a logic analyzer to take advantage of all the spare pins on the mega for profiling: For each firmware module, set a unique pin high when its loop function starts and low when it exits. I think a lot of your access to GPIO pins is probably blocked by the sensor board, but you might be able to partially unplug the headers just far enough to insert small probes. Or, you could tap the ribbon cable going off to the relay board. Worst case, build another shield with breakout pins to go between the mega and the sensor board.

    Your goal here is to sanity check how fast the pins are cycling through your loop. If you add up the timing delays in your firmware modules, the pins should cycle about that fast. If the cycle keeps moving, but it's too slow, you might be spending a lot of time in interrupt handlers. If it gets stuck in one firmware module, you might have blocking I/O stuck in some kind of infinite loop.

  3. Use the pin and logic analyzer trick around any code where you disable interrupts or make a call to delay(). Alternately, use an LED on this pin--if it's turned on enough for you to see it, then you're spending a whole lot of time with interrupts disabled, and it's going to mess up your serial port communications.

  4. Put a logic analyzer or scope on any output pins that are supposed to use PWM or short timed pulses (chiller, doser pumps, etc.). Look for jitter, stretching, or other glitches that indicate you might have problems with interrupt handlers stepping on each other's toes.

  5. Put a scope on your I2C SCL and SDA lines and see what the waveforms look like compared to the I2C specs. If you've got access to a logic analyzer that can decode I2C and maybe flag errors, then use that too. From what I read, it's fairly easy to rig up cables that have too much capacitance. If your rise times on SDA and SCL are marginal, you could be getting what amounts to random noise on SDA or SCL at the mega, the sensors, or all of the above. If that's happening, then sensors could be going into weird modes, and your sensor library code in the firmware will be at the mercy of whatever error handling and recovery it has (my guess is, from reading some of it, not much).