UbiquityRobotics / ubiquity_motor

Package that provides a ROS interface for the motors in UbiquityRobotics robots
BSD 3-Clause "New" or "Revised" License
24 stars 23 forks source link

Baud rate bug #154

Closed MoffKalast closed 2 years ago

MoffKalast commented 2 years ago

Yes it's still a thing, at least on the kinetic image. It especially has a propensity to happen during client support calls.

Are we sure we couldn't add some kind of reconnect attempt assuming the connection fails? Does that not solve anything?

mjstn commented 2 years ago

I have seen this happen after a shutdown -r now on the 2021-9-30 ez-map image. it is intermittant and only once so far but the telltale sign is massive number of ROS logs pound into the log with the word in upper case of REJECT

mjstn commented 2 years ago

The best comments on debug of this were from issue 95 Below is the good comment from #95

We have a semi-rare intermittent bug where ubiquity_motor starts an instance of motor_serial near line 90 in motor_hardware.cc.

We have seen that sometimes after many starts from cold power off we get wheels dont work.

Symptoms:

ROS topic /diagnostics will not be publishing items for 'Firmware Version' or 'Firmware Date' and more because motor_node cannot talk to the MCB The ROS log will be riddled with a great many serial errors seen with a reject message with REJECT in the message The serial in and serial out leds look normal and keep blinking An oscilloscope looking t the serial in to the board from the PI seen on pin 5 of the 6 pin FDDI jack, P508 will show the Pi is running at 115200 buad and not 38400 baud. So pulse widths seen on that pin will be about 8.7 micro sec, not 26 micro sec sudo stty -F /dev/ttyAMA0 will show 38400 baud but that is clearly not the rate the Pi is sending at How to reproduce this bug: (This defect may not happen at all or may happen a high percentage of the time)

Remove the delay in ubiquity_motor code for motor_hardware.cc very near line 90 if it is in your code. I am removing the extra ROS_INFO statements AND the ros:Duration line so it looks like the code did when we discovered this bug. We keep this code in our repository because it has been seen to help avoid the bug so to better force the bug e revert this out.

From a cold all off state power up the robot. Wait till serial is initialized which on rev 5.2 boards and later can be seen by the point where the SIN Led, L702, starts to blink very fast. This led is near center top just below the Aux jack 12V pin. Note that a second little led called SOUT blinks right away but the SIN led typically takes 30 sec or more to start blinking.

After this point there are several ways to see if the bug is active. The simplest way is to tell the robot to move and if it does NOT move then check ros log for a great many words of 'REJECT'. Another way is use 'rostopic echo /diagnostics | grep -A 2 -e 'Firmware Date'' and if you see no output check for 'REJECT' message.

Software Seen To Show This Defect:

This defect has been seen on a Magni with rev 5.2 MCB running a clean 2020-02-10 image on a Raspberry Pi 3B+ cpu Workaround That generally works:

If the motor_hardware.cc source file inserts a 2 second delay prior to setup of serial instance this usually fixes the bug. One unit had to have a 4 second delay. The delay is placed near line 90 in the file. A 0.5 sec delay has been show to fix the problem on some units.

MoffKalast commented 2 years ago

Perhaps as a current workaround we could have the system autoreboot once it finds a REJECT message, in that case the user would think the startup just took longer and it would no longer be a major issue.

After all that's the only thing one can really do to fix it if it happens, so might as well save people the time and headache of figuring it out.

mjstn commented 2 years ago

Here is a current case where it is happening a high percentage of bootups at Calix. Contact is Eric Leal This is from a failure and it shows a 4 second delay then serial init then immedicate massive REJECT in log (the signature)

1642181235.382000035 INFO [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_hardware.cc:136(MotorHardware) [topics: /rosout, /diagnostics] Delay before MCB serial port initialization 1642181235.898818016 ERROR [map_route_worker.py:336(update) [topics: /ctrover1/route_list, /tf_static, /ctrover1/landmarks, /ctrover1/route_change, /rosout, /tf] 'int' object has no attribute 'translation' 1642181235.958828926 INFO [spawner:116(main) [topics: /rosout] Controller Spawner: Waiting for service controller_manager/load_controller 1642181239.232007980 ERROR [map_route_worker.py:336(update) [topics: /ctrover1/route_list, /tf_static, /ctrover1/landmarks, /ctrover1/route_change, /rosout, /tf] 'int' object has no attribute 'translation' 1642181240.382426581 INFO [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_hardware.cc:139(MotorHardware) [topics: /rosout, /diagnostics] Initialize MCB serial port '/dev/ttyAMA0' for 38400 baud 1642181240.383220555 INFO [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_hardware.cc:144(MotorHardware) [topics: /rosout, /diagnostics] MCB serial port initialized 1642181240.390080987 WARN [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_serial.cc:120(SerialThread) [topics: /rosout, /diagnostics, /ctrover1/left_error, /ctrover1/right_error, /ctrover1/left_tick_interval] REJECT 00 1642181240.390333614 WARN [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_serial.cc:120(SerialThread) [topics: /rosout, /diagnostics, /ctrover1/left_error, /ctrover1/right_error, /ctrover1/left_tick_interval] REJECT 00

mjstn commented 2 years ago

Here is a working case (as far as baud rate goes)

1642182252.726320665 INFO [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_hardware.cc:136(MotorHardware) [topics: /rosout, /diagnostics] Delay before MCB serial port initialization 1642182253.265739917 INFO [spawner:116(main) [topics: /rosout] Controller Spawner: Waiting for service controller_manager/load_controller 1642182253.838459014 ERROR [map_route_worker.py:336(update) [topics: /ctrover1/route_list, /tf_static, /ctrover1/landmarks, /ctrover1/route_change, /rosout, /tf] 'int' object has no attribute 'translation' 1642182254.970916570 WARN [/opt/ros/kinetic/include/tf/message_filter.h:496(MessageFilter::checkFailures) [topics: /rosout, /tf, /ctrover1/pose, /ctrover1/map, /ctrover1/distance, /ctrover1/patch, /ctrover1/poses, /ctrover1/path, /ctrover1/iris_lama_ros/is_mapping, /ctrover1/landmarks, /ctrover1/landmark_map] MessageFilter [target=odom ]: Dropped 100.00% of messages so far. Please turn the [ros.iris_lama_ros.message_notifier] rosconsole logger to DEBUG for more information. 1642182257.171041965 ERROR [map_route_worker.py:336(update) [topics: /ctrover1/route_list, /tf_static, /ctrover1/landmarks, /ctrover1/route_change, /rosout, /tf] 'int' object has no attribute 'translation' 1642182257.726574953 INFO [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_hardware.cc:139(MotorHardware) [topics: /rosout, /diagnostics] Initialize MCB serial port '/dev/ttyAMA0' for 38400 baud 1642182257.727297576 INFO [/home/ubuntu/catkin_ws/src/ubiquity_motor/src/motor_hardware.cc:144(MotorHardware) [topics: /rosout, /diagnostics] MCB serial port initialized

mjstn commented 2 years ago

A key clue is the delay hack in motor_hardware.cc near line 136. We need to again put effort into getting this to happen, perhaps by delay removal or use of other values like 1 or 2 sec and so on. We need a way to aggrivate this then fix root cause.

// Insert a delay prior to serial port setup to avoid a race defect. // We see soemtimes the OS sets the port to 115200 baud just after we set it ROS_INFO("Delay before MCB serial port initialization"); ros::Duration(5.0).sleep(); ROS_INFO("Initialize MCB serial port '%s' for %d baud" ...

mjstn commented 2 years ago

Adding branch addSerialErrorManagement This minimizes the massive log spamming from the baud rate bug while adding a couple error count readers for serial class and allows clearing of error counters if desired.

mjstn commented 2 years ago

Adding Vid not really as owner but so he can better track this issue

mjstn commented 2 years ago

Desparate Thought: Do things to set default bauds all to 38400 whereever possible how about in /boot/config.txt uncomment and set baud to 38400

Darn formatting is preventing full text so this line gets uncommented like below init_uart_baud=38400

mjstn commented 2 years ago

Now working on and adding serial port re-initialize ability in branch addSerialErrorManagement This is enabled when motor_control.cc has non-zero REINIT_MCB_SERIAL_REJECT_COUNT

mjstn commented 2 years ago

I am unable to get the failure after 20 tries. I have asked the customer to do the /boot/config.txt mod as it seems very safe and the customer gets this failure frequently.

JanezCim commented 2 years ago

as requested in https://github.com/UbiquityRobotics/robot_testing/issues/18#issuecomment-1034137361 ive set init_uart_baud=38400 for all UR images. Do you need the image with this fox uploaded for testing? If yes, please tell if if you need base or dev image (dev has desktop manager)