PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License
8.17k stars 13.36k forks source link

[Bug] `uavcan_status` command causes CAN errors due to printf calls while mutex locked #22871

Open oystub opened 5 months ago

oystub commented 5 months ago

Describe the bug

Running the uavcan status command to check status and error counts of the uavcan module causes actual CAN errors to occur. This can be extremely confusing when debugging CAN bus errors.

Diagnosis

The problem is in the function UavcanNode::print_info(), which holds a mutex lock on _node_mutex during the entire runtime of the function. However, due to a lot of printf calls, the lock is held for a long time, blocking other UavcanNode function calls, which causes errors to occur.

https://github.com/PX4/PX4-Autopilot/blob/1ad83a80025c6ec719eb5c2ab2025619c7eff497/src/drivers/uavcan/uavcan_main.cpp#L1025-L1096

On a CubeOrange flight controller with 1.15 firmware, the runtime of the function is ~110 ms ., which is way too slow. Of note, _mixing_interface_esc.mixingOutput().printStatus(); and _mixing_interface_servo.mixingOutput().printStatus(); contribute to over 60% of the runtime. In my testing, the likelyhood/amount of errors caused seems to be closely related to the runtime and the amount of traffic on the bus.

I haven't tested extensively on how long of a lock duration is "OK", to not provoke errors even under heavy load, but I get errors even from 20 ms lock duration. Luckily, by doing printf calls without holding the lock, the lock duration can be orders of magnitude shorter.

Solution:

I am working on a solution to this. In order to resolve this properly, the code needs to:

  1. Lock mutex
  2. Copy all relevant info from mutex-guarded variables
  3. Release the mutex
  4. Do printf with the captured data.

However, I am a bit unsure on how to implement this in a nice way, while keeping all the current output. It either involves a lot of temporary variables/structs for storing the data. Also, some of the print statements are nested within several layers of function calls, and even across other modules, such as the esc and servo printStatus() functions mentioned above, which are in the mixer module, and the virtual IUavcanSensorBridge::print_status() function, implemented for several different sensor types. Either, we need to define status structs, or make it safe to call print_status(); on them without lock.

Also, does anyone have a good overview on what variables are guarded by this mutex? Is there a px4 convention for indicating this? I am unsure if all the variables and calls performed actually need to be guarded.

To Reproduce

Starting from default configuration (which has UAVCAN_ENABLE=2)

  1. Connect a UAVCAN device.
  2. Run uavcan status in the MAVLink Console. Observe that the displayed error count is closely linked to how many times uavcan status is invoked.

Expected behavior

uavcan_status should not cause errors on the CAN Bus.

Screenshot / Media

Here, two Here4 GPS modules are connected, one to each CAN port of the CubeOrange. Observe how the error count increases only when running uavcan_status. The initial error count of 31 happens during system boot, and is normal. However, the increase after this is caused by running uavcan status.

https://youtu.be/q3-vChqiNmg

Flight Log

N/A

Software Version

Affects all versions 1.9.0 -> current (as of 2024-03-12)

I have tested specifically on: PX4 git-hash: c5fde63440b5c90413c94a2b084b253b876246b7 PX4 version: 1.15.0 40 (17760320)

Flight controller

Cube Orange

DronecodeBot commented 5 months ago

This issue has been mentioned on Discussion Forum for PX4, Pixhawk, QGroundControl, MAVSDK, MAVLink. There might be relevant details there:

https://discuss.px4.io/t/px4-community-q-a-march-13-2024/37190/2

dakejahl commented 5 months ago

Lock mutex Copy all relevant info from mutex-guarded variables Release the mutex Do printf with the captured data.

Yup do this.

Also, some of the print statements are nested within several layers of function calls, and even across other modules, such as the esc and servo printStatus()

Lock/unlock before/after the printStatus() call. Same goes for the _sensor_bridges print_status().

For the sake of a uavcan status call (likely manually typed) I don't think it really matters if all the data is captured under a single lock (synchronized state).

oystub commented 5 months ago

@dakejahl

Also, some of the print statements are nested within several layers of function calls, and even across other modules, such as the esc and servo printStatus()

Lock/unlock before/after the printStatus() call. Same goes for the _sensor_bridges print_status().

The problem here is that the each individual printStatus() calls takes too much time in order to not cause errors, so it really isn't an option to hold the lock during that call. Most of the required fields are available as public get functions anyway, so I will probably use the same procedure of getting info and then printing from the print_info command itself. Also, usless we also wait after releasing the mutex, there is no guarantee that we don't just keep the lock?

While holding the mutex, no esc or servo updates will be done, so if someone were to run this command to check can bus status while in flight, it would be a huge issue. In addition to the printf calls taking a lot of time the common mutex creates a big priority inversion problem. I will also look a bit into running uavcan status at a higher priority.

Also, how much of a problem would it be to just not do mutex locking during the status print? As you say, we don't care about capturing the data under a single lock. The status printing is only doing reads, and except for the interface error and transfer counts, the longest value is a uint16_t, so we shouldn't get garbled data from data races. This gets rid of the mutex locking overhead as well, but it is really hard to to properly test and document these restrictions so we don't get weird errors in the future.

In the end, I guess the good design principle here is to always make data available though getters, and never only through a "print status" function. When I have the time, I will do that for this module in a PR :)