LorenzMeier opened this issue 5 years ago
@baumanta I would suggest a review session with @bkueng and @julianoes on the current instrumentation, and adding MAVLink drop counts (when the MAVLink sequence number has holes) to the telemetry_status log message.
@bkueng @julianoes We need something like RADIO_STATUS as a new MAVLink packet that is sent back to the autopilot by mavlink_router and contains instrumentation data: how much packet drop it sees and what RX and TX rates it sees. That would allow us to see what both sides of the link see and infer where the issues are. QGC should also send this message back, which would then allow us to locate breakages.
@catch-twenty-two I'm assigning you to this. Please re-assign if you switch context to something else.
> @bkueng @julianoes We need something like RADIO_STATUS as a new MAVLink packet that is sent back to the autopilot by mavlink_router and contains instrumentation data: how much packet drop it sees and what RX and TX rates it sees. That would allow us to see what both sides of the link see and infer where the issues are. QGC should also send this message back, which would then allow us to locate breakages.

This would also enable us to implement congestion control at the MAVLink level for things like UDP, or serial without CTS/RTS or RADIO_STATUS. https://github.com/PX4/Firmware/issues/10618#issue-365452413
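A rough sketch of the kind of per-node payload such a message could carry (the struct and all field names here are purely illustrative, not an existing MAVLink definition; the real message would be defined in the MAVLink XML and emitted by mavlink_router and QGC back towards the autopilot):

```cpp
#include <cstdint>

// Hypothetical payload sketch only -- none of these names exist in a published dialect.
struct LinkNodeStatus {
	uint32_t tx_rate_bytes_s;    // bytes/s this node is currently sending on the link
	uint32_t rx_rate_bytes_s;    // bytes/s this node is currently receiving on the link
	uint32_t messages_sent;      // cumulative MAVLink messages sent over the link
	uint32_t messages_received;  // cumulative MAVLink messages received over the link
	uint32_t messages_lost;      // receive drops inferred from gaps in the seq field
	uint16_t rx_parse_errors;    // framing/CRC errors seen by the parser
	uint8_t  tx_buf_overruns;    // TX buffer overruns, useful as a congestion hint
};
```

With each node on the path (autopilot, mavlink_router, QGC) reporting something like this back, the counters can be compared hop by hop to localize where the drops actually happen.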
@dogmaphobic FYI
@dagar I won't be able to get to this until next week or so. What would be the ideal information you want back from the (any) router/bridge? And QGC while at it...
> @dagar I won't be able to get to this until next week or so. What would be the ideal information you want back from the (any) router/bridge? And QGC while at it...
The data in the QGC connection status would be a good start.
Where this potentially gets a bit messy is routing. I'm also wondering if the uint8_t seq is insufficient beyond the simple point-to-point, slow serial radio case.
> The data in the QGC connection status would be a good start.

I assume a new set of messages to convey this data?

> Where this potentially gets a bit messy is routing. I'm also wondering if the uint8_t seq is insufficient beyond the simple point-to-point, slow serial radio case.
It is an issue if more than 255 messages are lost in one go. The code detects wrap-arounds and takes them into account, which is fine as long as fewer than 255 sequential messages are lost. Making it 16 bits would reduce the possible wrap-around errors considerably, but that involves changing a very basic counter in all MAVLink messages. A bit too much of a change, no?
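For reference, a minimal sketch of how drop counting from the 8-bit seq field typically looks (an illustration of the wrap-around handling described above, not the exact PX4 code; it assumes at most one wrap happens between two consecutively received packets):

```cpp
#include <cstdint>

// Per-(sysid, compid) tracker; illustration only.
struct SeqTracker {
	bool     initialized{false};
	uint8_t  last_seq{0};
	uint32_t lost{0};

	void on_message(uint8_t seq)
	{
		if (initialized) {
			// Unsigned 8-bit subtraction handles the 255 -> 0 wrap implicitly:
			// a difference of 1 means no loss, a difference of N means N-1 drops.
			const uint8_t diff = static_cast<uint8_t>(seq - last_seq);

			if (diff > 0) {
				lost += diff - 1;
			}
			// diff == 0: either a duplicate or exactly 256 messages were lost --
			// this is precisely the ambiguity an 8-bit counter cannot resolve.
		}

		last_seq = seq;
		initialized = true;
	}
};
```

If an outage spans 256 or more messages, a counter like this silently under-reports by a multiple of 256, which is where the 16-bit (or cumulative-counter) idea comes from.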
> Where this potentially gets a bit messy is routing. I'm also wondering if the uint8_t seq is insufficient beyond the simple point-to-point, slow serial radio case.
>
> It is an issue if more than 255 messages are lost in one go. The code detects wrap-arounds and takes them into account, which is fine as long as fewer than 255 sequential messages are lost. Making it 16 bits would reduce the possible wrap-around errors considerably, but that involves changing a very basic counter in all MAVLink messages. A bit too much of a change, no?
Yes, that would be a difficult change. I'm just concerned that 255 is actually not that many messages to miss on these fast connections. Obviously this needs to be tested one way or the other, but maybe sending the overall sent and received counts in a mavlink (or link?) status message would help fill the gap (if any).
Update: WIP branch https://github.com/mavlink/mavlink/tree/pr-mavlink_link_status
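To make the cumulative-counter idea concrete (possibly along the lines of what the WIP branch above explores): if each end periodically reports how many messages it has sent and received, loss in each direction can be inferred without depending on the 8-bit per-packet seq at all. As a rough order-of-magnitude check, at ~2000 messages/s on a fast link, 255 messages is only about 130 ms of traffic, so seq-based counting alone can easily under-count. The names below are illustrative, not an existing API:

```cpp
#include <cstdint>

// What the remote end reports about itself (illustrative names).
struct PeerReport {
	uint32_t messages_sent;      // messages the peer transmitted towards us
	uint32_t messages_received;  // messages the peer received from us
};

// Our own counters for the same link.
struct LocalCounters {
	uint32_t messages_sent;      // messages we transmitted towards the peer
	uint32_t messages_received;  // messages we received from the peer
};

// Messages we sent that the peer never saw (loss towards the peer).
static inline uint32_t lost_towards_peer(const LocalCounters &local, const PeerReport &peer)
{
	return local.messages_sent - peer.messages_received;
}

// Messages the peer sent that we never saw (loss towards us).
static inline uint32_t lost_towards_us(const LocalCounters &local, const PeerReport &peer)
{
	return peer.messages_sent - local.messages_received;
}
```

At any instant the numbers are only approximate, since messages still in flight show up as "lost" until the next report, but averaged over a reporting interval they give a direction-resolved loss figure, and 32-bit counters won't realistically wrap.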
Have you found any workarounds for this issue? I'm currently facing a similar issue when running MAVROS. It happens consistently when the Pixhawk is rebooted, and sometimes after just running for a while. We see "TM : RTT too high for timesync" on the MAVROS side, and when invoking mavlink status in NSH we see that packets are being dropped and the "tx rate mult" coefficient decreases to less than 1.
The current MAVLink UART link implementation is optimized for low-bandwidth links and telemetry, and was not originally architected for log streaming or other high-throughput applications. This introduces a number of artefacts when running a fast link:
Instead of jumping on all of this and never getting out of the WIP state, I want to address this in a structured fashion: