Be able to change mission_item retry timeout for long distance communication

Crowdedlight commented 5 years ago

Describe problem solved by the proposed feature: This will avoid flooding the telemetry link with repeated requests caused by longer communication times. We experienced the following when doing mission uploads at greater distances (50-200 m from ground-control):

--> Sending mission count
<-- Getting mission_request_int(0)
--> Sending mission_item_int(0)
<-- Getting mission_request_int(0)
<-- Getting mission_request_int(0)
<-- Getting mission_request_int(1)
--> Sending mission_item_int(1)
<-- Getting mission_request_int(1) ...
<-- Getting mission_accepted
<-- Getting mission_accepted
<-- Getting mission_accepted (Often end up getting 4-5 acknowledges back)

So, because of the retry timeout and the time it takes to send the information, we get flooded with requests for an item that has already been sent. If my ground-control reacts to every request, as the protocol dictates, then I will have sent the same item multiple times, which in turn produces more responses. In total this delayed our mission upload by 5-8 seconds due to telemetry bandwidth. The mission was only 6 waypoints. If we changed the ground-control so that it sent every item only once, the mission upload would be 5-8 s faster. This is a major gain when doing dynamic path-planning.
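The "send every item only once" idea above could look roughly like this on the ground-control side. It is only a minimal sketch: `ItemRequestFilter`, `holdoff_s` and `should_send` are hypothetical names, not part of PX4, QGC or any MAVLink library, and the 1-second hold-off is an arbitrary placeholder. Note that it deliberately deviates from answering every request.

```python
import time


class ItemRequestFilter:
    """Suppress re-sending a mission item that was sent very recently.

    Hypothetical helper for a custom ground station; not a MAVLink API.
    """

    def __init__(self, holdoff_s=1.0, clock=time.monotonic):
        self.holdoff_s = holdoff_s  # assumed value, tune to the link latency
        self._clock = clock
        self._last_sent = {}  # seq -> time the item was last sent

    def should_send(self, seq):
        """Return True if the item for this request should be (re)sent."""
        now = self._clock()
        last = self._last_sent.get(seq)
        if last is not None and now - last < self.holdoff_s:
            # Duplicate request that is probably a retry sent while our
            # previous answer was still in flight: ignore it.
            return False
        self._last_sent[seq] = now
        return True
```

A request for the same item that arrives after the hold-off has expired is still answered, so an item that was genuinely lost on the link gets re-sent.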

Describe your preferred solution: Make it possible to set a parameter that determines the retry timeout for mission items, so that users can optimise the timeout for their ground-control, distance and use-case.

Describe possible alternatives: Increasing the mission timeouts directly in the code. However, this would deviate from the default MAVLink protocol, so making it a parameter would probably be a better solution.

Additional context: As far as I can see, other commands have a timeout of 5 s and thus should be fine for longer distances. So far we have only experienced the problem with mission upload.

dagar commented 5 years ago

I'll look at the timeout configurability on both sides (PX4 + QGC), but could you also provide more details of your setup?

What's the data link (radio, wiring, flow control?) and how is it configured (mode, data rate, etc)? I'm wondering if a certain portion of this is mavlink congestion control in general.

Mission Protocol for reference - https://mavlink.io/en/services/mission.html

Crowdedlight commented 5 years ago

Setup: Using two SiK mRo modules (433 MHz): https://store.mrobotics.io/mRo-SiK-Telemetry-Radio-V2-433Mhz-p/mro-433sikv2-mr.htm

Modules are using stock antennas on GCS side, connected with USB directly to the computer. The other module is connected to TELEM1 on the Pixhawk. Antenna on drone is pointing straight down for better reception over distance. Antenna on GCS is elevated about 0.5m above ground with antenna pointing straight up.

Settings for the modules are posted in the image below. You do have a good point about flow parameters. We haven't tried to change flow-control parameters.

As we use our own developed ground station, it could have a bug that causes this. However, with a direct USB connection to the PX4 we do not experience this error; everything then works as intended with the small timeout.

I could test it through QGroundControl if there is a way to see how many requests I receive from the UAV when uploading missions?

[Image: mro_settings]

Tested: This was experienced while the drone was flying 50-60 m away from the GCS. But it was also experienced indoors with the antennas 5-7 m away from each other.

Another group in the same project experienced the same issue with their identical radios and PX4. They tested on their platform in the same conditions as we did.

LoRa: As projects begin to push for BVLOS flights over multiple kilometres using LoRa modules for telemetry, I believe it is necessary to be able to set timeouts optimized for your data-link. A LoRa module with spreading factor 9 and a bandwidth of 125 kHz gives an air-time of 369.6 ms for a mission_item_int packet (45 bytes total). Increasing the spreading factor further for longer range increases the air-time substantially. This shows that the air-time delay alone could trigger the current timeout in long-range BVLOS applications.
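For anyone who wants to play with the numbers, here is a rough sketch of the time-on-air formula from Semtech's SX127x datasheet. The defaults (coding rate 4/5, 8 preamble symbols, explicit header, CRC on) are my assumptions, not necessarily what the sheet uses, and `lora_airtime_ms` is just an illustrative name.

```python
import math


def lora_airtime_ms(payload_bytes, sf, bw_hz, coding_rate=1,
                    preamble_symbols=8, explicit_header=True, crc=True,
                    low_data_rate_opt=False):
    """Time on air in ms, per the Semtech SX127x datasheet formula.

    coding_rate: 1..4 meaning 4/5..4/8. All defaults are assumptions.
    """
    t_sym_ms = (2 ** sf) / bw_hz * 1000.0          # symbol duration
    t_preamble_ms = (preamble_symbols + 4.25) * t_sym_ms
    de = 1 if low_data_rate_opt else 0             # low data rate optimisation
    ih = 0 if explicit_header else 1               # implicit header flag
    numerator = (8 * payload_bytes - 4 * sf + 28
                 + 16 * (1 if crc else 0) - 20 * ih)
    denominator = 4 * (sf - 2 * de)
    payload_symbols = 8 + max(
        math.ceil(numerator / denominator) * (coding_rate + 4), 0)
    return t_preamble_ms + payload_symbols * t_sym_ms
```

With these assumed defaults, `lora_airtime_ms(45, 9, 125_000)` comes out at roughly 308 ms. That is not the same as the 369.6 ms from the sheet, presumably because the radio settings differ, but it is still above 250 ms, and going to spreading factor 10 makes it longer again.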

We didn't manage to test this issue on our LoRa modules (Adafruit Feather 32u4 LoRa), due to other issues with them. However the point about LoRa and BVLOS still stands.

Calculations based on this sheet: https://docs.google.com/spreadsheets/d/1QvcKsGeTTPpr9icj4XkKXq4r2zTc2j0gsHLrnplzM3I/edit#gid=0

dagar commented 5 years ago

@Crowdedlight do you have a log corresponding to the test?

Each mavlink instance in PX4 publishes telemetry_status, which will tell us a bit more about how things are working in your particular setup.

Crowdedlight commented 5 years ago

@dagar I do not have a usable log from the test flight 50-60 m away while uploading: we had enabled custom logging for irlock but hadn't realised we had to add the default topics as well.

I have just recreated the problem inside, with another pixhawk, but the same modules.

Test setup: Pixhawk powered on, using the two SiK mRo modules to talk to the computer. [Image: img_20181222_124052]

Console output: Our ground-control software gave this output showing the requests it received. Btw, we are using the MAVLink 1.0 protocol; we haven't made the switch to 2.0 yet.

[ INFO] [1545478738.122716934]: Serial Port initialized
[ INFO] [1545478754.722952674]: Sending mission clear all
[ INFO] [1545478755.082871285]: MAV_MISSION_ACCEPTED
[ INFO] [1545478764.053189768]: New mission. Length:8
[ INFO] [1545478764.053221495]: Sending mission count
[ INFO] [1545478764.323083892]: Next item asked for:0
[ WARN] [1545478765.833407408]: Mission ACK timed out
[ INFO] [1545478766.383383069]: Next item asked for:1
[ INFO] [1545478766.623360547]: Next item asked for:2
[ INFO] [1545478766.853411114]: Next item asked for:3
[ INFO] [1545478766.933048111]: Next item asked for:4
[ INFO] [1545478767.302966269]: Next item asked for:4
[ INFO] [1545478767.312897644]: Next item asked for:5
[ INFO] [1545478767.512992749]: Next item asked for:5
[ INFO] [1545478767.513084487]: Next item asked for:6
[ INFO] [1545478767.653436772]: Next item asked for:6
[ INFO] [1545478767.653573871]: Next item asked for:7
[ INFO] [1545478767.873081542]: Next item asked for:7
[ INFO] [1545478767.873205688]: MAV_MISSION_ACCEPTED
[ INFO] [1545478768.103197218]: MAV_MISSION_ACCEPTED

It is not as bad as at longer distances. However, there are still double requests at times, which I assume means the timeout gets triggered.

Log: Do note that the GPS did not have a fix in this test, and the sensors and radio were not calibrated or connected.

https://review.px4.io/plot_app?log=015ca8ec-683b-4f96-b486-34f51a04ce54

Crowdedlight commented 5 years ago

I am not sure what the etiquette is regarding bumping issues or removing the stale mark from them. But I believe this issue is still valid and unfixed.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

julianoes commented 5 years ago

@Crowdedlight bumping is fine, thanks. And thanks for the issue and detail. The problem is just that it's not very likely this will be fixed soon if it's not affecting us as much as you.

Looking at the problem I have one question: is it always the same requests that trigger retries or is it randomly everywhere? If it is random, I would argue that it works as it should given some messages might be dropped. If it's consistent then maybe something is not implemented correctly protocol-wise and needs fixing.

Crowdedlight commented 5 years ago

@julianoes That is fair enough. We have also mitigated it ourselves by implementing the receiving end on a companion computer, and then doing the usual upload from the companion computer to the PX4 over a cabled connection.

If you are talking about the individual item request messages, it is random. I am not doubting that the mission protocol is working as intended. The issue is that the delay the telemetry link induces can be more than 250 ms. (The airtime alone for LoRa communication can be more than 250 ms.)

When the timeout is triggered but the right item arrives right after, the request for that item has already been repeated because of the delay. As it instantly sends a request for the next item, you start to have parallel requests and items in flight that aren't needed. This will saturate your wireless link and in our tests can even cause a mission to never complete the upload due to the saturated link. Some times this actually ends up with as many as 4 parallel requests/items getting transmitted at the same time. This increases the messages sent by a factor of 4 and means that at the end I receive back 4 mission acknowledge messages.
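To illustrate how the factor grows, here is a deliberately simplified model (my own sketch, not how PX4 is actually implemented): the vehicle sends a request at t=0, repeats it every 250 ms, and stops once the item arrives after one round trip. No packet loss or jitter is modelled, and `requests_per_item` is a hypothetical name.

```python
import math


def requests_per_item(round_trip_ms, retry_timeout_ms=250.0):
    """Requests the vehicle emits for one item in a simplified model.

    Assumes the first request goes out at t=0, is repeated every
    retry_timeout_ms, and stops as soon as the item arrives at
    t=round_trip_ms. No loss, no jitter.
    """
    if round_trip_ms <= 0:
        return 1
    # Retries fire at timeout, 2*timeout, ... strictly before the item arrives.
    retries = math.ceil(round_trip_ms / retry_timeout_ms) - 1
    return 1 + retries
```

In this model a 200 ms round trip gives 1 request per item, 600 ms gives 3, and 900 ms gives 4. If the ground-control answers every request, each item is sent that many times.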

The unnecessary doubling of messages really impacts long-range low bandwidth communication links.

hamishwillee commented 5 years ago

FWIW the retry behaviour is only a recommendation. @julianoes To my mind, being able to tune this makes a lot of sense for medium-latency systems. It is a QoS issue (and as a side-point, any compatibility testing on the protocol should not assume retry rates are as constrained as they currently are).

@Crowdedlight PRs are very welcomed :-)

julianoes commented 5 years ago

this will saturate your wireless link and in our tests can even cause a mission to never complete the upload due to the saturated link. Some times this actually ends up with as many as 4 parallel requests/items getting transmitted at the same time.

Hm, this "should" not happen, that's terrible. We need to either fix the protocol or the implementation (or both). I would need to look at it in detail.

And a parameter to at least tweak the timeout would also make sense.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

PX4 / PX4-Autopilot

Be able to change mission_item retry timeout for long distance communication #11078