bitcraze / crazyflie-lib-python

Python library to communicate with Crazyflie
Other
262 stars 895 forks source link

Implement quality stats #469

Open knmcguire opened 2 months ago

knmcguire commented 2 months ago

Cpp link already has certain quality stats implemented that looks at amount of packets sent, how many acks have been returned and the delay of the messages being received (like here: https://github.com/bitcraze/crazyflie-link-cpp/pull/28). We should mirror this in cflib such that we can look at the communication in a more qualitative way.

Currently broadcasting is not implemented in the cflib so we can't mirror that, but anything unicast we can ofcourse.

knmcguire commented 2 months ago

Some extra information in the connection.h of the statistics . We should try to replicate these same stats

Currently I'm assuming that a connection statistic is per 1 crazyradio <-> 1 crazyflie pair

knmcguire commented 2 months ago

In the cflib there are some link quality functionality already, which looks at the number of retries within a certain sliding window. So it returns a percentage of the retries done?

https://github.com/bitcraze/crazyflie-lib-python/blob/be04130ac89acdbe7bd3b26684eb95db753ee564/cflib/crtp/radiodriver.py#L610-L618

Git blame and it has been implemented 8 years ago, but a typical thing that kind of dissappeard in memory.... but it has been used in the cfclient though! See number 3 in the main gui: https://www.bitcraze.io/documentation/repository/crazyflie-clients-python/master/userguides/userguide_client/#main-ui

I guess the question is, since we've got to leave this to prevent things from breaking, would we want to add some things on the side of this since cpp link has more values? Also broadcasting hasn't been implemented yet either so we can't add all (see #472 )

knmcguire commented 2 months ago

Alright, so the thing here is, that cpp link looks at the full amount of packets send over the entire connection which is perhaps a bit much, but this looks at the retry percentage the last 100 packets which is way too little... But I kind of like the sliding window approach.

How about we collect the stats of the last 3 seconds (perhaps configurable)? And what would we like to monitor in that sense? Technically retries is an way to get a sense of the link quality as it does that if there is no ack anyway, the difference is that perhaps the crazyflie link cpp is doing the retries in code and the cflib sets the nordic chip settings for retries? Which means that there is a clear difference and perhaps we need to be talking about some standards than regular counts.

Update: cpp link does also setArc to 3 as well (see this permalink). See here the settings for cflib:

https://github.com/bitcraze/crazyflie-lib-python/blob/be04130ac89acdbe7bd3b26684eb95db753ee564/cflib/drivers/crazyradio.py#L153-L158

whoenig commented 2 months ago

Note that at the link-level it's "total" count, but then the client side can easily implement a sliding-window approach. in crazyflie-server-(cpp) it's a 1second window (which basically just looks at the difference of the "total" counts of now vs. one second ago). I believe the window-size was configurable there, too.

knmcguire commented 2 months ago

Thanks for the explanation @whoenig. We had a discussion today with @ataffanel @gemenerik @ToveRumar and we determined that the link quality as it has been implemented on the Cflib side is not a good indication of the actual link quality as it only looks at number of retries of the nrf (which is either a 100 good or bad). We could also go more towards how crazyswarm2 but we had doubts on if that would be the best values to look at, so we let that a bit lose and just made a list (from highest to lowest priority):

  1. Latency in ms (which is also in Crazyswarm2) from lib/link to STM and back
  2. Up/down link bandwidth congestion in percentage which is an indication of congestion (and more congestion means more latency)
  3. RSSI of both Crazyflie and Crazyradio (CR 2 has a better measure for RSSI than CR1)
  4. Lost uplink/downlink percentage: How many packets are lost from CR to CF (uplink) and CF to CR (downlink). The idea is to look at those separately but unfortunately we can't be sure of which was uplink or downlink unless safelink is used.

Latency There is a latency test in the crazyflie_testing that we could look at, which uses the echo channel. We can draw inspiration that in there: https://github.com/bitcraze/crazyflie-testing/blob/aeeda61e2a2ba8ef60c1e78d8aa71fdc8ad046cc/tests/QA/test_radio.py#L63-L99

It looks to the actual time that it takes for one packet to be sent and for the ack of that packet to come back. That is collected and the minimum latency is returned as result. That by itself is strange though... as I would say that the max or at least the average would be more suitable.

At one point we could differentiate between the latency of the usb radio, nrf-air-nrf, nrf-stm, but let's first get the 'big picture' latency in first.

Bandwidth There is also a bandwidth test crazyflie-testing as well! Here it uses the echo channel to congest the channel: https://github.com/bitcraze/crazyflie-testing/blob/aeeda61e2a2ba8ef60c1e78d8aa71fdc8ad046cc/tests/QA/test_radio.py#L102-L135. What it does is to send 500 packets and see how long it takes for it to return. But this says more about the packet latency average (which is related to bandwidth) than actual bandwidth congestion. Also it seems to have been disabled for the test rig, as it maybe wasn't telling that much perhaps.

@ataffanel mentioned that a better way is to look at actual null packets send for both up and down link versus all packets sends, as that will give a better idea of how congested the bandwidth actually is.

RSSI The crazyradio to crazyflie has a good RSSI measurement that we can log. For the Crazyradio PA it's only a 0-1 indication and for Crazyradio 2.0 it is more specified than that but it might be behind a build flag, but it might be good to have both measurements available.

Having said that, RSSI is perhaps a indication of how strong the signal strenght is, it does not say how congested the communication bandwidth is or how many other desturbances are in the area, so by itself it is not a good link quality indicator but it will be a good 'side kick' for the others. Moreover it will be quite easy to implement as this is an value that comes from the NRF chip directly and doesn't require a measuring window (unless the average is necessary)

uplink/downlink packet loss Currently without safelink, we will not know if a packet is lost in the uplink (Crazyflie never receives it) or the packet is lost on downlink. The retry that indicates the linkquality in the cflib is one indication but it doesn't differentiate in where it did go wrong.

This is a good metric but we have put it on the lowest priority as we think that the other values are more useful in determining actual link quality, since packetloss is something that is intergral to the NRF-NRF communication and should work the best out of all of these.

Next steps So next step is to start with latency and work our way down. Since the 3rd week is coming up, we will save this for the next development cycle starting on the 22th of October, so I'll update our planning to do this first and push the usb-protocol to after this (since we need these metrics anyway to see if we are improving anything.

whoenig commented 2 months ago

Great summary. Just a few comments.

Latency: Using the minimum is often done in the networking community. In Crazyswarm2, we "inject" the measurement once per second (user-configurable) and then report the resulting latency unfiltered. This is also done to not take away too much bandwidth just by measuring the latency.

Bandwidth: I actually like and use a simple packet count (as counted on the Crazyflie firmware itself) of received unicast and broadcast packets. This tells you a lot, especially when a radio is shared across multiple CFs. In Crazyswarm2, we show (side-by-side) the sent packets, so one can pretty easily see if there is any significant packet loss (when looking at broadcasts, at least).

I agree that RSSI doesn't really provide very meaningful values. At least good RSSI values do not always indicate a stable link. Poor RSSI values often do correlate with a poor link. I never tried to get values for up/downlink, so not sure how useful it would be in practice - neat idea, though.

knmcguire commented 2 months ago

So the packet loss is something that we can measure separately from the bandwidth congestion based on ratio of null packets send to and received from the Crazyflie.

Today I was able to do a quick proof of concept to check null packets versus total packets for both the up and downlink and it was pretty accurate. (see this commit) As soon as I used a teleop from the client the uplink congestion went up and as soon enabled a lot of logging lot, the downlink congestion went up. Once we have both latency and bandwidth congestion implemented we can see what happens if we communicate with more Crazyflies or do stress tests with the communication. But currently still ongoing work just to see if we got it at the right end.

For the latency, we choose to look at the 95 percentile instead since that is actually more common these days but I'll let @gemenerik comment on that since he's been working on that now.

knmcguire commented 2 months ago

Also, I found out that in the Cflib this piece of code with the ifstatement of the relaxation of the connection doesn't really work anymore... https://github.com/bitcraze/crazyflie-lib-python/blob/418a68160e464396d9002579c0292b59ca1e78ef/cflib/crtp/radiodriver.py#L634-L646.

Even the null packets have a data length of the channel 0xf3 with a bit countdown, so best is to check that first data point with a bitmask for the data in like:

mask = 0b11110011
empty_ack_packet = int(data[0]) & mask
if empty_ack_packet == 0xF3:
  print('empty packet!')

We should probably make an issue of this? Or do we want to even have to fix this since it perhaps never even worked in the first place maybe... we can perhaps do something better now we are getting better stats.