Copter: Bad GPS Health Too Aggressive

cglusky commented 4 years ago

Is your feature request related to a problem? Please describe.

User gets a CRT: Bad GPS Health message if GPA Delta goes above 200ms. It's a bit scary and perhaps a bit aggressive if it just happens on occasion. I think to most users it would mean land as soon as practical with a switch to a non GPS assist mode.

This is happening on a PixRacer setup with mRo Purple GPS. It also has every serial port stuffed and full logging enabled. But it seems to get worse with more satellites so it could be similar to:

https://discuss.cubepilot.org/t/here-gps-bad-gps-health/999

So likely a serial bus a bit busy. But based on how things are trending with flow and lidar and other serial sensors, I don't think that is going to get better.

Describe the solution you'd like

Filter GPA Delta to reduce triggering the message. Say if it happens three times in three seconds then consider it Bad Health. Not sure what a good filter would look like so that's just a guess.

Describe alternatives you've considered

Per the cube pilot post you could change elevation mask in GPS so you are not seeing sats that are likely not aiding anyway. Assuming the delay is on the GPS side.

Platform [ ] All [ ] AntennaTracker [x] Copter [ ] Plane [ ] Rover [ ] Submarine

Naterater commented 4 years ago

AGREED. The Here+ GNSS unit does this at least 3x in an hour in my experience, and the message relayed to the GCS causes major concern when in reality there is really not a problem.

WickedShell commented 4 years ago

I'm going to strongly disagree with this. If you are exceeding this value the EKF is rejecting the data from position fusion, which means you are falling back to a no GPS mode in the EKF. In that mode your GPS is clearly unhealthy.

There are a couple of ways people typically get themselves into this:

1. Requesting triple constellation on an M8N GPS unit. The GPS is only specced as being able to maintain 5Hz with dual constellation; with triple constellation it just can't keep up. ~As an aside some mRo units are known to ship in this configuration which is problematic, I'm unsure if they still ship in this configuration or not.~ EDIT: This was corrected about a year ago, so my comments on the mRo side were in error.
2. Requesting 10Hz updates with dual constellation on an M8N. Again, the GPS on an M8N is only able to do 5Hz with dual constellation (i.e. GPS + GLONASS). The M8Q can do 10Hz in this situation.
3. Enabling raw logging, which in some situations can saturate the link, which can introduce delay and jitter. Since the M8 units can't do this I'm happy to rule that out in this case.

The serial bus being busy is something I think we can rule out. A serial link is dedicated to each device; it's not a multidrop network. The only way the serial devices should be able to interfere with each other is if processing the data off a link takes too long, and if you are getting delays of 25+ ms from processing serial data buffers then we have serious other issues to look at, and the GPS warning isn't the root problem.

Looking at the posted screenshots on discuss I'd guess that the GPS is trying to process too many constellations and running out of processing power, and that's causing your jitter to rise once you cross a threshold number of SV's. The other one that comes to mind is the Here2 has a processor sitting in between the ArduPilot and uBlox chips, and there may be some weird condition inducing jitter there. But given that it appears to correspond with the number of SV's I'd guess it's too many constellations/too high an update rate.

I'll tag this as a devcall topic, but I'm pretty opposed to making this any higher. (I could see possibly raising the threshold by 5ms, but I'd really like to be able to rule out the 4Hz GPS units more generally which is why I haven't done that).

cglusky commented 4 years ago

In my case they are mRo units. I will have to plug them into uCenter to see how they are configured.

Based on what I am seeing in my logs it's a single blip going to about 300 or 350ms. Does that mean my GPS is unhealthy or just temporarily busy?

Perhaps a message saying GPS Slow Response or similar and if you get too many of those in a certain timeframe then it's not healthy?

Pedals2Paddles commented 4 years ago

As far as initializing, I don't think reporting that as unhealthy is a good thing since that is false. It's not unhealthy, it's initializing. BUT, we also therefore do not know if it is healthy. So calling it healthy while initializing is also not accurate. Not Unhealthy != Healthy.

Naterater commented 4 years ago

I'm going to rule out misconfiguring unless Here+ units using default parameters (5Hz) yields consistent missing events. Remember I said maybe 3 times per hour. Not consistently every few seconds. This isn't an initializing issue IMO, it's once they are operational. Big red messages about GPS health due to a single event missing is annoying. That's 0.016% of messages at 5Hz if they happen once every 20 minutes. Is a single missing event reason for the nasty message causing user major concern?
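The percentage quoted above can be checked with a few lines of arithmetic. This is only a sketch of that calculation; the function name `missed_fraction` is made up for illustration.

```python
def missed_fraction(rate_hz, interval_s, missed=1):
    """Fraction of GPS messages missed, given `missed` losses per interval."""
    total_messages = rate_hz * interval_s
    return missed / total_messages


# One missed message every 20 minutes at a 5Hz update rate.
pct = 100 * missed_fraction(rate_hz=5, interval_s=20 * 60)
print(f"{pct:.4f}% of messages missed")
```

At 5Hz there are 6000 messages in 20 minutes, so one miss is roughly 0.017% of them, in line with the figure quoted in the comment.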

rmackay9 commented 4 years ago

On the dev call we agreed that we could/should redesign the filter used for reporting to only report an issue if there are at least two lost messages within 30 seconds. The missing GPS message should also be recorded as a counter in the PM (?) message.
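A reporting filter of the kind described here could look roughly like the following. This is an illustrative sketch, not ArduPilot code: the class name `GpsHealthFilter`, its parameters, and the 250ms threshold used in the example are all assumptions. It treats any gap between consecutive GPS messages longer than `max_delta_ms` as one slow or lost message, keeps a running counter for logging, and only reports bad health once `min_events` such gaps fall inside the window.

```python
from collections import deque


class GpsHealthFilter:
    """Hypothetical filter: unhealthy only after several slow updates in a window."""

    def __init__(self, max_delta_ms, min_events=2, window_ms=30_000):
        self.max_delta_ms = max_delta_ms  # gap that counts as a slow/lost message
        self.min_events = min_events      # slow updates needed to report bad health
        self.window_ms = window_ms        # look-back window for counting them
        self.slow_count = 0               # running total, e.g. for a log counter
        self._events = deque()            # timestamps of recent slow updates
        self._last_ms = None

    def update(self, now_ms):
        """Feed the arrival time of a GPS message; return True while healthy."""
        if self._last_ms is not None and now_ms - self._last_ms > self.max_delta_ms:
            self.slow_count += 1
            self._events.append(now_ms)
        self._last_ms = now_ms
        # Forget slow updates that have aged out of the window.
        while self._events and now_ms - self._events[0] > self.window_ms:
            self._events.popleft()
        return len(self._events) < self.min_events


# Example with an arbitrary 250ms threshold and a 5Hz (200ms) message stream.
gps_filter = GpsHealthFilter(max_delta_ms=250)
single_blip = [gps_filter.update(t) for t in (0, 200, 400, 750, 950)]
second_blip = gps_filter.update(1500)
```

With these inputs the single 350ms gap is counted but not reported, and the second gap within 30 seconds flips the result to unhealthy. Setting `min_events=3, window_ms=3000` would give the "three times in three seconds" variant suggested in the original request.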

Also we'd like to see a log of a message where the GPS is generally good but there is an occasional loss of a GPS message.

tridge commented 4 years ago

we really need a log showing this issue

cglusky commented 4 years ago

@tridge This should be a good sample. Just testing loiter and got Bad GPS Health via yaapu/frsky telem so switched to stabilize and landed.

https://drive.google.com/open?id=1RC-M0FBgtdqDJqpKhHscGgKUyyGLAtYf

At least I think that's one of the flights in question. Sorry, have three new devFrames on the bench I have been testing the last month and it's all a bit of a fog at this point. I can more than likely reproduce with a fresh flight if needed.

The reason it stuck out to me is it was flying great in loiter and the Bad GPS Health popped up and caused me to switch to stabilize and land. Although looking at that log it looks like I decided to bang it around a bit in stabilize before I landed.

WickedShell commented 4 years ago

@cglusky Your log is interesting, it's not a single bad reading, it's actually 2 in a row that are slow. Interestingly this actually corresponds to a jump/inconsistency in the GPS data output:

Figure_1

This would imply to me that the error is actually inside the GPS unit, and you actually did get bad data for this time. I can't see anything else yet that would explain this, but I'll keep looking.

cglusky commented 4 years ago

Thanks for having a look @WickedShell - Very interesting. One of my goals for 2020 is to become better at analyzing logs.

Before I posted this feature request I did consider abstracting it to entire alerting system. Figured it was a bit much. But I think it is worth noting as it is already well documented:

https://www.faa.gov/documentLibrary/media/Advisory_Circular/AC_25.1322-1.pdf

Specifically, my Bad GPS Health alert felt a bit binary when there are typically different levels of alerts - Warnings, Cautions and Advisory given as feedback in aviation systems. The challenge is finding the right balance which would obviously require some judgment from devs and feedback from the wider user community.

Having messages pop up that people start to ignore is perhaps just as dangerous as no message at all, as human factors cause most aviation accidents:

https://www.faa.gov/data_research/research/med_humanfacs/oamtechreports/2000s/media/200618.pdf

WickedShell commented 4 years ago

Part of the problem here is that the MAVLink messaging only supports a healthy/not healthy light.

Sticking with the manned aviation example though, it's actually typical to have warning lights on the panel that come on, and if they persist (or are coupled with any other abnormalities) become a land immediately item, but aren't always a land immediately. The way the information is actually displayed to you also matters a lot, as it can make it harder to tell how bad it is. Some of the more popular GCS's will continue to show you the warning for 10 seconds after it's cleared, so you have time to read it, while mine will print a warning start/stop with a timestamp, but the actual warning indicator itself will go out the moment the warning isn't valid anymore. This latter one makes it much easier to assess intermittent warnings like this.
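The two display styles described above can be contrasted with a small sketch. Nothing here comes from any real GCS; the function names and the sample data are invented for illustration. Each sample is a `(time_s, warning_active)` pair.

```python
def warning_transitions(samples):
    """Return (time_s, 'start' | 'stop') each time the warning flag changes."""
    events = []
    active = False
    for time_s, flag in samples:
        if flag != active:
            events.append((time_s, "start" if flag else "stop"))
            active = flag
    return events


def held_indicator(samples, hold_s=10.0):
    """Return (time_s, shown) for an indicator that stays lit hold_s after clearing."""
    shown_states = []
    last_active = None
    for time_s, flag in samples:
        if flag:
            last_active = time_s
        held = last_active is not None and time_s - last_active < hold_s
        shown_states.append((time_s, flag or held))
    return shown_states


# A warning that is only active for a fraction of a second.
samples = [(0.0, False), (1.0, True), (1.2, False), (5.0, False), (12.0, False)]
log_view = warning_transitions(samples)
held_view = held_indicator(samples)
```

For a brief warning like this one, the transition log shows a start and a stop 0.2 seconds apart, while the held indicator is still lit at the 5 second sample, which is why the first style makes intermittent warnings easier to judge.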

Naterater commented 4 years ago

This conversation about "BAD" things is now on two topics. Users are tired of these "BAD" warning messages that aren't really ultimately that bad. https://github.com/ArduPilot/ardupilot/pull/13457 is another similar topic with similar discussion.

cglusky commented 4 years ago

Just doing a bit of homework. It does appear there are some issues with how MAVLink handles alerting.

First, some of the messages got tied to an RFC for syslog.

https://tools.ietf.org/html/rfc5424 https://mavlink.io/en/messages/common.html#MAV_SEVERITY

That RFC could map to aviation standards.
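As a rough illustration of what such a mapping might look like: the eight `MAV_SEVERITY` values below follow the syslog levels from RFC 5424, but the grouping into Warning, Caution and Advisory is purely an assumption made for this sketch and is not defined by MAVLink or by the FAA circular.

```python
from enum import IntEnum


class MavSeverity(IntEnum):
    """MAV_SEVERITY values, which mirror the RFC 5424 syslog levels."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFO = 6
    DEBUG = 7


def alert_level(severity):
    """Hypothetical grouping of severities into aviation-style alert levels."""
    if severity <= MavSeverity.CRITICAL:
        return "Warning"    # assumed: needs immediate pilot response
    if severity <= MavSeverity.WARNING:
        return "Caution"    # assumed: needs awareness, response can follow
    if severity == MavSeverity.NOTICE:
        return "Advisory"   # assumed: awareness only
    return "None"           # INFO and DEBUG are not treated as alerts here
```

Where exactly the boundaries sit is the judgment call mentioned earlier in the thread; the point is only that the existing severity scale has enough levels to carry that distinction.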

And as @WickedShell pointed out the health status is binary which does not help us much... https://mavlink.io/en/messages/common.html#SYS_STATUS https://mavlink.io/en/messages/common.html#MAV_SYS_STATUS_SENSOR
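To make the "binary" point concrete, here is a minimal sketch of how a receiver might decode GPS health from the three sensor bitmasks carried in `SYS_STATUS`. The bit value for `MAV_SYS_STATUS_SENSOR_GPS` is taken from the MAVLink common message set; the helper `gps_state` is made up for this example.

```python
MAV_SYS_STATUS_SENSOR_GPS = 0x20  # bit 5 of the sensor bitmasks


def gps_state(present, enabled, health):
    """Decode GPS status from the SYS_STATUS present/enabled/health bitmasks."""
    if not (present & MAV_SYS_STATUS_SENSOR_GPS):
        return "absent"
    if not (enabled & MAV_SYS_STATUS_SENSOR_GPS):
        return "disabled"
    # One bit per sensor: there is no room for "degraded" or "slow".
    if health & MAV_SYS_STATUS_SENSOR_GPS:
        return "healthy"
    return "unhealthy"


state_ok = gps_state(present=0x23, enabled=0x23, health=0x23)
state_bad = gps_state(present=0x23, enabled=0x23, health=0x03)
```

Because each sensor gets a single health bit, a one-off slow GPS update and a GPS that has stopped responding look the same to the GCS.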

And CAN nodes appear to have their own mapping... https://mavlink.io/en/messages/common.html#UAVCAN_NODE_HEALTH

Feels like the foundation is there but it just needs to be standardized. Easy for me to say.

Sekilsgs2 commented 3 years ago

Hi.

On 4.1 I always get this warning: about 70 errors in 30 minutes. 4.0.7 doesn't have this problem!

What I found: on 4.0.7, AP shows these warnings when I download logs from dataflash. I think this is because downloading data from flash blocks other threads. In 4.1 I think there are very poor optimisations, so many threads can't run with proper latency, and maybe some IRQs are lost or have the wrong priority. I think this is the main problem behind the many internal errors in 4.1.

andyp1per commented 3 years ago

@Sekilsgs2 what board is this on?

Sekilsgs2 commented 3 years ago

> what board is this on?

Mamba F405 MK2. Officially there is only 4.1 for it, but I compiled 4.0.7 and am using that now, because on 4.1 b5 I had a crash: the FC rebooted in flight.

ArduPilot / ardupilot

Copter: Bad GPS Health Too Aggressive #13459