commaai / openpilot

openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system in 275+ supported cars.
https://comma.ai/openpilot
MIT License
49.26k stars 8.98k forks source link

GM: steering faults #21557

Closed Verylukyguy closed 1 year ago

Verylukyguy commented 3 years ago

While driving, a "Steering Temporarily Unavailable" error will display and the Comma Two will no longer engage. It also show a "Cruise Fault" error.

Comma Two Dongle ID: e38a02cae1ef1e0c Route: e38a02cae1ef1e0c|2021-07-11--18-54-39 at 19:03:43 Route: e38a02cae1ef1e0c|2021-07-11--19-05-29 at 19:12:59 Timestamp: The error came at the end of the first route at 19:03:43 and pulled into a parking lot and reset the vehicle and there was another error at 19:12:59 at the end of the second route. They are from one trip, where I had to reset the vehicle and Comma after the first error. Version: Comma.ai Master .0.8.6 GMC Acadia 2018

If you pull over and turn off the vehicle and allow the vehicle and Comma device to reset, approximately three minutes, you can restart the vehicle and reengage OpenPilot.

My understanding is that this is a LKA timing issue and has been on-going on multiple GM vehicles.


This is information that I received from someone else about this issue.

The issue only occurs when a lane keep assist message is sent less than 20 milliseconds after the one before it My theory: The code currently has a buffer that contains only the last Lane Keep Assist message it does not have a queue. Another part of the software pulls this buffer close to exactly every 20 milliseconds. Should two messages arrive within a 20 millisecond period, the earlier message is probably replaced by the newer one before the steering column had a chance to read it So there's kind of two ways to fix this. Tighten up the timing - which is what the code currently does by moving the message generation onto the panda and using a timer on the panda. The other option would be to simply skip three messages until you were back at the correct counter value. Three messages is only 60 milliseconds Nobody's going to notice a darn thing the car or you.

adeebshihadeh commented 3 years ago

Please add more relevant information to the issue. How often does this happen? Is there a repeatable procedure to reproduce it? Does it affect all GMs?

Verylukyguy commented 3 years ago

It happens often. I was not watching the CAN data in real time, but, in my example route, I was accelerating from a stop when the road shifts slightly to the right. Many people with GM vehicle's (i.e. Volts and Bolts) have experienced the same issue. There is a partial community fix on JShuler's fork, but it involves a timer on the Panda and a Buffer to regulate the LKA message timing.

jyoung8607 commented 3 years ago

The explanation from jshuler is possible, but I would be somewhat surprised. If the recipient can't tolerate 50Hz (every 20ms) messages occasionally arriving 19.9ms apart, the system as a whole would be running on the ragged edge of failure at all times. CAN just doesn't offer that level of hard real time guarantee, or any protection against clock drift among nodes of the network. FlexRay gets closer to offering this.

However, that doesn't mean timing problems can't happen. There could still be an overall problem with clock drift between the EPS and openpilot. Or, I've seen at least one case where a control module expects one of its input messages to have a counter field that reflects/copies the counter of its own output messages, as an interlock or validity check. If the GM EPS truly gets upset about timing to that extent, ASCMLKASteeringCmd transmission along with its counter could be synced to every other PSCMSteeringAngle message instead of the EON/C2 openpilot clock, or similar.

I wouldn't mind taking a look at those routes, if you're willing to mark them as both preserved and public in useradmin so they don't time out of cloud storage and the community can see them.

https://my.comma.ai/useradmin/?onebox=e38a02cae1ef1e0c (this link will only work for you)

Mark the routes preserved in the main list of routes (UI will take a moment to confirm, be patient), and then click on each route and click on "make public" in turn.

Verylukyguy commented 3 years ago

I have marked those route reserved and public as well as a few more, all of them have errors at some point.

Thank you for any help that you can give.

I used a custom branch of JShulers fork for 8 months and drove thousands of miles with no errors, so I can personally verify that his code worked. Why it worked, I don't know, I'm an Insurance Examiner.

We will be traveling 2700+ miles in a little over a week and would love to use my Comma Two without this issue.

Thanks again.

JMPZ11 commented 2 years ago

I can create a device to mitigate this issue that is completely independent of OP and the Panda.

Background

The issue is partially a bug in the Panda code, partially a bug in the steering column. The Panda drops LKAS frames randomly by design and currently doesn't even look at the rolling counter value, knowingly creating gaps in the rolling counter. This should never have worked and only does because many (most?) CANBUS modules tolerate rolling counter gaps. This tolerance was taken for granted and we have been lucky until now. GM's enforcement of rolling counter sequence is not a bug - it should have been this way from the start. I would argue this should be fixed in the Panda - Comma disagrees. The other part - GM's bug - is that the problematic steering column doesn't read CANBUS messages correctly, and drops frames if they come in too fast. Adjusting OP's message rate helps, but doesn't reduce the error rate to zero.

This is no transient, dismissible fault - LKAS is dead. Imagine if this were the emergency braking system? Such a fault disables the LKAS functionality entirely without a complete power-cycle of the steering module (pull over, turn off car, open and close driver door, wait in silence for 3 minutes)

The issue is easily fixed completely on the Panda with minimal impact, however we are battling a dogmatic "safety code only on the panda" and "LKAS faults are not a safety issue". Written in stone. Think 10 commandments.

(I suspect if this fault threw a DTC we wouldn't even be having this discussion and it would already be fixed...)

Any fix will require changes in the Panda - OR another device that can act as a pacemaker for LKAS messages.

Proposed solution - "Pacemaker Dongle/device"

A pacemaker-type device is placed somewhere between the Panda and the steering module that fixes the integrity of the rolling counter values and prevents LKAS messages from arriving too quickly.

This device could be placed (off the top of my head)

There are more possibilities I'm sure, as well as pluses and minuses for each option...

The Question

Before dedicating the effort to developing such a thing, I need to know if this would be acceptable to Comma.ai.

Would you be willing to include vehicles as Community Supported if they required such a corrective device?

The entire goal is to expand the list of supported vehicles for openpilot in a way that doesn't involve custom forks. Apologies for any other vibes.

Thanks!

JMPZ11 commented 2 years ago

Also just want to make sure I get this down - it may be possible to provide "online" resetting of the PSCM simply using a switched fuse extender (similar to the L&P harness). I haven't confirmed the PSCM has it's own fuse, but it probably does...

jyoung8607 commented 2 years ago

Per our earlier conversations, we need to find something that works within Panda's role as a mostly passive circuit-breaker. I think we've all agreed on slower-send as part of the solution, so we can implement that at any time.

For the rest, I had initially thought to solve in-flight control message clipping in Panda by letting Panda keep forwarding for one or two extra messages after the transition, and this is something comma was potentially open to, pending review of the proposed implementation. I haven't had time to work on it yet.

In the meantime, @gregjhogan reached out with an extremely clever approach: use the bus 128-129-130 messages that Panda forwards back after confirming the message has passed safety controls. Have openpilot carstate watch the bus 128 version of ASCMLKASteeringCmd.RollingCounter and (with a bit of timeout/recovery logic) make sure openpilot doesn't send messages faster than appropriate, and make sure the counter value is incremented exactly +1 from the last message that was verified to pass safety.

I don't have time to work on this immediately, and probably won't for a couple weeks at least, but I wanted to write up Greg's idea really quick in case you'd like to try it for yourself.

jyoung8607 commented 2 years ago

Couldn't get this off my mind, so I put together a quick test based on stock openpilot/panda with GM send-rate slowed from 50Hz to 25Hz (with scaled ramp-rates to compensate) plus Greg's idea for dealing with Panda drops due to controls on/off messages passing in-flight. Turns out to be very simple. This has zero onroad test time, but seems legit in local and CI testing, except for the expected process replay diffs.

https://github.com/jyoung8607/openpilot/commit/ded457376ad4d85cc415867a3009f5184ff2637f

I've sent a link for this over to @Verylukyguy and we'll see if it helps him. If we need to do more here to sync counters with the stock camera at startup time, we can do that, but thus far I've not been told of major problems with faults at startup, only later while driving.

jyoung8607 commented 2 years ago

I still want a lot more testing time, but positive feedback from a 15 minute drive this morning with a lot of deliberate disengage/engage cycles. The bit of red is due to a separate, known issue in the port's handling of gas press during preEngage.

image

adeebshihadeh commented 2 years ago

Are we still seeing issues after #22404?

Verylukyguy commented 2 years ago

I got Steering Temporary Unavailable errors and LKAS fault last night.

It was on a 17 mile drive on a branch that qadmus made for me that is a clean version of 0.8.11 and his FeedForward steering function. An interesting thing is that when the Steering Temporarily Unavailable errors came up, it stopped turning, which is bad, but it cleared itself and started turning again.

It has never done anything like this before.

The following 35 mile drive had no errors or faults.

I did preserve the drive; it is the 17.4748 mile drive: e38a02cae1ef1e0c|2021-11-06--18-42-33

jyoung8607 commented 2 years ago

No confirmed hard EPS faults that I'm aware of.

@Verylukyguy told me he ran into some faults a few days ago, but they should be unrelated; I only mention it because I think he might reply to this (edit: in fact he just did). It was during a live tuning effort with @qadmus, and it was a different class of fault (temp steering fault, and I'm told it self-recovered quickly while driving). They may have been pushing some soft limits in the rack, I didn't get confirmation of what exactly they were doing. We separated temp vs permanent fault handling in #21641. This issue covers the permanent fault that can't be recovered without a restart.

jyoung8607 commented 2 years ago

For posterity:

image

Verylukyguy commented 2 years ago

Did you review the drive to actually find out why?

It was not during a live tuning effort.

The LKAS Fault did require a restart.

jyoung8607 commented 2 years ago

The drive in question isn't fully uploaded. Can you un-preserve it and make sure your device is online? Preserving ordinarily is indeed helpful, but if you preserve before all data is uploaded, what's there is frozen.

gregjhogan commented 2 years ago

With the latest panda firmware you should be able to find any blocked messages on bus 192/193/194 and if the safety mode blocking messages is causing issues you may be able to compensate in openpilot.

jyoung8607 commented 2 years ago

I just installed a clean clone of Comma's cutting-edge master branch and got two LKAS Faults in less than a half hour.

Master -0.8.11 - Lanefull

Route: e38a02cae1ef1e0c|2021-11-22--21-16-51

Logs for that route aren't fully uploaded. Can you go into useradmin for that route and hit the "upload all rlogs" button? If your device is online, it should show up right under the list of segments.

gregjhogan commented 2 years ago

I see a message blocked by panda (bus 192) on addr 0x180 @ t=4047.409

If that is the problem, it seems like it should be pretty easy to read this rejected message bus in openpilot and either resend these messages quickly (rejection is instant) or adjust the counter appropriately each time a message is blocked.

ramonbrugman commented 2 years ago

Learn to obtain the rest of the obstacles in the way :/ I saw ir happen an hour ago 😂

adeebshihadeh commented 2 years ago

@Verylukyguy can you make sure to upload all the logs, preserve, and mark public all the rotues you post here so anyone interested can look into them? Upload the logs at connect.comma.ai using the "Files" button and mark as public/preserved at useradmin.comma.ai.

Verylukyguy commented 2 years ago

Encountered another "LKAS Fault: Restart car to engage"

Comma Two

e38a02cae1ef1e0c|2022-04-08--16-31-03--0

Master v0.8.14 @ 0f61a38

Verylukyguy commented 2 years ago

At the beginning of this route, my device when into some kind of Steering Temporarily Unavailable loop.

It has never done this before and maybe there is new information to be gained from this.

Is it possible that @sshane could look into this?

e38a02cae1ef1e0c|2022-04-14--16-54-29--0

Thank you

Verylukyguy commented 2 years ago

LKAS Fault while stopped at a traffic light behind a lead.

Near the end of this route.

e38a02cae1ef1e0c|2022-04-15--16-40-17--30

*Edited to reference the correct segment of the route.

Verylukyguy commented 2 years ago

Yes, you are correct; I did not know that I could tag a segment of a route. Sorry. I see that now. Thanks.

ncoop23 commented 1 year ago

I don't know if this error is relevant but I'm adding here. Had a LKAS fault on latest master. Had something similar on release 3 over the weekend.

0def4a390f6fe5c0|2022-10-25--19-17-40--0 near the end of the drive

ncoop23 commented 1 year ago

Another LKAS error yesterday on Dec 1 or 2 master-ci 0def4a390f6fe5c0|2022-12-02--19-25-28--37

sshane commented 1 year ago

Should be fixed in master. Let me know if it isn't

stevenaa123isme commented 11 months ago

This issue is striking again on a 2022 audi q3.

sshane commented 11 months ago

@stevenaa123isme This GM issue isn't related to VW/Audi. I suggest opening a new issue, @jyoung8607 know a lot about VW and may be able to help.