commaai / openpilot

openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system in 275+ supported cars.
https://comma.ai/openpilot
MIT License

openpilot failure after overtemp startup #28202

Closed jyoung8607 closed 8 months ago

jyoung8607 commented 1 year ago

Describe the bug

Had openpilot freak out on me earlier today. Don't know if the overtemp factor is causative or merely correlated.

I actually could not terminate this easily without unplugging the C3. Shutting off the car didn't break the cycle, and I didn't quite manage to get the UI to scroll down to the reboot button before it "crashed" again.

Provide a route where the issue occurs

3cfdec54aa035f3f|2023-05-15--14-59-48

openpilot version

717bc04ddc330c43e794f28ee6ff3a287425112e

Additional info

Unmodified master as of a couple days ago. I haven't tried to do much analysis, other than noting that there aren't any UI crash dumps uploaded. It also looks like both forward cameras stopped encoding (fcam/ecam plus qcams) but dcams kept going.


adeebshihadeh commented 1 year ago

Unfortunately, your device is one of the few affected by https://github.com/commaai/openpilot/pull/25959. Blocking startup in this case was actually pretty successful in keeping the CPU/GPU <=90C. The PMIC crossed 100C within a few seconds, and I suspect that caused some of these issues, though this should be handled more gracefully.

jyoung8607 commented 1 year ago

Although it isn't explicitly logged, we can infer that emergency thermal mitigation via CPU hotplug took place about 32 seconds into the drive. Losing a substantial fraction of our compute does explain the behavior.

(openpilot-py3.8) jyoung@DESKTOP-6JPRDTA:~/openpilot/selfdrive/debug$ ./filter_log_message.py "3cfdec54aa035f3f|2023-05-15--14-59-48" | grep affine
[153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5
[153674.796111] MAIN 0 kernel - IRQ238 no longer affine to CPU5
[153674.796161] MAIN 0 kernel - IRQ239 no longer affine to CPU5
[153674.796400] MAIN 0 kernel - IRQ240 no longer affine to CPU5
[153674.799377] MAIN 0 kernel - IRQ241 no longer affine to CPU5
[153674.799485] MAIN 0 kernel - IRQ242 no longer affine to CPU5
[153674.799538] MAIN 0 kernel - IRQ243 no longer affine to CPU5
[153674.799587] MAIN 0 kernel - IRQ244 no longer affine to CPU5
[153674.799635] MAIN 0 kernel - IRQ245 no longer affine to CPU5
[153674.799701] MAIN 0 kernel - IRQ565 no longer affine to CPU5
[153674.799763] MAIN 0 kernel - process 229086 (selfdrive.contr) no longer affine to cpu5
[153674.810485] MAIN 0 kernel - process 229087 (selfdrive.contr) no longer affine to cpu5
[153676.025024] MAIN 0 kernel - process 229031 (camerad) no longer affine to cpu6
[153676.025519] MAIN 0 kernel - process 229109 (RoadCamera) no longer affine to cpu6
[153676.025839] MAIN 0 kernel - process 229039 (camerad) no longer affine to cpu6
[153676.025985] MAIN 0 kernel - process 229110 (WideRoadCamera) no longer affine to cpu6
[153676.026148] MAIN 0 kernel - process 229108 (DriverCamera) no longer affine to cpu6
[153676.026302] MAIN 0 kernel - process 229107 (camerad) no longer affine to cpu6
[153676.096124] MAIN 0 kernel - process 229036 (ZMQbg/IO/0) no longer affine to cpu6
[153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7
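For anyone who wants to summarize output like the above rather than eyeball it, here is a minimal sketch that groups the "no longer affine" messages by CPU. The line format is assumed from the log excerpt above, and `summarize_affinity_loss` is a hypothetical helper name, not something that exists in openpilot.

```python
import re
from collections import defaultdict

# Assumed line shapes, taken from the excerpt above:
#   [153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5
#   [153674.799763] MAIN 0 kernel - process 229086 (selfdrive.contr) no longer affine to cpu5
AFFINE_RE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\].*?\b(?:IRQ\d+|process \d+ \((?P<name>[^)]*)\))"
    r" no longer affine to cpu(?P<cpu>\d+)",
    re.IGNORECASE,
)


def summarize_affinity_loss(lines):
    """Group 'no longer affine' messages by CPU index.

    Returns {cpu: {"first_ts": float, "irqs": int, "processes": [str, ...]}}.
    Lines that do not match the assumed format are skipped.
    """
    summary = defaultdict(lambda: {"first_ts": None, "irqs": 0, "processes": []})
    for line in lines:
        m = AFFINE_RE.search(line.strip())
        if m is None:
            continue
        entry = summary[int(m.group("cpu"))]
        ts = float(m.group("ts"))
        # Track the earliest timestamp at which each CPU lost work.
        if entry["first_ts"] is None or ts < entry["first_ts"]:
            entry["first_ts"] = ts
        if m.group("name") is not None:
            entry["processes"].append(m.group("name"))
        else:
            entry["irqs"] += 1
    return dict(summary)
```

Feeding it the lines above (for example, by reading them from `sys.stdin`) would give one entry each for CPUs 5, 6 and 7, which makes the order in which the cores went offline easier to see.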

> Unfortunately, your device is one of the few affected by https://github.com/commaai/openpilot/pull/25959.

How many is a few? Rather than spend time on software, does it make sense for me to send this device in for rework?

jyoung8607 commented 1 year ago

@adeebshihadeh do you still consider this an open issue? All the misbehaviors probably trace back to the CPU hotplug thermal mitigation, and several recent updates to thermald make it less likely we'll reach that point.
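For reference, a specific check for this condition could be sketched roughly as below. This is only an illustration and not how openpilot or thermald works today: it reads the standard Linux sysfs file `/sys/devices/system/cpu/online`, and both helper names and the assumption of eight cores (CPU0 through CPU7, going by the log above) are mine.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel cpulist string such as '0-4,6' into a set of CPU indices."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_cpus(expected_count=8, online_path="/sys/devices/system/cpu/online"):
    """Return the CPU indices in range(expected_count) that are not online.

    expected_count=8 is an assumption based on the CPU numbers in the log.
    """
    online = parse_cpu_list(Path(online_path).read_text())
    return set(range(expected_count)) - online
```

A non-empty result from `offline_cpus()` would indicate that one or more cores had been hotplugged out, whether for thermal or other reasons.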

adeebshihadeh commented 1 year ago

Yes, I’d still like to handle this specific scenario better.

adeebshihadeh commented 8 months ago

Thought about it more, and don't think a specific check makes sense. We already check the things that matter (processes lagging, crashing, etc.)