Closed jyoung8607 closed 8 months ago
Unfortunately, your device is one of the few affected by https://github.com/commaai/openpilot/pull/25959. Blocking startup in this case was actually pretty successful in keeping the CPU/GPU at or below 90°C. However, the PMIC crossed 100°C within a few seconds, and I suspect that caused some of these issues. Either way, this should be handled more gracefully.
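Although it isn't explicitly logged, we can infer that emergency thermal mitigation via CPU hotplug took place about 32 seconds into the drive: the kernel reports IRQs and processes losing their affinity to CPU5, CPU6, and CPU7 in quick succession. Losing a substantial fraction of our compute does explain the behavior.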
(openpilot-py3.8) jyoung@DESKTOP-6JPRDTA:~/openpilot/selfdrive/debug$ ./filter_log_message.py "3cfdec54aa035f3f|2023-05-15--14-59-48" | grep affine
[153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5
[153674.796111] MAIN 0 kernel - IRQ238 no longer affine to CPU5
[153674.796161] MAIN 0 kernel - IRQ239 no longer affine to CPU5
[153674.796400] MAIN 0 kernel - IRQ240 no longer affine to CPU5
[153674.799377] MAIN 0 kernel - IRQ241 no longer affine to CPU5
[153674.799485] MAIN 0 kernel - IRQ242 no longer affine to CPU5
[153674.799538] MAIN 0 kernel - IRQ243 no longer affine to CPU5
[153674.799587] MAIN 0 kernel - IRQ244 no longer affine to CPU5
[153674.799635] MAIN 0 kernel - IRQ245 no longer affine to CPU5
[153674.799701] MAIN 0 kernel - IRQ565 no longer affine to CPU5
[153674.799763] MAIN 0 kernel - process 229086 (selfdrive.contr) no longer affine to cpu5
[153674.810485] MAIN 0 kernel - process 229087 (selfdrive.contr) no longer affine to cpu5
[153676.025024] MAIN 0 kernel - process 229031 (camerad) no longer affine to cpu6
[153676.025519] MAIN 0 kernel - process 229109 (RoadCamera) no longer affine to cpu6
[153676.025839] MAIN 0 kernel - process 229039 (camerad) no longer affine to cpu6
[153676.025985] MAIN 0 kernel - process 229110 (WideRoadCamera) no longer affine to cpu6
[153676.026148] MAIN 0 kernel - process 229108 (DriverCamera) no longer affine to cpu6
[153676.026302] MAIN 0 kernel - process 229107 (camerad) no longer affine to cpu6
[153676.096124] MAIN 0 kernel - process 229036 (ZMQbg/IO/0) no longer affine to cpu6
[153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7
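For anyone who wants to repeat this inference on another route, here is a minimal sketch that groups the "no longer affine" kernel messages by CPU. The regex is derived only from the log lines in this thread, and `summarize_hotplug` is a hypothetical helper, not part of openpilot. A burst of these messages for one core is consistent with that core being taken offline, but the kernel log does not state why.

```python
import re
from collections import defaultdict

# Pattern inferred from the lines above (an assumption, not a kernel spec), e.g.:
#   [153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5
#   [153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7
AFFINE_RE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\].*?"
    r"(?:IRQ\d+|process \d+ \((?P<name>[^)]*)\)) "
    r"no longer affine to cpu(?P<cpu>\d+)",
    re.IGNORECASE,
)


def summarize_hotplug(lines):
    """Group 'no longer affine' messages by CPU index (hypothetical helper)."""
    summary = defaultdict(lambda: {"first_ts": None, "irqs": 0, "processes": []})
    for line in lines:
        m = AFFINE_RE.search(line)
        if m is None:
            continue  # not an affinity message
        entry = summary[int(m.group("cpu"))]
        ts = float(m.group("ts"))
        if entry["first_ts"] is None or ts < entry["first_ts"]:
            entry["first_ts"] = ts
        if m.group("name") is None:
            entry["irqs"] += 1  # IRQ lines carry no process name
        else:
            entry["processes"].append(m.group("name"))
    return dict(summary)


# A few lines copied from the log output in this thread.
SAMPLE = [
    "[153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5",
    "[153674.799763] MAIN 0 kernel - process 229086 (selfdrive.contr) no longer affine to cpu5",
    "[153676.025024] MAIN 0 kernel - process 229031 (camerad) no longer affine to cpu6",
    "[153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7",
]

summary = summarize_hotplug(SAMPLE)
for cpu in sorted(summary):
    info = summary[cpu]
    print(f"cpu{cpu}: first seen at {info['first_ts']}, "
          f"{info['irqs']} IRQ(s), processes={info['processes']}")
```

To run it on a real route, the same function could be fed the lines produced by `filter_log_message.py` instead of `SAMPLE`, for example by reading them from standard input.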
> Unfortunately, your device is one of the few affected by https://github.com/commaai/openpilot/pull/25959.
How many is a few? Rather than spend time on software, does it make sense for me to send this device in for rework?
@adeebshihadeh do you still consider this an open issue? All the misbehaviors probably trace back to the CPU hotplug thermal mitigation, and several recent updates to thermald make it less likely we'll reach that point.
Yes, I’d still like to handle this specific scenario better.
Thought about it more, and I don't think a specific check makes sense. We already check the things that matter (processes lagging, crashing, etc.).
Describe the bug
Had openpilot freak out on me earlier today. Don't know if the overtemp factor is causative or merely correlated.
I actually could not terminate this easily without unplugging the C3. Shutting off the car didn't break the cycle, and I didn't quite manage to get the UI to scroll down to the reboot button before it "crashed" again.
Provide a route where the issue occurs
3cfdec54aa035f3f|2023-05-15--14-59-48
openpilot version
717bc04ddc330c43e794f28ee6ff3a287425112e
Additional info
Unmodified master as of a couple of days ago. I haven't tried to do much analysis, other than to note that no UI crash dumps were uploaded. It also looks like both forward cameras stopped encoding (fcam/ecam plus qcams), but dcams kept going.