Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

NovaCustom V560TNE bricks most likely due to overheating #1008

Closed filipleple closed 2 months ago

filipleple commented 3 months ago

Component

Dasharo firmware

Device

NovaCustom V56 14th Gen

Dasharo version

v0.9.1-rc2

Dasharo Tools Suite version

--

Test case ID

--

Brief summary

The NovaCustom V560TNE bricks, most likely due to overheating, usually when rebooting after flashing.

How reproducible

90%

How to reproduce

Run any regression suite.

Expected behavior

The suite should pass

Actual behavior

After a few tests and flashes, the laptop refuses to boot and becomes very hot. It boots again after a cooldown period of roughly 5-10 minutes.

Screenshots

--

Additional context

So far we can reliably reproduce this on one unit only.

Solutions you've tried

Re-applying thermal paste

philipandag commented 3 months ago

I might have found a clue here. The THR002.001 case sets a temperature threshold of 70.0 °C but fails, with the CPU reported at 80.0 °C. After the test the device did not turn back on. Maybe the device, for some reason, does not reboot after the threshold is set and only reboots after the stress test has run, which leaves it unable to boot until the CPU temperature drops?

------------------------------------------------------------------------------
THR002.001 Try to enter a threshold value within the limits and ve... ....
Checking if stress-ng is installed...

Package stress-ng is installed
THR002.001 Try to enter a threshold value within the limits and ve... | FAIL |
'80.0 < 73' should be true.
------------------------------------------------------------------------------
Dasharo-Compatibility.Cpu-Throttling                                  | FAIL |
3 tests, 0 passed, 1 failed, 2 skipped
==============================================================================

(... skips)

==============================================================================
DDET001.001 USB Stack disable :: Test disabling the USB stack         | FAIL |
OSError: Socket is closed
------------------------------------------------------------------------------
DDET002.001 USB Stack enable :: Test enabling the USB stack           | FAIL |
OSError: Socket is closed
------------------------------------------------------------------------------
DDET003.001 Usb Devices Detected In Firmware Warmboot :: Test if U... ..^CSecond signal will force exit.
DDET003.001 Usb Devices Detected In Firmware Warmboot :: Test if U... | FAIL |
Execution terminated by signal

(exited at this moment because device does not boot)
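
The failing assertion in the log above ('80.0 < 73') can be read as a comparison of the measured CPU temperature against the configured threshold plus some tolerance. A minimal Python sketch of such a check follows. It assumes a 70 °C threshold and a 3 °C margin; the margin is only inferred from the log and is not confirmed by the test source, and `temperature_within_limit` is a hypothetical helper, not part of the test suite.

```python
def temperature_within_limit(measured_c: float,
                             threshold_c: float,
                             margin_c: float = 3.0) -> bool:
    """Return True if the measured temperature is below threshold + margin.

    margin_c is an assumption inferred from the '80.0 < 73' log line.
    """
    return measured_c < threshold_c + margin_c


if __name__ == "__main__":
    # Values from the log: 80.0 C measured against a 70 C threshold.
    print(temperature_within_limit(80.0, 70.0))
    # A reading just above the threshold but inside the assumed margin.
    print(temperature_within_limit(71.5, 70.0))
```

With the values from the log the check evaluates to false, which matches the reported failure.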

philipandag commented 3 months ago

After running dasharo-security/wifi-bluetooth-switch.robot the laptop got "bricked" again. (The suite failed because flashing via SSH is still interrupted and running tests via DCU is not possible right now.) I pressed Fn+1 and, to my surprise, the fans started blowing at maximum speed. I left the laptop like that without touching anything else: no powering down, no disconnecting the battery, no other key presses. After a couple of minutes, systemd messages about the GPU started appearing, and after another couple of minutes the laptop rebooted and worked properly.

philipandag commented 3 months ago

I would like to point out that pressing Fn+1 a second time, to disable the high-performance mode, makes one fan slow down immediately while the other slows down gradually, which is easy to verify by ear. Maybe it is somehow related.

wessel-novacustom commented 3 months ago

I have also seen bricks after suspending and after reboots. In the cases where I saw this, ME was enabled, but a custom BIOS boot splash logo was implemented.

mkopec commented 3 months ago

Something is definitely overheating, especially on the RTX 4070 models. Powering off and waiting a couple of minutes usually works.

mkopec commented 3 months ago

Based on the POST codes, I think it hangs in EDK2. We need to build coreboot with EC logging enabled, EDK2 in debug mode with serial redirection enabled, and the EC with the parallel debugger enabled, then check the logs to see where it hangs.

wessel-novacustom commented 2 months ago

We have found that the V560TNE bricks when suspending with the default kernel of Ubuntu 24.04 LTS, fully updated.

The same happens after trying to integrate a custom boot logo with DTS (RC) on the V560TND.

This issue should be top-priority.

wessel-novacustom commented 2 months ago

I can confirm that the laptop that didn't turn on after suspending could be turned on again once it was cooled down.

philipandag commented 2 months ago

The suspend issues are fixed by upgrading the kernel to 6.9. Changing the boot logo worked fine today on our V560TNE with v0.9.1-rc4 and kernel 6.9.

https://docs.dasharo.com/unified/clevo/post-install/#linux

On Gen 14 (Meteor Lake), it's recommended to install the Ubuntu mainline kernel, which is a newer version than the default Ubuntu kernel. This version contains additional fixes for newer hardware which helps with power management and suspend on Gen 14 laptops.
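
To confirm that a test machine is actually running a kernel at or above 6.9 before chasing suspend bricks, something like the following sketch could be used. The helper names `kernel_version` and `is_at_least` are hypothetical, and the parsing assumes a Linux-style release string that starts with `major.minor`.

```python
import platform
import re


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string like '6.9.3-060903-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least(release: str, minimum: tuple[int, int] = (6, 9)) -> bool:
    """Compare the parsed version against the minimum as integer tuples."""
    return kernel_version(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    try:
        status = "ok" if is_at_least(release) else "older than 6.9"
    except ValueError:
        status = "could not parse release string"
    print(release, status)
```

Comparing integer tuples rather than strings matters here, since a plain string comparison would order "6.10" before "6.9".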

macpijan commented 2 months ago

@filipleple @philipandag But you still face bricks when using Linux 6.9, just not after suspend?

philipandag commented 2 months ago

Yes, although they are much rarer.

mkopec commented 2 months ago

Very likely to be fixed by https://github.com/Dasharo/ec/commit/3786c8ce8a2562e651c180f094939fae861b20d1 .

The ME_WE pin was floating, and in some conditions (depending on temperature, but also possibly other factors) it would be sampled high instead of low, which in turn caused ME to enter FDOPSS state. When in FDOPSS, sending the End-of-post HECI command would fail, and coreboot would refuse to boot, because booting without sending EOP is considered insecure.

The pin was configured as input, because we got the GPIO config from previous firmware, and missed this error during review.
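
The failure chain described above can be summarized as a small toy model. This is an illustrative Python sketch only, not firmware code: the names `MeState`, `me_state_from_pin`, `send_end_of_post` and `boot_proceeds` are hypothetical, and the model simply encodes the three steps from the explanation.

```python
from enum import Enum


class MeState(Enum):
    NORMAL = "normal"
    FDOPSS = "fdopss"


def me_state_from_pin(me_we_high: bool) -> MeState:
    # Per the description: ME_WE sampled high puts ME into FDOPSS.
    return MeState.FDOPSS if me_we_high else MeState.NORMAL


def send_end_of_post(state: MeState) -> bool:
    # The End-of-post HECI command fails while ME is in FDOPSS.
    return state is MeState.NORMAL


def boot_proceeds(me_we_high: bool) -> bool:
    # coreboot refuses to boot if EOP could not be sent.
    return send_end_of_post(me_state_from_pin(me_we_high))


if __name__ == "__main__":
    # A floating input may be sampled at either level.
    for sampled_high in (False, True):
        print(f"ME_WE high={sampled_high}: boots={boot_proceeds(sampled_high)}")
```

In this model the boot outcome depends entirely on the level sampled on the pin, which is consistent with bricks that come and go with temperature.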

philipandag commented 2 months ago

I have been testing v0.9.1-rc5 on the V540TND since yesterday and no bricks have occurred. I tried quick reboots, consecutive reflashes, and stressing the hardware to make it hot. It seems that the V540TND, at least, does not have this issue with the newest rc5.

wessel-novacustom commented 2 months ago

Same here for rc5 on the V560TND and V560TNE.

mkopec commented 2 months ago

In that case, I believe we can close this issue.