Closed filipleple closed 2 months ago
I might have found a clue here. The THR002.001 case sets a temperature threshold of 70.0C but fails, and the CPU is reported to be at 80.0C. After the test, the device did not turn back on. Maybe the device for some reason does not reboot after setting the threshold and only reboots after running the stress test, which leaves it unable to boot until the CPU temperature drops?
------------------------------------------------------------------------------
THR002.001 Try to enter a threshold value within the limits and ve... ....
Checking if stress-ng is installed...
Package stress-ng is installed
THR002.001 Try to enter a threshold value within the limits and ve... | FAIL |
'80.0 < 73' should be true.
------------------------------------------------------------------------------
Dasharo-Compatibility.Cpu-Throttling | FAIL |
3 tests, 0 passed, 1 failed, 2 skipped
==============================================================================
(... skips)
==============================================================================
DDET001.001 USB Stack disable :: Test disabling the USB stack | FAIL |
OSError: Socket is closed
------------------------------------------------------------------------------
DDET002.001 USB Stack enable :: Test enabling the USB stack | FAIL |
OSError: Socket is closed
------------------------------------------------------------------------------
DDET003.001 Usb Devices Detected In Firmware Warmboot :: Test if U... ..^CSecond signal will force exit.
DDET003.001 Usb Devices Detected In Firmware Warmboot :: Test if U... | FAIL |
Execution terminated by signal
(exited at this moment because device does not boot)
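For readers trying to interpret the `'80.0 < 73' should be true.` message from THR002.001, here is a minimal sketch of the kind of comparison the test appears to make. The 3-degree margin is only inferred from the 70.0C threshold and the 73 in the log; the function name and parameters are hypothetical and are not taken from the test suite.

```python
def temperature_check_passes(measured_c: float, threshold_c: float,
                             margin_c: float = 3.0) -> bool:
    """Return True when the measured CPU temperature is below threshold + margin.

    margin_c is an assumption inferred from the log (70.0C threshold vs. 73).
    """
    return measured_c < threshold_c + margin_c


# Values from the failing run: threshold 70.0C, CPU reported at 80.0C.
print(temperature_check_passes(80.0, 70.0))
```

Under these assumptions the check fails for the values reported above, which matches the log, and would pass for any reading below 73C.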
After running dasharo-security/wifi-bluetooth-switch.robot the laptop got "bricked" again. (The suite failed because flashing via SSH is still interrupted and running tests via DCU is not possible right now.)
I pressed Fn+1 and, to my surprise, the fans started blowing at maximum speed. I left the laptop like that without touching anything else: no powering down, no disconnecting the battery, no pressing anything else. After a couple of minutes, messages from systemd about the GPU started appearing, and after another couple of minutes the laptop rebooted and was working properly.
I would like to point out that pressing Fn+1 a second time to DISABLE the high performance mode causes one of the fans to slow down immediately while the second one slows down gradually, which can easily be verified audibly. Maybe it is somehow related.
I have also seen bricks after suspending and after reboots. In the cases where I saw this, ME was enabled, but a custom BIOS boot splash logo was implemented.
Something is definitely overheating, especially on the RTX 4070 models. Powering off and waiting a couple of minutes usually works.
Based on the POST codes, I think it hangs in edk2. We need to build coreboot with EC logging enabled, EDK2 in debug mode with serial redirection enabled, and the EC with the parallel debugger enabled, then check the logs to see where it hangs.
We have found that the V560TNE bricks when suspending with the default kernel of Ubuntu 24.04 LTS, fully updated.
The same happens after trying to integrate a custom boot logo with DTS (RC) on the V560TND.
This issue should be top-priority.
I can confirm that the laptop that didn't turn on after suspending could be turned on again once it was cooled down.
The issues with suspension are fixed by upgrading the kernel to 6.9. Changing the boot logo was working fine today on our V560TNE with v0.9.1-rc4 and kernel 6.9.
https://docs.dasharo.com/unified/clevo/post-install/#linux
On Gen 14 (Meteor Lake), it's recommended to install the Ubuntu mainline kernel, which is a newer version than the default Ubuntu kernel. This version contains additional fixes for newer hardware which helps with power management and suspend on Gen 14 laptops.
@filipleple @philipandag But you still face bricks when using Linux 6.9, just not after suspend?
Yes, although they are much rarer.
Very likely to be fixed by https://github.com/Dasharo/ec/commit/3786c8ce8a2562e651c180f094939fae861b20d1 .
The ME_WE pin was floating, and in some conditions (depending on temperature, but also possibly other factors) it would be sampled high instead of low, which in turn caused ME to enter FDOPSS state. When in FDOPSS, sending the End-of-post HECI command would fail, and coreboot would refuse to boot, because booting without sending EOP is considered insecure.
The pin was configured as input, because we got the GPIO config from previous firmware, and missed this error during review.
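The failure chain described above can be summarized as a small model. This is only an illustrative sketch of the reasoning in this comment, not firmware code: every name here (`MeState`, `me_state_from_pin`, `send_end_of_post`, `boot_allowed`) is hypothetical, and the real logic lives in the EC, the ME, and coreboot.

```python
from enum import Enum


class MeState(Enum):
    NORMAL = "normal"
    FDOPSS = "fdopss"


def me_state_from_pin(me_we_sampled_high: bool) -> MeState:
    # A floating ME_WE pin may be sampled high; per the comment above,
    # that puts the ME into the FDOPSS state.
    return MeState.FDOPSS if me_we_sampled_high else MeState.NORMAL


def send_end_of_post(state: MeState) -> bool:
    # Per the comment above, the End-of-POST HECI command fails in FDOPSS.
    return state is MeState.NORMAL


def boot_allowed(me_we_sampled_high: bool) -> bool:
    # coreboot refuses to boot when End-of-POST could not be sent.
    return send_end_of_post(me_state_from_pin(me_we_sampled_high))


for sampled_high in (False, True):
    print(f"ME_WE sampled high={sampled_high}: boot allowed={boot_allowed(sampled_high)}")
```

With the pin driven low, sampling no longer depends on temperature or other conditions, so the model always takes the `NORMAL` path.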
I have been testing v0.9.1-rc5 on the V540TND since yesterday and no bricks have occurred. I tried quick reboots, consecutive reflashes, and stressing the hardware to make it hot. It seems that the V540TND, at least, doesn't have this issue with the newest rc5.
Same here for rc5 on the V560TND and V560TNE.
In that case I believe we can close this issue
Component
Dasharo firmware
Device
NovaCustom V56 14th Gen
Dasharo version
v0.9.1-rc2
Dasharo Tools Suite version
--
Test case ID
--
Brief summary
NovaCustom V560TNE bricks most likely due to overheating, usually when rebooting after flashing
How reproducible
90%
How to reproduce
Run any regression suite.
Expected behavior
The suite should pass
Actual behavior
After a few tests and flashes, the laptop will refuse to boot up and will become very hot. It will boot up again after a ~5-10 minute cooldown period
Screenshots
--
Additional context
So far we can reliably reproduce this on one unit only.
Solutions you've tried
Re-applying thermal paste