electrified / asus-wmi-sensors

Linux HWMON (lmsensors) sensors driver for various ASUS Ryzen and Threadripper motherboards
GNU General Public License v2.0

module causing false overtemp issues on Ryzen 3900X #33

Closed KeithMyers closed 4 years ago

KeithMyers commented 5 years ago

I am having issues now with the module. It had worked beautifully and with no issue on my ASUS Crosshair VII Hero motherboard with the 2700X cpu.

I have upgraded to the 3900X cpu. I am now getting regularly shut down with a cpu overtemp error. The computer reboots to the BIOS splash screen and sits there until the error is acknowledged with the F1 key. If you then page to the Monitor screen, you will see the cpu temp highlighted in red. I have mostly seen the value of 78°C, but just this morning I saw 85°C. I was not able to capture the value because I did not correctly turn on my phone's movie mode. The error clears within a second after landing on the Monitor page, so it is hard to capture it with a snapshot.

I am having a discussion on the C7H OCN thread about the problem and not getting anywhere, since all the thread participants are Windows users. One person has been helpful in suggesting that the WMI interface in the BIOS was still buggy when Elmor left. He also suggests that ASUS could have introduced a new bug in the WMI interface in the new BIOSes that support Ryzen 3000 cpus. I should not be getting any overtemp errors at these reported temps, even if they were real. The forum participants have reported that the cpu does not protect itself until it hits between 105-110°C, which they have seen when their fans or pump stopped running.

I have had neither fans nor pump stop running as they are powered directly from the power supply. My running load temps are normally in the high 60's to low 70's.

Is it possible the module needs updating to handle Ryzen 3000 cpus? Is anybody here running the module with a Ryzen 3900X besides me?

It has been suggested that the module is causing the issue and to remove it. I can do that but would then have no sensors at all since k10temp is now non-operational on Ryzen 3000.

This is the overtemp error screen on the BIOS splash screen.

https://live.staticflickr.com/65535/48447139302_c86a53978b_c.jpg

vincentkfu commented 5 years ago

I was able to get k10temp working for Ryzen 3000 by modifying a patch submitted for the linux kernel: https://github.com/vincentkfu/k10temp

KeithMyers commented 5 years ago

How did you get this to work? When I attempted to download and compile the k10temp.c module from the upstream kernel repository, it puked because I did not have the amd_nb.c and pci_ids.h files.

vincentkfu commented 5 years ago

I just applied a modified version of the patch from https://patchwork.kernel.org/patch/11043271/ to the k10temp from https://github.com/groeck/k10temp

Actually, groeck has updated his repository to support Ryzen 3000, so you should probably use his.

Edit: Actually, I get 206.9°C readings with his. Not sure what's going on.

electrified commented 5 years ago

Zenpower/Zenmonitor may be another option too

https://github.com/ocerman/zenpower

Regarding your issue @KeithMyers, this may be the issue with WMI and Matisse that the other hardware monitoring author I contacted was talking about.

The WMI interface is really high level - methods to get count of sensors, get sensor names, get sensor value - there is nothing to adjust for Ryzen 3000, those adjustments would be within the WMI interface BIOS code.
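On the Linux side, whatever the driver gets back from those WMI calls ends up as plain files under the kernel's hwmon sysfs tree. The following is only a minimal sketch of how to list them, not code from this driver: `list_hwmon_sensors` is a made-up helper name, and it relies only on the generic hwmon layout (`name`, `*_input`, and optional `*_label` files), where temperatures are in millidegrees Celsius and fans in RPM.

```python
from pathlib import Path


def list_hwmon_sensors(root="/sys/class/hwmon"):
    """Return {hwmon_dir: {"name": chip_name, "readings": {label: raw_value}}}.

    Hypothetical helper for illustration. Raw values use hwmon units:
    temp*_input is millidegrees Celsius, fan*_input is RPM.
    """
    chips = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        if not name_file.is_file():
            continue
        readings = {}
        for input_file in sorted(chip_dir.glob("*_input")):
            channel = input_file.name[: -len("_input")]  # e.g. "temp1", "fan2"
            label_file = chip_dir / f"{channel}_label"
            # Fall back to the channel name when the driver gives no label.
            label = label_file.read_text().strip() if label_file.is_file() else channel
            try:
                readings[label] = int(input_file.read_text().strip())
            except (OSError, ValueError):
                readings[label] = None  # sensor exists but could not be read
        chips[chip_dir.name] = {
            "name": name_file.read_text().strip(),
            "readings": readings,
        }
    return chips
```

Calling `list_hwmon_sensors()` with no argument walks the real `/sys/class/hwmon`; passing another directory makes it easy to try against a copy of the files.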

Can you see what happens under Windows with HWiNFO? It would be interesting to see whether a) it still uses WMI with Zen 2 and b) it causes the behaviour you are seeing.

I've seen people report 1.5 V and boost clocks while HWiNFO is open, so I'm wondering if this is the same thing.

KeithMyers commented 5 years ago

I actually have finally figured out what is happening. The BIOS is turning off all the motherboard fan headers. This was mentioned in the forums, but until now I had never caught the fans not running. This time I was once again using the computer, looked over at the GKrellM monitor, and saw the cpu at 92°C and 0 fan rpms for the five fans I have connected to the motherboard headers. I quickly grabbed a flashlight and saw that in fact no fans were spinning. So I think the mystery is finally solved. I have now attached the 3 radiator fans to a molex connector from the power supply to take the motherboard out of the picture. I will order a fan controller for now to plug the case fans into.

I will download and reinstall the k10temp driver from Guenter.
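The "0 fan rpms on every header" symptom can also be checked from a script instead of by eye. This is a rough sketch under stated assumptions: the function names are invented for illustration, and it assumes the fans show up as standard hwmon `fanN_input` files (RPM) in one chip directory.

```python
from pathlib import Path


def read_fan_rpms(chip_dir):
    """Return {channel: rpm} for every fanN_input file in one hwmon chip directory."""
    rpms = {}
    for input_file in sorted(Path(chip_dir).glob("fan*_input")):
        channel = input_file.name[: -len("_input")]  # e.g. "fan1"
        try:
            rpms[channel] = int(input_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip channels that cannot be read right now
    return rpms


def all_fans_stopped(rpms):
    """True only when at least one fan is reported and every one reads 0 RPM."""
    return bool(rpms) and all(rpm == 0 for rpm in rpms.values())
```

An empty result is deliberately not treated as "stopped", since it more likely means the chip directory has no fan channels at all. Which `hwmonN` directory to pass in depends on the machine.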

KeithMyers commented 5 years ago

The k10temp driver from Guenter Roeck's repository does not work.

keith@Serenity:~$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +206.9°C (high = +70.0°C)
Tctl: +206.9°C

It looks like it is trying to poll an unterminated header, like what we see with asus-wmi-sensors for unterminated headers. I think you need the other piece of the puzzle: the updated PCI IDs in the amd_nb.c and pci_ids.h files.
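A reading like 206.9°C is easy to flag automatically. The sketch below parses temperature lines in the format shown in the `sensors` output above and returns the ones outside a plausible range. The helper names are invented, and the -40 to 125°C window is an assumption chosen for illustration, not a limit taken from k10temp or AMD documentation.

```python
import re

# Matches lines such as "Tdie: +206.9°C (high = +70.0°C)" and captures the
# label and the first temperature on the line.
_TEMP_LINE = re.compile(r"^\s*(?P<label>[^:]+):\s*\+?(?P<value>-?\d+(?:\.\d+)?)\s*°C")


def parse_temps(sensors_text):
    """Return {label: degrees_c} for every temperature line in `sensors` output."""
    temps = {}
    for line in sensors_text.splitlines():
        match = _TEMP_LINE.match(line)
        if match:
            temps[match.group("label").strip()] = float(match.group("value"))
    return temps


def implausible(temps, min_c=-40.0, max_c=125.0):
    """Return the readings outside [min_c, max_c]; the bounds are assumed, not official."""
    return {label: t for label, t in temps.items() if t < min_c or t > max_c}


SAMPLE = """k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +206.9°C (high = +70.0°C)
Tctl: +206.9°C
"""

print(implausible(parse_temps(SAMPLE)))
```

With the sample above, both Tdie and Tctl are reported as implausible, which matches the suspicion that the driver is reading the wrong place rather than a real temperature.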

electrified commented 5 years ago

Best to keep this issue open as a warning for others with Ryzen 3000 CPUs not to use the driver.

KeithMyers commented 5 years ago

I'm posting over at Guenter's repository. I asked the question and need to get it confirmed. I believe the k10temp driver won't work without the latest 5.4 kernel, which has all the necessary parts for it to work.

laichiaheng commented 5 years ago

Is it a BIOS bug? My case fans and CPU fan have stopped twice in 2 months. Crosshair VII Hero.

KeithMyers commented 5 years ago

I never had a single issue with any fan on any of my Crosshair VII Hero hosts when they were on BIOS 1002. I have updated only one host to BIOS 2501 for the 3900X cpu, and it now has fan issues where the BIOS turns off all fans twice a day on average and the host cpu overtemps and reboots. I solved that issue by moving all fans off the motherboard headers to a fan hub controller powered by a SATA connection.

KeithMyers commented 5 years ago

FYI, Guenter Roeck has updated his k10temp repo on GitHub with a new version of the driver that works with Ryzen 3000. I am running that along with the asus-wmi-sensors driver with no issues.

laichiaheng commented 5 years ago

@KeithMyers Is there any temporary workaround for it? It just stopped last night; there must be something that triggers it.

KeithMyers commented 5 years ago

From what I and others in the Crosshair VII Hero OCN threads have determined, it is a flaw in the Ryzen 3000 BIOSes, so every BIOS from 2304 to 2501 has the flaw. Nothing can be done about it until they fix it. I removed all fans from the motherboard headers and now power them from a fan hub controller until they fix the bug. At least I don't get cpu overtemp shutdowns anymore. I hope that a more mature BIOS eventually returns the original functionality to the motherboard.

laichiaheng commented 5 years ago

@KeithMyers My BIOS version is 2602 (AGESA 1.0.0.3AB), and it still has the problem. ASUS releases new BIOSes so slowly; version 2602 has only been released on their ROG forum. Did you also remove the CPU fan from the motherboard?

KeithMyers commented 5 years ago

Yes. I removed all the fans from the motherboard and plugged them into the fan controller. You can always set the cpu fan monitor to IGNORE in the BIOS so it doesn't trigger the no cpu fan detected error.

I have the fan hub controller outputting the radiator fans to the cpu fan header for my benefit. Not that it was required or anything.

electrified commented 5 years ago

The issue seems to be that when using Ryzen 3000 CPUs, polling the WMI interface can cause the CPU fans to stop. There are some people experiencing it in this thread on the CH6 with HWiNFO and HwMonitor: https://rog.asus.com/forum/showthread.php?112159-2-Bios-Bugs-with-CROSSHAIR-VI-HERO

There are also some posts on Reddit with people having the same issue.

I've not got much hope Asus will fix the issue... :(

laichiaheng commented 5 years ago

@electrified Do you mean if we don't watch the temperature or fan speed from the motherboard, this issue will not be triggered?

electrified commented 5 years ago

I don't have a Ryzen 3000 CPU, so I'm not experiencing the issue personally, but I believe that might be the case.

I've seen a few people say they think the monitoring is the cause of the issue, e.g. from that thread

On my last reboot, I abstained from running hwmonitor or hwinfo anything longer than 5~10 minutes, only to check status then exit. No problems since. Leads me to be pretty convinced hwmonitor/hwinfo/etc utilities calling/polling the motherboard's sensors/controller for these rpm/temperature values are causing the problems with fan stopping/ramping down to terribly low level. I don't think this ever happened once on my R7 1700 on the C6H Bios 6301 so it may be the BIOS itself (or AGESA? if it's possible even).

KeithMyers commented 5 years ago

I don't think that is the whole story. Just polling the WMI interface for temps and voltages has not caused any issues on my 3900X since moving the fans off the motherboard. I have not seen any of the polled values stop, go to zero, or display nonsensical values since I took the fans off the headers. I am still reporting one radiator fan's rpm value, which comes from the controller, to the cpu fan header. I have been using the fan hub controller for over a week now without a single cpu overtemp issue or any other sensor issue so far. I would have had a dozen events by now if I had kept things the way they were originally.

I have not seen any posts on the C7H OCN threads correlating fan stoppage to WMI monitoring programs so far.

KeithMyers commented 5 years ago

Just saw this post by Mumak (developer of HWiNFO) that confirms what we suspected about the Ryzen 3000 BIOS. ASUS forgot what they learned on the earlier WMI BIOSes and re-introduced the bug that causes fans to stop if multiple programs access the WMI interface. They dropped the mutex lock on the interface in the recent BIOS.

Yup, it looks like ASUS re-introduced this issue in latest BIOSes. WMI interface was originally developed to solve problems with concurrent access to the buggy SIO chip (IT8665) and that worked well for some time. But after switching the AGESA base from PinnaclePI to ComboPI it looks like the WMI implementation on BIOS side isn't working well.
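To illustrate why a lock matters for this kind of hardware, here is a toy model, not ASUS's BIOS code (which is not public) and not the real IT8665 register map. It assumes a chip accessed through an index/data pair, where a reader first selects a register and then reads it; without serialization, a second program can change the selected index between those two steps. All class and function names are hypothetical.

```python
import threading


class FakeSuperIO:
    """Toy chip with an index/data register pair (hypothetical, not the IT8665)."""

    def __init__(self, registers):
        self._registers = dict(registers)
        self._index = None

    def select(self, index):
        self._index = index  # step 1: choose which register to access

    def read(self):
        return self._registers[self._index]  # step 2: read the chosen register


class SerializedAccess:
    """Wraps the chip so that select+read always happen as one atomic unit."""

    def __init__(self, chip):
        self._chip = chip
        self._lock = threading.Lock()

    def read_register(self, index):
        with self._lock:  # no other caller can slip in between the two steps
            self._chip.select(index)
            return self._chip.read()


def poll(access, index, count, out):
    """Simulates one monitoring program polling a single register repeatedly."""
    for _ in range(count):
        out.append(access.read_register(index))


chip = FakeSuperIO({0x10: 1200, 0x20: 45})  # made-up "fan rpm" and "temp" registers
access = SerializedAccess(chip)
fan_values, temp_values = [], []
threads = [
    threading.Thread(target=poll, args=(access, 0x10, 1000, fan_values)),
    threading.Thread(target=poll, args=(access, 0x20, 1000, temp_values)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, each poller only ever sees its own register's value. Removing the `with self._lock:` line makes it possible for one poller to read the register the other one selected, which is the general class of problem a mutex on the interface is meant to prevent.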

laichiaheng commented 5 years ago

Have they fixed it in the ABB BIOS?

KeithMyers commented 5 years ago

Supposedly the beta 002 BIOS has fixed the fan problem. It is a derivative of the official 2703 BIOS, which is AGESA 1.0.0.3ABB. I'm speaking of the Crosshair VII Hero here.

laichiaheng commented 5 years ago

@KeithMyers Bad news, it happened again with 1.0.0.3ABB.

KeithMyers commented 5 years ago

Is that the beta 002 BIOS or just the regular 2703 BIOS? The 2703 BIOS does not have the fan fix yet. Only Shamino's private beta 002 BIOS supposedly has it fixed.

laichiaheng commented 5 years ago

@KeithMyers It's the 2703 version. When will they update it?

KeithMyers commented 5 years ago

Who knows with ASUS. I guess once they approve the beta BIOS as good enough for general release.

electrified commented 5 years ago

@laichiaheng The beta BIOSes are here: https://rog.asus.com/forum/showthread.php?112744-Crosshair-VI-Wifi-fan-control-sensors-are-broken#post781768

I have not tried them.

KeithMyers commented 5 years ago

I am just going to wait for an official BIOS that fixes the issue. I still might not even bother testing it and just let others report. I solved the issue on my host by using a separate fan controller. It works. I would have to rip it out and plug all the fans back into the motherboard headers again to test. Need to remove one video card to get access to one of the headers I am using for a rear fan.

With ASUS's track record with the WMI interface, there is no guarantee that the issue won't pop up again at a later time with another BIOS. The fan controller removes the problem.

mwweissmann commented 5 years ago

@KeithMyers Can you recommend your external fan controller? Does it play well with Linux? (sorry for getting a little off-topic)

KeithMyers commented 5 years ago

It's just hardware. No software involved, so it plays nicely with Linux. It is a 10 port controller by Thermaltake. https://www.amazon.com/gp/product/B01G9BEC5W/ref=ppx_yo_dt_b_asin_title_o02_s00?ie=UTF8&psc=1

Many similar devices are available. Basically runs all the connected fans at 100% but that is not an issue for me since I was already running all fans at 100% via the motherboard headers. If you want direct fan speed control then you have to look for other solutions like the Noctua fan controller. https://www.amazon.com/Noctua-NA-FC1-4-pin-PWM-Controller/dp/B072M2HKSN

laichiaheng commented 5 years ago

I haven't had the fan issue since the 2801 BIOS.

KeithMyers commented 4 years ago

I see the issue is still open and should have been closed after the ASUS 2801 BIOS was released fixing the fan stop issue. Correcting.