BeardOverflow / msi-ec

GNU General Public License v2.0

Alpha 17 B5eek: Something weird about fan-speeds #164

Open Freihut opened 1 month ago

Freihut commented 1 month ago

Laptop model

Alpha 17 B5eek

EC firmware version

17LLEMS1.106

Description

Tl;dr: cpu-fan-speed seems incorrect; gpu-fan-speed is plausible, but somehow not in "turbine mode".


I've got some weird readings here:

Situation 1: Created some cpu-load while running: watch --interval 1 cat /sys/devices/platform/msi-ec/cpu/realtime_fan_speed

43
76
96
cat: /sys/devices/platform/msi-ec/cpu/realtime_fan_speed: Invalid argument

(combined output of several seconds)

Pluma (a text editor) also throws the "Invalid argument" at the same time, so likely not a cat issue.
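To tell a driver-side error apart from a tool problem, the same file can also be polled from a short script that reports the errno instead of aborting. This is only a minimal sketch, assuming Python 3 is available; the names `read_fan_speed` and `poll` are made up for illustration, and only the sysfs path comes from the report above.

```python
import errno
import time
from typing import Optional

# Path taken from the report; everything else here is illustrative.
CPU_FAN = "/sys/devices/platform/msi-ec/cpu/realtime_fan_speed"


def read_fan_speed(path: str = CPU_FAN) -> Optional[int]:
    """Return the integer stored in `path`, or None if the read fails with EINVAL."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError as e:
        if e.errno == errno.EINVAL:
            return None  # the kernel answered "Invalid argument"
        raise  # anything else (missing file, permissions) is a different problem


def poll(path: str = CPU_FAN, count: int = 5, interval: float = 1.0) -> None:
    """Print `count` samples, `interval` seconds apart."""
    for _ in range(count):
        value = read_fan_speed(path)
        print("EINVAL" if value is None else value)
        time.sleep(interval)
```

Calling `poll()` prints five samples one second apart; a line reading `EINVAL` means the read itself was rejected, independent of `cat` or any editor.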

Situation 2: Idle + FN + Arrow up (which makes the fans go into "turbine mode") but msi-ec/cpu/realtime_fan_speed reports "43", while msi-ec/gpu/realtime_fan_speed reports "0".


Meanwhile I get the attached output while reading the ec (/sys/kernel/debug/ec/ec0/io) by a small pascal prog I used before.

Line 1 = the dump of the whole ec-line
Line 2 = the gpu-rpm-speed
Line 3 = the cpu-rpm-speed
Interval is 1000ms.

output1.txt: idling laptop, just going into "turbine mode" and going back to normal after some seconds. Msi-ec reports "43" for cpu and "0" for gpu all along.

output2.txt: laptop has full cpu load. The cpu-fan is at around 3900rpm, while the gpu-fan is at 0 and gets turned on once the gpu reaches 55°C (as the case warms up, I guess). Msi-ec reports "invalid" for cpu all the time and "0" for gpu in the beginning; later it goes up to 43, which is kind of plausible.

I have been using the pascal prog constantly for around 1 year, so I'm fairly sure the readings are correct; at least they're plausible.

I'm using the latest BIOS E17LLAMS.10B from 2023-06-15 with Arch Linux on Kernel 6.11.0

(the pascal prog src can be compiled with Lazarus; needs to be run as root (to read /sys/kernel/debug/ec/ec0/io) while ec_sys module is running)

output1.txt output2.txt read_ec.tar.gz

glpnk commented 1 month ago

For many devices, the RPM might be set to the wrong address and scaled incorrectly. Actually, the EC does not show RPM, but a % of RPM in the range 0-150. Someone in the past tried to "normalize" the CPU %RPM to a 0-100% range, and now it returns some wrong values. Fans are turned on and off according to the curve, with some hysteresis. A fan mode like silent/auto/advanced just limits the maximum available %RPM to some value, without any scaling.

IDK what turbine mode is.

Freihut commented 1 month ago

rpm-readings vs percent-readings isn't my point (I'm aware of that).

(1881/5558)×100=33%, msi_ec shows 43(%). (3900/5558)×100=70%, msi_ec shows broken stuff ('invalid argument')

"turbine mode"=fans@maximum, done by FN+Arrow up

glpnk commented 1 month ago

Not all devices have turbine mode, but many have cooler boost, which might be the same thing.

Where did you get the number 5558 from?

The MSI EC doesn't calculate percentages (except for the broken CPU %RPM meter, which needs to be removed).

Don't look at the CPU RPM reported by the driver.

Freihut commented 1 month ago

Yes, it is the same thing.

5558 = the maximum cpu-fan-rpm (on "turbine mode") for my device, so 100%.
1881 = idle.
3900 = cpu-rpm at maximum cpu-load.

The rpm values I've got from my own prog's readings as described in the initial post.
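With those three readings, the percentages quoted earlier can be reproduced in a few lines. This is just the arithmetic from the thread written out, assuming the fan speed is expressed relative to the 5558 rpm maximum; the function name `rpm_to_percent` is made up for illustration.

```python
MAX_RPM = 5558  # cpu-fan rpm with "turbine mode" on, as reported above


def rpm_to_percent(rpm: int, max_rpm: int = MAX_RPM) -> float:
    """Express an rpm reading as a percentage of the maximum rpm."""
    return rpm / max_rpm * 100


for label, rpm in (("idle", 1881), ("full cpu load", 3900)):
    print(f"{label}: {rpm} rpm = {rpm_to_percent(rpm):.1f}%")
```

This prints 33.8% for idle and 70.2% for full cpu load, against the 43 and "Invalid argument" that msi-ec returned in those two situations.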

glpnk commented 1 month ago

You can assume that boost speed isn't 100% but 150 or 200, plus correlation may be non-linear

Freihut commented 1 month ago

No, I won't. msi_ec is reading wrong values for this device (I guess they're target-fan-speeds) and doing wrong math with the wrongly taken cpu-fan-speed (by subtracting and dividing addresses) which can result in undefined behavior (for all devices).

glpnk commented 1 month ago

MSI EC does not control the fan curve, but I want to fix the realtime %RPM readings soon.

Freihut commented 1 month ago

In the meantime, affected people can use my forked repo for this device. It reads the rpm values from the correct addresses.

mutchiko commented 1 month ago

yeah well it's only logical that these addresses are messed up, i was so concentrated on getting shift_mode to work that i completely forgot about testing cpu/gpu fans speed addresses.

now that i remember correctly, i used ec_sys module readings for fans speeds, and not the actual driver itself.

by the way @Freihut your repo works kinda well, the realtime_fan_speed file in /sys/devices/platform/msi-ec/cpu/ is broken (impossible to open); same with the gpu file except it shows 0 all the time so you might want to check with that too.

Freihut commented 1 month ago

> now that i remember correctly, i used ec_sys module readings for fans speeds, and not the actual driver itself.

That's fine, as they both should read from the same source. Letting the device idle and using watch --interval 1 sudo xxd -g 1 /sys/kernel/debug/ec/ec0/io (or maybe a smaller interval) while playing around with the turbines "cooler boost" is IMO the best way to find the fan-addresses.
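The same search can be scripted so that only the bytes that changed between two snapshots are shown. The following is a rough sketch, not the tool mentioned elsewhere in this thread: it assumes root access and a loaded ec_sys module, and the helper names `read_ec`, `changed_bytes` and `watch` are invented for this example.

```python
import time
from typing import List, Tuple

EC_IO = "/sys/kernel/debug/ec/ec0/io"  # needs root and the ec_sys module


def read_ec(path: str = EC_IO) -> bytes:
    """Read one snapshot of the EC memory."""
    with open(path, "rb") as f:
        return f.read()


def changed_bytes(old: bytes, new: bytes) -> List[Tuple[int, int, int]]:
    """Return (offset, old_value, new_value) for every byte that differs."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


def watch(samples: int = 10, interval: float = 1.0) -> None:
    """Print the offsets that changed between consecutive snapshots."""
    previous = read_ec()
    for _ in range(samples):
        time.sleep(interval)
        current = read_ec()
        for offset, a, b in changed_bytes(previous, current):
            print(f"0x{offset:02x}: {a:3d} -> {b:3d}")
        previous = current
```

Running `watch()` while toggling cooler boost should leave only a handful of offsets changing, which narrows down the candidates for fan-speed addresses.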

> repo works kinda well, the realtime_fan_speed file in /sys/devices/platform/msi-ec/cpu/ is broken (impossible to open);

That's where my changes are, so it's not "well" at all. :c The code in my fork only works for the Alpha 17 b5eek (CONF22), because it needs .rt_fan_speed_fallback in .cpu = {} and .gpu = {} to be set. I haven't done this for the other devices, because I can't test it and it might be a device-specific workaround. If you're using the same hardware as me, 0xcd and 0xcb in your ec are not matching the fan-speeds.

> same with the gpu file except it shows 0 all the time so you might want to check with that too

If /sys/devices/platform/msi-ec/gpu/realtime_fan_speed reports 0 and you're 100 % sure the GPU-Fan is running (GPU-Temp > 55°C or the coolerboost is on) then it also reads on a wrong address (0xcb) and therefore displays the fallback.

mutchiko commented 1 month ago

@Freihut before i continue testing the fans speed readings with you, i'd like to confirm a few things in advance:

  1. output of sudo dmesg | grep error
  2. both iGPU and dGPU usage under load (notice anything wrong?)
  3. idle cpu temperature (after booting and logging in from a cold start)
  4. max power limit reported by nvtop or amdgpu top for the rx6600m
  5. any bios settings that you changed

please do all of these under linux, thanks.

P.S: what you call turbine mode is actually turbo boost.

Freihut commented 1 month ago

> 1. output of `sudo dmesg | grep error`

just a bunch (less than 10) of ACPI Error: Aborting method \_SB.PCI0.SBRG.EC._Q9A due to previous error (AE_NOT_EXIST) (20240322/psparse-529).

> 2. both iGPU and dGPU usage underload (notice anything wrong?)

What is that question for? That's reported by amdgpu (which is just passing on firmware readouts) and more or less reasonable. ("More or less" because values reported by the firmware are "meh").

> 3. idle cpu temperature (after booting and logging in from a cold start)

Around 50°C, depending on room-temp.

> 4. max power limit reported by nvtop or amdgpu top for the rx6600m

According to amdgpu it is 65 W. With Furmark and smartshift enabled I can push the dGPU to around 68 W, but /sys/class/drm/card[X]/device/hwmon/hwmon7/power1_cap_max still reports 65 W.

> 5. any bios settings that you changed

My device reports fan-rpm-speeds on 0xcb and 0xcd even for BIOS defaults.

Settings I've changed and can remember: Smartshift, secure boot and modern standby off, UMA for iGPU to 512 MiB. But like I wrote: I have used these addresses for about 1~2 years now, and they never changed and always report plausible speeds. At least for my device.

> P.S: what you call turbine mode is actually turbo boost.

Ya, I know, but turbine mode sounds better. :)

BTW, I just made a gui-tool to live view the ec. It highlights changes and does some math to help find fan-speed-addresses. But it's pretty alpha right now.

mutchiko commented 1 month ago

the reason i asked you these questions is that i'm trying to see if the driver is functioning properly before re-checking other addresses, for example: disabling smartshift from bios will prevent the ec from doing any actual performance changes when you change shift mode in the driver or in the msi dragon center, but will change the fans curves.

disabling modern standby will reset all the power/performance changes after waking up from sleep, you'll have to re-apply them by re-selecting the performance mode (shift mode) that you want; if it's enabled, you should see an mp2 acpi error that is related to modern standby. that's why i asked you for acpi errors.

i asked you for gpu usage because the vbios has an issue that makes it report 99% on almost any load.

> According to amdgpu it is 65w

seems like smartshift doesn't work on linux for some reason.

users of the alpha 15 reported that it works fine. after further searching i found out that its RX6600M vbios is different from the one found on the alpha 17; i assume that flashing the alpha 15 vbios might fix the issue, but it might brick your laptop.

> I just made a gui-tool to live view the ec

just tried it out and it's really cool, hopefully it will make it easier for people to test if the driver is working correctly on their laptops or not, thanks for your work.

Freihut commented 1 month ago

> the reason i asked you these questions [...]

Thanks for explaining.

> i asked you for gpu usage because the vbios has an issue that makes it report 99% on almost any load.

I can remember that this occurred to me some days ago after standby. But I just tried to reproduce it and both gpus keep reporting sane utilization values. Weird. (No updates happened between these situations).

> seems like smartshift doesn't work on linux for some reason.

It kinda does, but in a weird way, and it keeps changing as the kernel progresses. 2 years ago smartshift shifted a lot to the gpu (if I remember correctly it ran at about 85 W and the cpu dropped to 2.5 GHz). With the current kernel it shifts about 3 W, but very slowly (you can see the gpu's power draw increase over several minutes of load). Writing any value to the somethingbiassomething-file had no effect.

Smartshift also has some side effects on ryzenadj, but I couldn't figure out what exactly happens there.

> just tried it out and its really cool, hopefully it will make it easier for people to test if the driver is working correctly on their laptops or not, thanks for your work.

Thanks for the feedback, I'm glad to help.

mutchiko commented 2 weeks ago

I did my testing and @Freihut is right:

Values contained in these 2 addresses are percentages for the target speed, not actual speed in rpms; the file /sys/devices/platform/msi-ec/cpu/realtime_fan_speed is unreadable if the target percentage is below 25% or above 55%.

There seems to be a mismatch between the values reported by ec_sys and msi-ec: when target percentage is 25%, msi-ec reports 0%, and when target is 55%, msi-ec reports 100.

so it's only possible to read the file if the target is between 25% and 55%.
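These observations fit a simple linear rescaling of the raw EC value with a range check in front of it. The sketch below is an assumption built only from the numbers in this comment (25 maps to 0, 55 maps to 100, everything outside is rejected); it is not copied from the driver, and the names `normalized_speed`, `BASE_MIN` and `BASE_MAX` are hypothetical.

```python
import errno

BASE_MIN = 25  # raw value at which msi-ec reported 0, per the test above
BASE_MAX = 55  # raw value at which msi-ec reported 100, per the test above


def normalized_speed(raw: int, lo: int = BASE_MIN, hi: int = BASE_MAX) -> int:
    """Map a raw value in [lo, hi] linearly onto 0..100, else fail with EINVAL."""
    if raw < lo or raw > hi:
        raise OSError(errno.EINVAL, "Invalid argument")
    return 100 * (raw - lo) // (hi - lo)
```

Under this assumption, raw values of 38, 48 and 54 come out as 43, 76 and 96, which happen to be the three readings from the first post, and any raw value above 55 fails with "Invalid argument", as the cpu file does under full load.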

mutchiko commented 2 weeks ago

let's fix things one at a time, correct addresses take priority. @Freihut do you want me to fix it or do you want to make a merge request yourself?

Freihut commented 2 weeks ago

Wait a minute, you can't just fix the addresses, because this needs a rather big overhaul in calculating the fan speeds.

Look at the way I calculate the rpm in my forked code.

But this works only for the Alpha 17 b5eek (and of course devices using the same fans). To fix this for all users you'll need to add the Fallback-rpm for each device currently supported or find the addresses to make msi-ec read that out by itself.