erpalma / throttled

Workaround for Intel throttling issues in Linux.
MIT License

T580 with dGPU (too hot) #49

Open Uatschitchun opened 6 years ago

Uatschitchun commented 6 years ago

Hi there,

First, let me thank you for your work on this!

I've got several questions and some problems you could help me with:

1.) If I disable the systemd service, it seems the PL1/PL2 settings aren't reverted to their original values, correct? Even after stopping the service and rebooting, turbostat still reports the values I last set with lenovo-fix?

OK, tested again. I explicitly set PL1_Tdp=17, and after disabling the systemd service and rebooting, this is what turbostat reports:

cpu0: PKG Limit #1: ENabled (29.000000 Watts, 28.000000 sec, clamp ENabled)                                    
cpu0: PKG Limit #2: ENabled (44.000000 Watts, 0.002441* sec, clamp DISabled)  

Anything I then set with lenovo-fix shows up in turbostat. Stopping lenovo-fix again does not reset the values!

Btw, from the above: so 29 W and 44 W are the system's defaults, and therefore safe to set in the conf?
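For reference, the two "PKG Limit" lines that turbostat prints are decoded from MSR_PKG_POWER_LIMIT (0x610). The sketch below shows roughly how that decoding works. It assumes a power unit of 1/8 W and a time unit of 1/1024 s; the real units come from MSR_RAPL_POWER_UNIT (0x606) and should be read from the machine. `EXAMPLE_RAW` is built by hand to reproduce the output above; it is not a value read from the T580.

```python
# Sketch: decode a raw MSR_PKG_POWER_LIMIT (0x610) value into PL1/PL2 settings.
# Assumption: power unit 1/8 W and time unit 1/1024 s (check MSR 0x606).

POWER_UNIT_W = 1 / 8      # assumed
TIME_UNIT_S = 1 / 1024    # assumed


def decode_limit(field):
    """Decode one 32-bit half of the register (PL1 = low half, PL2 = high half)."""
    power_raw = field & 0x7FFF            # bits 14:0, power limit
    enabled = bool((field >> 15) & 1)     # bit 15, limit enabled
    clamp = bool((field >> 16) & 1)       # bit 16, clamping enabled
    y = (field >> 17) & 0x1F              # bits 21:17, time window exponent
    z = (field >> 22) & 0x3               # bits 23:22, time window fraction
    return {
        "watts": power_raw * POWER_UNIT_W,
        "seconds": (2 ** y) * (1 + z / 4) * TIME_UNIT_S,
        "enabled": enabled,
        "clamp": clamp,
    }


def decode_pkg_power_limit(raw):
    """Return (PL1, PL2) dictionaries for a 64-bit register value."""
    return decode_limit(raw & 0xFFFFFFFF), decode_limit((raw >> 32) & 0xFFFFFFFF)


# Hand-built example matching the turbostat lines quoted above.
_PL1 = 232 | (1 << 15) | (1 << 16) | (14 << 17) | (3 << 22)   # 29 W, 28 s, clamp on
_PL2 = 352 | (1 << 15) | (1 << 17) | (1 << 22)                # 44 W, ~2.44 ms, clamp off
EXAMPLE_RAW = (_PL2 << 32) | _PL1

if __name__ == "__main__":
    for name, limit in zip(("PL1", "PL2"), decode_pkg_power_limit(EXAMPLE_RAW)):
        print(name, limit)
```

On a real system the raw value would come from something like `sudo rdmsr 0x610`, which prints hex without a `0x` prefix.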

2.) As my laptop also has a dedicated Nvidia GPU, there's a heat problem when using your script with the installation defaults. The system runs stable and fast, but the dGPU has a fall-off temperature of 76°C. As soon as the GPU reaches this value, it gets throttled to a third of its frequency (around 400 MHz instead of 1600). The strange thing is that the GPU only varies its frequency within a range of 100 MHz (1600-1700) when running glxspheres64, for instance. So the GPU, unable to throttle itself down further in multiple steps, reaches 76° quite fast, because the machine's cooling system can't vent the extra heat that comes with the higher performance under lenovo-fix.

I stumbled upon this because phoronix/openarena drops from approx. 120 FPS to approx. 35 FPS.

The GPU stays clocked at around 400 MHz until it gets back below 60°, which doesn't happen while the CPU is producing heat.

So the installation defaults do not fit this machine when it is used with the dGPU.
I tried undervolting, which makes the system last longer before the GPU's fall-off is reached, but the system still gets too hot!

Using this config:

[GENERAL]
Enabled: True

## Settings to apply while connected to AC power
[AC]
# Update the registers every this many seconds
Update_Rate_s: 5
# Max package power for time window #1
PL1_Tdp_W: 15
# Time window #1 duration
PL1_Duration_s: 28
# Max package power for time window #2
PL2_Tdp_W: 44
# Time window #2 duration
PL2_Duration_S: 0.002
# Max allowed temperature before throttling
Trip_Temp_C: 93
# Set HWP energy performance hints to 'performance' on high load (EXPERIMENTAL)
HWP_Mode: False
# Set cTDP to normal=0, down=1 or up=2 (EXPERIMENTAL)
cTDP: 0

[UNDERVOLT]
# CPU core voltage offset (mV)
CORE: -120
# Integrated GPU voltage offset (mV)
GPU: -100
# CPU cache voltage offset (mV)
CACHE: -120
# System Agent voltage offset (mV)
UNCORE: -120
# Analog I/O voltage offset (mV)
ANALOGIO: 0

With glxspheres64 the CPU runs at around 3800 MHz, 15 W and 80°, which is nice. But the dGPU keeps heating up, as it is never clocked below 1657 MHz, and reaches its thermal limit of 76° with a hard fall-off.

So I somehow need a way to solve this: once I bring the dGPU into the mix, the CPU needs to draw less power so as not to overwhelm the cooling system, whereas performance without heavy dGPU work is nice and stable...

2a) When heating up, the touchpad becomes unresponsive until I end glxspheres!? No entries in the journal about that!?

3.) Is my following assumption correct?
The I7-8550U is specified with a 15 W TDP. cTDP up would be 25 W. So if I set PL1_Tdp=25, do I need to set cTDP=2?

And with a TDP of 15 W, isn't the 44 W set in the conf far above the TDP? Or am I missing something?

Tests with s-tui and mprime -t showed that even with cTDP=1, the system uses approx. 30 W at around 3300 MHz and around 88° in the s-tui stress test. So that's double the TDP and three times the cTDP (10 W when down).

4.) I need a little help regarding PL1/PL2. If I set PL1/PL2 to 29/44 W and use stress from s-tui, I get 3700 MHz for about half a minute, then it drops to around 3400 MHz and 28 W. That's expected, as PL1 is 29 W. The stress test then keeps running continuously at this power and frequency. No further throttling down to 15 W occurs?! The system is undervolted:

[UNDERVOLT]
# CPU core voltage offset (mV)
CORE: -120
# Integrated GPU voltage offset (mV)
GPU: -100
# CPU cache voltage offset (mV)
CACHE: -120
# System Agent voltage offset (mV)
UNCORE: -120
# Analog I/O voltage offset (mV)
ANALOGIO: 0

My system is: T580, I7-8550U, dGPU, 16GB Ram, Bionic 18.04, nvidia-396

Uatschitchun commented 6 years ago

Regarding 2a): journalctl:

Aug 31 17:09:34 T580-Test kernel: thinkpad_acpi: unknown possible thermal alarm or keyboard event received
Aug 31 17:09:34 T580-Test kernel: thinkpad_acpi: unhandled HKEY event 0x6032
Aug 31 17:09:34 T580-Test kernel: thinkpad_acpi: please report the conditions when this event happened to ibm-acpi-devel@lists.sourceforge.net

I'll report that... Right after stopping glxspheres, the touchpad is responsive again!

Uatschitchun commented 6 years ago

Regarding 2.) Without lenovo-fix, the power gets adjusted dynamically, down to 7 W and 2400 MHz, keeping the dGPU from hitting the 76° fall-off.

Is there some way to mimic this within this fix?
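One way to picture such a behaviour is a small governor that lowers the CPU's PL1 budget as the dGPU approaches its fall-off temperature. The sketch below only shows the mapping from GPU temperature to a PL1 target. Nothing in this thread indicates that lenovo-fix does this; the function name and the ramp end of 74° are hypothetical, while 29 W, 7 W and 60° are the figures mentioned in the thread.

```python
# Sketch of a hypothetical GPU-aware PL1 governor (not part of lenovo-fix).

def pl1_for_gpu_temp(gpu_temp_c, pl1_max_w=29.0, pl1_min_w=7.0,
                     ramp_start_c=60.0, ramp_end_c=74.0):
    """Linearly reduce the CPU PL1 budget as the dGPU nears its fall-off temperature.

    Below ramp_start_c the CPU keeps its full budget; at or above ramp_end_c
    (chosen just under the 76 degree fall-off) it gets only pl1_min_w.
    """
    if gpu_temp_c <= ramp_start_c:
        return pl1_max_w
    if gpu_temp_c >= ramp_end_c:
        return pl1_min_w
    fraction = (gpu_temp_c - ramp_start_c) / (ramp_end_c - ramp_start_c)
    return pl1_max_w - fraction * (pl1_max_w - pl1_min_w)


if __name__ == "__main__":
    for temp in (55, 60, 67, 74, 76):
        print(temp, "C ->", pl1_for_gpu_temp(temp), "W")
```

A real implementation would still need to poll the GPU temperature and write the resulting limit on every update interval, and it would compete with whatever the EC decides on its own.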

DEvil0000 commented 6 years ago

1) Yes, it will not get reset. You need to power off your machine to reset the values; a reboot may not be enough. You may want to disable the service so they do not get applied next time. However, there may also be some other software changing the values (e.g. thermald).

2) I don't know what can be configured for the GPU or how much heat the cooling of your model can handle. Some other software, like thermald, may also throttle it. You can try setting Trip_Temp_C to a lower temperature.

2a) Sounds like too high a temperature on some peripheral IC, or possibly a driver issue. Or, again, other software interfering.

3) Correct. As stated in the comment in the config file, 2 corresponds to cTDP up (25W). However, the turbo is allowed to go beyond that for a short period of time, and that's why you may set PL to 44W.

4) First of all, your undervolting might be too much. Most Lenovo laptops seem to get unstable at about -90, -70, -90, -70, -50 (or around that). Besides that, this sounds good.

Hope that helps you investigate it further. Your issue is quite interesting. edit: Also try --debug

erpalma commented 6 years ago
  1. That thermal limit is ridiculous for a laptop. Can you check with GPU-Z (on Windows) what the maximum temperature for the GPU is? It should be 94/97 °C. I guess that, like the CPU, the GPU is also throttled too early, but I don't know if we can somehow force that limit higher. Also, sharing the same heat pipe and fan between the CPU and the GPU is a major limiting factor on performance. If I understand correctly, you are asking for a way to monitor the GPU usage in order to limit the CPU temperature when both are used, right?

  2. On the 8550U the standard 15W cTDP allows you to reach a base frequency of 1.8 GHz, while setting cTDP up to 25W should raise it to 2.0 GHz. On the other hand, PL1/2 are used to set the upper power usage limit when turbo frequencies are in use. There is no real limit (kind of) to how high you can push these values, as long as your cooling system can handle the heat and your power circuitry the requested current.
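The "3700 MHz for about half a minute, then 28 W" pattern from question 4 fits a simple picture of how PL1 and PL2 interact: the package may draw up to PL2 until a running average of its power reaches PL1. The sketch below models that average as an exponentially weighted one with the PL1 time window as its time constant. That is a common approximation, not Intel's documented algorithm, and the 2 W starting average (an idle package) is an assumption.

```python
import math

# Simplified model: how long can the package sit at PL2 before the running
# average of its power reaches PL1? The average is modelled as an EWMA with
# time constant tau_s; this is an approximation, not a hardware specification.


def seconds_at_pl2(pl1_w, pl2_w, tau_s, start_avg_w):
    """Time until the modelled average climbs from start_avg_w to pl1_w while drawing pl2_w."""
    if not start_avg_w < pl1_w < pl2_w:
        raise ValueError("expected start_avg_w < pl1_w < pl2_w")
    # avg(t) = pl2 - (pl2 - start) * exp(-t / tau); solve avg(t) = pl1 for t.
    return tau_s * math.log((pl2_w - start_avg_w) / (pl2_w - pl1_w))


if __name__ == "__main__":
    # Values from the thread: PL1 = 29 W, PL2 = 44 W, 28 s window; 2 W idle is assumed.
    print(round(seconds_at_pl2(29.0, 44.0, 28.0, 2.0), 1), "s")
```

With these inputs the model gives a little under 29 seconds, which is in the same range as the half minute reported. A load that never actually reaches 44 W would stay in turbo longer than this estimate.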

Uatschitchun commented 6 years ago

Thx for the prompt answer ;)

No thermald, no tlp (atm).

The thing is, from my investigations, that if I'm using the fix, power gets "fixed" at that value, whereas without the fix the system adapts power slightly up and down according to the dGPU temperature, so it won't reach its 76° fall-off temp. This makes sense, as two devices instead of one are heating up when load is put on the dGPU.

What I don't get is why the dGPU doesn't get throttled more, prior to falling off?

I'm monitoring the dGPU with nvidia-smi. It states:

    Temperature
        GPU Current Temp            : 57 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : 94 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Clocks
        Graphics                    : 1683 MHz
        SM                          : 1683 MHz
        Memory                      : 3003 MHz
        Video                       : 1506 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 3004 MHz
        Video                       : 1708 MHz

So yes, the temps are shown as you state them. The max graphics clock (1911 MHz) isn't ever reached? The max clock I get is around 1700, and it throttles by no more than 100 MHz. I've found some documentation regarding this 76° fall-off. (lemme see...) It seems to be some kind of EC throttling set by Lenovo.

So by using the fix, I'm losing the frequency/power dynamic that is needed to keep the dGPU below 76°, as if some kind of automatic control gets disabled.

So I'd like to use the fix, as it gives a good amount of power to the laptop, but still be able to use the dGPU without hitting the thermal limit.
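For logging runs like these, the nvidia-smi report quoted above can be turned into a dictionary with a few lines of Python. This is a minimal sketch that assumes the indented "section, then key : value" layout shown in this comment; `SAMPLE` is an excerpt of that output, and the function names are made up for illustration.

```python
import re

# Excerpt of the nvidia-smi report quoted in this comment.
SAMPLE = """\
    Temperature
        GPU Current Temp            : 57 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 97 C
        GPU Max Operating Temp      : 94 C
    Clocks
        Graphics                    : 1683 MHz
    Max Clocks
        Graphics                    : 1911 MHz
"""


def parse_sections(text):
    """Group 'key : value' lines under the most recent line that has no colon."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if current is not None:
                sections[current][key] = value
        else:
            current = line.strip()
            sections[current] = {}
    return sections


def leading_number(value):
    """Return the integer at the start of a value such as '57 C', or None (e.g. for 'N/A')."""
    match = re.match(r"\d+", value)
    return int(match.group()) if match else None


if __name__ == "__main__":
    report = parse_sections(SAMPLE)
    slowdown = leading_number(report["Temperature"]["GPU Slowdown Temp"])
    print("Driver slowdown temp:", slowdown, "C; observed fall-off: 76 C")
```

Comparing the driver's own slowdown temperature (97 C) with the observed 76° fall-off makes the gap explicit: the limit that bites here is not one the driver reports.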

Uatschitchun commented 6 years ago

www.reddit.com/r/thinkpad/comments/8flj0i/t480_power_limit_throttles_down_to_5_watts_on/

Uatschitchun commented 6 years ago

Could this be helpful, to reset to defaults, when service is stopped?

Default power limits can be found in the PKG_PWR_SKU MSR (614h)

www.technodocbox.com/PC_Support/74817174-8th-generation-intel-processor-family-for-s-processor-platforms.html See below the PDF No. 91

DEvil0000 commented 6 years ago

Throttling at 76°C is way too early. 90 I would understand. Since it states 94 for max operating temp, I would cap the CPU at max 90 or 92, since they share the cooling. Maybe even 85 or so. Changing the fan config with thinkfan may also help you.

DEvil0000 commented 6 years ago

0x614 is a good idea to have a look at, but my I7-8550U does not have it, so I guess you do not have it either. Try reading/writing it with rdmsr/wrmsr.

Uatschitchun commented 6 years ago
$ sudo rdmsr 0x614
78

Or how is it done?
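A note on reading that result: rdmsr prints hexadecimal without a `0x` prefix, so `78` is 0x78 = 120. In Intel's layout for MSR 0x614 the low 15 bits hold the thermal spec power in RAPL power units, and with the usual 1/8 W unit (an assumption here; the unit is defined in MSR 0x606) that works out to 15 W, the TDP of the I7-8550U. A small sketch of the decoding:

```python
# Sketch: decode the rdmsr output for MSR 0x614 (package power info).
# Assumption: power unit of 1/8 W; confirm via MSR_RAPL_POWER_UNIT (0x606).

POWER_UNIT_W = 1 / 8  # assumed


def decode_pkg_power_info(rdmsr_output):
    """Decode the hex string printed by `rdmsr 0x614` into watt values."""
    raw = int(rdmsr_output, 16)  # rdmsr prints hex without a 0x prefix
    return {
        "tdp_w": (raw & 0x7FFF) * POWER_UNIT_W,          # bits 14:0
        "min_w": ((raw >> 16) & 0x7FFF) * POWER_UNIT_W,  # bits 30:16
        "max_w": ((raw >> 32) & 0x7FFF) * POWER_UNIT_W,  # bits 46:32
    }


if __name__ == "__main__":
    print(decode_pkg_power_info("78"))
```

If that reading is right, the register reports the 15 W TDP rather than the 29 W / 44 W limits seen in turbostat, and its minimum and maximum fields are zero on this machine, so it would not directly provide the values to restore when the service stops.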

DEvil0000 commented 6 years ago

can you write the same value back? just as a test?

erpalma commented 6 years ago

Sorry guys but I don't get how a CPU MSR can actually influence a dGPU. PACKAGE_POWER_SKU should be related to the min/max package power, that is the combined power of the CPU+iGPU+cache+whatever in the die.

The problem with the MX150 throttling is probably related to a setting in the dGPU bios or in the system EC/BIOS. Can you force max performance from the nVidia control panel? I guess it won't make any difference but worth a try.

Uatschitchun commented 6 years ago

I'll try writing 78 to 0x614. But that's only in regard to 1.)

No difference whether the dGPU is set to performance or adaptive. I guess the dGPU's own throttling mechanisms kick in when it reaches its max temperature (around 94°). As 76° is far from that max, there seems to be some kind of magic that throttles the CPU's power (down to 3-4W!! see the reddit post I linked) to prevent the dGPU from reaching its fall-off. With the fix, this magic seems to be disabled. I'll provide some screenshots from s-tui with and without the fix.

There's also a German post in the Lenovo forum, and from the looks of it, Lenovo is aware of the problem. I'll post a link, as I guess I'm not the only German here :-)

Uatschitchun commented 6 years ago

https://forums.lenovo.com/t5/T4-T5-und-neuere-T-Serie/T480-mit-MX150-und-i5-Extremes-CPU-throttling-auf-200-mhz/td-p/4042154

DEvil0000 commented 6 years ago

@erpalma you are right, 0x614 should not fix the GPU throttling, but I am curious. However, as far as I understand, the CPU could send a signal to the GPU to throttle, like PROCHOT. @Uatschitchun Lenovo may be aware, but it looks like they do not care.

When I did a quick search on google with "mx150 thermal throttle" or similar I found a lot of posts about people having this issue. Their solutions or hints on the issue are very different:

DEvil0000 commented 6 years ago

@Uatschitchun I don't know if that applies, as I don't know much about nvidia on Linux. Did you try something like nvidia-smi -q -d PERFORMANCE to see the GPU throttle reasons? (link: nvidia-smi)

erpalma commented 6 years ago

I would take a look at the nVidia Coolbits feature. You should be able to disable performance levels completely.

Uatschitchun commented 6 years ago

Tried both! During my tests, when the dGPU reaches its fall-off temp, no reason is given for the throttling! All in all, very little information can be obtained from the MX150 with nvidia-smi. Supported clocks doesn't work, and so far I haven't managed to get an xorg.conf running that enables Coolbits, because of the dual-card setup it needs ;(

I'll upload some screenshots later from tests/benchmarks I've done.

Uatschitchun commented 6 years ago

Here are the phoronix results (at higher resolutions, temps set too high in the conf result in the fall-off, which is clearly visible in the results): www.openbenchmarking.org/result/1809079-AR-75GRAD88872

Here are screenshots of running stress, prime & glxsphere. https://www.dropbox.com/sh/s44z07yuhhivi3j/AAD47959bjdDpzplr97u-gXNa?dl=0

nariox commented 6 years ago

Hello all, I'm having a similar issue with my T480. The nvidia-smi command didn't show anything odd, but then I tried running it once every second and then it appears that the driver is switching one of the conditions on and off rapidly (SW Power Cap : Active).

My guess is that there's nothing we can do unless NVIDIA releases some documentation or Lenovo updates the BIOS to remove the limitation. I've tried with a different (supposedly 87W) adapter and the results are the same, so my guess is that the power adapter is not the problem (or at least not the only problem)
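To catch a flag that flips on and off this quickly, it can help to log it together with temperature and clock once per second. The sketch below uses nvidia-smi's CSV query mode; the field names are the ones I know from `nvidia-smi --help-query-gpu`, but they can differ between driver versions, so treat them as an assumption and check the list on your system. On GPUs that do not report a field, the value is not a number and `parse_sample` will raise.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu` on drivers I know of;
# verify them for your driver version.
QUERY = "temperature.gpu,clocks.gr,clocks_throttle_reasons.sw_power_cap"


def parse_sample(csv_line):
    """Parse one 'temp, clock, flag' line from --format=csv,noheader,nounits."""
    temp, clock, sw_cap = (field.strip() for field in csv_line.split(","))
    return {
        "temp_c": int(temp),
        "clock_mhz": int(clock),
        "sw_power_cap": sw_cap == "Active",
    }


def read_sample():
    """Run nvidia-smi once and return the parsed sample for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_sample(result.stdout.splitlines()[0])
```

Calling `read_sample()` in a loop with `time.sleep(1)` and printing each result next to a timestamp should show whether the SW Power Cap flag lines up with the clock drops.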

erpalma commented 6 years ago

Hmm interesting. Can you please try this: while true; do nvidia-smi -pl 25; sleep 1; done

And see if you get better results?

edit: this is ugly I know..

DEvil0000 commented 6 years ago

while true; do nvidia-smi -pl 25; sleep 1; done

25W is not that much for a GPU

erpalma commented 6 years ago

It's the nominal TDP of the MX150

nariox commented 6 years ago

Unfortunately I get Changing power management limit is not supported for GPU: 00000000:01:00.0.

erpalma commented 6 years ago

You need to set the coolbits with nvidia-xconfig first.

nariox commented 6 years ago

I'm using the right cool bits (please correct me if I'm wrong). On optirun/primus, I have set up the cool bits on /etc/bumblebee/xorg.conf.nvidia, not sure if that matters, but cat /var/log/Xorg.8.log gives me: [ 409.371] (**) NVIDIA(0): Option "Coolbits" "28"

I've also tried it running under nvidia-xrun to similar results.

erpalma commented 6 years ago

Hmm too bad. I don't have access to a T480/580 with discrete GPU so I can't be more helpful sorry :/

nariox commented 6 years ago

That's alright. Unfortunately, the nvidia-smi tool on Linux doesn't support undervolting (as far as I understand), so we are stuck. My best guess would be to flash a custom vBIOS, but that doesn't guarantee the EC is not the one power limiting the GPU.

nariox commented 5 years ago

Seems like BIOS 1.17 changed the behavior, now the dGPU and CPU seem to be limited in power so that the dGPU never reaches thermal throttling (because it is always being throttled a little). Running furmark seems to reduce the maximum power on the CPU to about 5W, but closing it seems to return the CPU to full power. My guess is that the dGPU now takes priority over the power being drawn.

Since it seems like the BIOS/EC are the ones doing the management and there is no Nvidia tool to manage the power of the GPU, I think it would be alright to close this issue, because there's not much lenovo-throttling-fix can do about it.

Uatschitchun commented 5 years ago

Sadly, there's only a 1.16 update for the T580, and it states nothing regarding dGPU/CPU ;(

nariox commented 5 years ago

The T480's update doesn't mention anything relevant either (except for maybe: Fixed an issue where the system might display the error message "The connected AC adapter has a lower wattage than the recommended AC adapter"., which is present in both updates). So, it might be worth trying it out.

Surprising that the T480/T580 don't use the same BIOS.

erpalma commented 5 years ago

Running furmark seems to reduce the maximum power on the CPU to about 5W, but closing it seems to return the CPU to full power.

Can you still force the CPU to full power with this script?

BTW can you check the new --monitor feature?

nariox commented 5 years ago

Hi Francesco, sorry for the delay, end of semester craze. ):

I've since bought a USB-IF certified 90W supply to rule out the power supply. But the results are the same.

Also, to add to my previous post, I've found that the GPU will still throttle after running furmark for a while, even if CPU is not being stressed and the temperature doesn't get too high. My guess is that the EC freaks out if the GPU is pushed too hard for too long and heavily throttles it. I wonder if this is due to insufficient power circuitry, in any case, I wonder if the 10W MX150 wouldn't have made more sense for this laptop given these issues.

The --monitor feature works great. I never knew my core voltages were so low! :) For CPU throttling, it seems to report the reason for throttling. Unfortunately, it seems like the GPU throttling is not shown and the "power split" CPU throttling is also not shown. But again, I suspect this is happening deep in the EC.

erpalma commented 5 years ago

I think the T580 is very very similar to the T480 both in the cooling system and power circuitry, which are not suitable to really sustain a full CPU+dGPU workload. This is the reason why I chose the T480s with iGPU only. I'm really curious to measure the vregs temperature under full load on our machines... I guess it would be very high.

nariox commented 5 years ago

What's surprising to me is that even when the CPU is not being stressed, we can still reach this situation. I don't have a way of measuring the GPU consumption, but given the TDP, I would assume it is around 25W, whereas the CPU seems to be fine consuming more than that.

At least, it seems like in the throttled state, the MX150 is at least as fast as the iGPU. Which is... okay, I guess?

I wonder if the T480s with the 10W variant MX150 also experience this.

nariox commented 4 years ago

Seems like we can close this (as a won't fix, unfortunately).

According to LINK, it seems the EC is limiting the overall temperature at below 75C, because on Linux the "is_on_lap" sensor is not implemented, so (for UL spec compliance) it limits the temperature.

Since throttled affects only the CPU, the GPU is still bound by the EC-imposed limit. But it seems like Lenovo might implement a fix (X1 Carbon 7th, Yoga 4, X390 and T490 already have it). Keep an eye on LVFS.