IGCIT / Intel-GPU-Community-Issue-Tracker-IGCIT

IGCIT is a Community-driven issue tracker for Intel GPUs.
GNU General Public License v3.0

Enabling max power saving (ASPM L1) can sometimes leave the GPU short of power and cause a BSOD. #762

Open el-psy-k opened 5 months ago

el-psy-k commented 5 months ago

Checklist [README]

Application [Required]

Any

Processor / Processor Number [Required]

AMD Ryzen5 7500F

Graphic Card [Required]

Intel A770LE 16GB

GPU Driver Version [Required]

31.0.101.5382

Other GPU Driver version

No response

Rendering API [Required]

Windows Build Number [Required]

Other Windows build number

No response

Intel System Support Utility report

703

Description and steps to reproduce [Required]

I finally found the cause of the BSOD (#703). Upgrading the GPU drivers (5333/5382) and replacing the PSU (Seasonic FOCUS GX 750) didn't help; the system still hit random BSODs when left powered on 24/7.

I found the same cases reported by other users. Someone suggested turning off maximum power saving; I tried it and have not had a BSOD since.

Refs: how to activate max power saving (ASPM L1), reddit #1, reddit #2, reddit #3, reddit #4

Note: increasing the TDR delay only postpones the BSOD; it doesn't prevent it.
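The workaround described above (turning off maximum power saving for PCI Express) can also be applied from a script. The following is a minimal Python sketch, not an official Intel or Microsoft procedure. It assumes the `powercfg` aliases `SUB_PCIEXPRESS` and `ASPM` are available (they can be checked with `powercfg /aliases`) and that the setting indices 0, 1 and 2 correspond to Off, Moderate and Maximum power savings. The function names are hypothetical and chosen only for illustration. By default it only prints the commands it would run.

```python
import subprocess
import sys

# Aliases as listed by `powercfg /aliases` (assumed to exist on this Windows build).
SUBGROUP = "SUB_PCIEXPRESS"  # "PCI Express" power settings subgroup
SETTING = "ASPM"             # "Link State Power Management"

# Assumed index mapping for the Link State Power Management setting.
LEVELS = {"off": 0, "moderate": 1, "maximum": 2}


def aspm_commands(level: str) -> list[list[str]]:
    """Build the powercfg commands that set Link State Power Management
    for both AC and battery power in the currently active plan."""
    value = str(LEVELS[level])
    return [
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT", SUBGROUP, SETTING, value],
        ["powercfg", "/setdcvalueindex", "SCHEME_CURRENT", SUBGROUP, SETTING, value],
        ["powercfg", "/setactive", "SCHEME_CURRENT"],  # re-apply the active plan
    ]


def apply_aspm(level: str, dry_run: bool = True) -> list[list[str]]:
    """Print the commands (dry run, or any non-Windows host) or execute them."""
    commands = aspm_commands(level)
    for cmd in commands:
        if dry_run or sys.platform != "win32":
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: show what would be executed to turn the setting off.
    apply_aspm("off")
```

Calling `apply_aspm("off", dry_run=False)` from an elevated prompt would execute the commands; `apply_aspm("maximum", dry_run=False)` restores the setting that triggers the issue, which is useful when trying to reproduce it.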

Device / Platform

MAG B650M MORTAR WIFI

Crash dumps [Required, if applicable]

No response

Application / Windows logs

No response

freak2fast4u commented 5 months ago

Same here, with a 7800X3D on an ASUS TUF GAMING B650-PLUS WIFI motherboard + Arc A770 LE. It even bricked Windows 10 in my case, so I had to remove the GPU, boot with graphics from the APU and repair the OS so it would boot again ... I sure ain't going to try this one again until Intel acknowledges an official fix further down the line ^^"

To be clear, ASPM was enabled in the BIOS, and the system BSOD'd instantly when I enabled PCIe maximum power savings in Windows, then boot-looped over and over again.

Vivek-Intel commented 5 months ago

@el-psy-k Thanks for reporting it. I will check on my end. Meanwhile, can you share the BSOD dump if you have collected one with the latest driver?

el-psy-k commented 5 months ago

@Vivek-Intel Just turned on max power saving now, will post here as soon as I collect the BSOD dump.

dieselistus commented 5 months ago

I also have this issue. My comment is even under one of the links in the main post. But I get a different BSOD message each time I enable ASPM (CRITICAL_PROCESS_DIED), and I don't know whether it can be analyzed under the same GitHub issue.

dieselistus commented 5 months ago

@Vivek-Intel I'm also willing to provide more details about the BSOD I've encountered, but I need instructions on exactly what is needed and how to collect this information. Thank you.

el-psy-k commented 5 months ago

@Vivek-Intel Here's the newest BSOD dump, taken with graphics driver 31.0.101.5445. Memory passed 48+ hours of memtest with 0 errors. 043024-16062-01.dmp

el-psy-k commented 4 months ago

@dieselistus See "generate SSU report" and "obtain crash dumps". You can open a new issue. If Intel engineers can replicate and fix it, all users will benefit.

Vivek-Intel commented 4 months ago

Hi @el-psy-k, I have been trying to reproduce this issue in our lab. I did not see it on my AMD + A770 setup with the settings described above while playing multiple games. I will run more trials with different screens and benchmarks to see whether the issue appears.

pcslide commented 4 months ago

@Vivek-Intel @el-psy-k I suppose the issue is related to certain versions of the BIOS (AGESA).

freak2fast4u commented 4 months ago

Alright, I was feeling a little frisky today, so I had another shot.

After flashing the latest bios for my mobo (which comes with AGESA 1.1.7.0 patch A) : no bsod this time, but a black screen with some white-line artifacts. For comparison, booting into linux with the same settings produces a black screen with horizontal white lines everywhere. Booting into Linux without ASPM works just fine.

My previous attempts were on Windows 10, this was on Windows 11. This time, I had ASPM enabled in bios (for L1 only), and power savings enabled in windows via power saving power profile, but monitor refresh rate still at 144hz. So far no crash, but GPU still using 40W idle. As soon as I dipped the refresh rate to 60hz (I guess ASPM kicked in at that precise moment) ... and I had a forever black screen >_<"

The odd-ball thing is, after flashing the bios using flashback (not ezflash) and entering the bios config, the same artifacts were there, laid all over the UI, making the whole thing unusable. I chalk this up to the bios having ASPM enabled by default (I checked this). So it's definitely not an OS issue, and not a driver issue either. It has to be strictly firmware related, based on what I've seen today, and I'm still suspecting the GPU's VBIOS, not necessarily the mobo's BIOS.

@Vivek-Intel : can you take this new information into account when testing on your side ? I'll try cross-testing with my RX 5700XT later on and will keep you updated.

el-psy-k commented 4 months ago

@pcslide I tried every BIOS version; it doesn't help.

el-psy-k commented 4 months ago

@freak2fast4u You can post your BSOD dump here. Is it caused by igdkmdnd64.sys?

Vivek-Intel commented 4 months ago

Hi @el-psy-k, I have kept my AMD host + A770 system under test over the weekend with multiple things running, maximum power saving on, and ASPM L1/L0 enabled. I will let you know if I see the issue on my end. I will also ask my team to try it on other hosts, as we do not have the exact same motherboard model as yours.

I am referring to the SSU you shared in the old case, but I hope you are using the latest driver and the latest BIOS. Can you share the VBIOS version of the GPU?

@freak2fast4u Thank you for testing. A blank screen after changing the refresh rate may or may not be the same as this issue, or ASPM specific. I would suggest trying another monitor if possible and creating a new thread so that we can isolate it better. I did try using a 144Hz monitor and scaling down the refresh rate, but I could not see the issue that you faced.

el-psy-k commented 4 months ago


@Vivek-Intel
IFWI (VBIOS): 20.0.1068
Graphics Driver Version: 31.0.101.5445
Motherboard BIOS Version: 7D76vAB (AGESA 1.1.0.2b)

I tried 7D76vAC (AGESA 1.1.0.2b Patch A); with ASPM L1 turned on, random BSODs occur as usual. 7D76vAC also has a bug that causes high idle CPU usage, so I rolled back to the previous version.

Please use HWiNFO to check the ASPM status and make sure it's L1 Entry.

The BSOD usually occurs when the GPU load changes frequently and the monitor is turned off and on multiple times. Suggestion: launch Chrome with the --enable-features=IntelVpSuperResolution command-line flag and play long videos at a resolution lower than your monitor's, so that the GPU load is constantly changing. Then use an automation tool such as AutoHotKey to turn the monitor off and on at intervals. This might help you reproduce the issue.
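The reproduction suggestion above can be sketched as a small script. This is a rough Python sketch that stands in for the AutoHotKey automation mentioned in the comment; it is not the reporter's actual tooling. The Chrome path, the video URL, the cycle timings and all function names are assumptions chosen for illustration. On Windows it posts the standard `WM_SYSCOMMAND` / `SC_MONITORPOWER` message to switch the display off, and nudges the mouse to wake it, since waking via the message alone may not work on every system. On other platforms it only prints what it would do.

```python
import ctypes
import subprocess
import sys
import time

# Win32 constants for the monitor power message.
HWND_BROADCAST = 0xFFFF
WM_SYSCOMMAND = 0x0112
SC_MONITORPOWER = 0xF170
MONITOR_OFF = 2
MONITOR_ON = -1
MOUSEEVENTF_MOVE = 0x0001


def chrome_command(chrome_path: str, video_url: str) -> list[str]:
    """Chrome command line with the flag quoted in the comment above."""
    return [chrome_path, "--enable-features=IntelVpSuperResolution", video_url]


def monitor_schedule(cycles: int, off_seconds: float, on_seconds: float):
    """Return [(state, seconds_to_hold), ...] alternating off and on."""
    steps = []
    for _ in range(cycles):
        steps.append(("off", off_seconds))
        steps.append(("on", on_seconds))
    return steps


def set_monitor(state: str) -> None:
    """Switch the display off or on (Windows only; prints elsewhere)."""
    if sys.platform != "win32":
        print(f"[dry run] monitor {state}")
        return
    from ctypes import wintypes
    user32 = ctypes.windll.user32
    user32.PostMessageW.argtypes = [
        wintypes.HWND, wintypes.UINT, wintypes.WPARAM, wintypes.LPARAM
    ]
    power = MONITOR_OFF if state == "off" else MONITOR_ON
    user32.PostMessageW(HWND_BROADCAST, WM_SYSCOMMAND, SC_MONITORPOWER, power)
    if state == "on":
        # Assumption: a small mouse movement helps wake the display.
        user32.mouse_event(MOUSEEVENTF_MOVE, 0, 1, 0, 0)


def run_repro(chrome_path: str, video_url: str, cycles: int = 100,
              off_seconds: float = 30.0, on_seconds: float = 60.0) -> None:
    """Play a video in Chrome while cycling the monitor off and on."""
    browser = subprocess.Popen(chrome_command(chrome_path, video_url))
    try:
        for state, seconds in monitor_schedule(cycles, off_seconds, on_seconds):
            set_monitor(state)
            time.sleep(seconds)
    finally:
        browser.terminate()


if __name__ == "__main__":
    # Print the plan only; call run_repro(...) explicitly to execute it.
    print(chrome_command("chrome.exe", "https://example.com/long-video"))
    print(monitor_schedule(2, 30.0, 60.0))
```

Since the BSOD is reported as intermittent, a long run (many cycles with ASPM L1 confirmed active) is probably needed before drawing any conclusion from a clean result.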

el-psy-k commented 4 months ago

@Vivek-Intel BSOD again. The system was idle and the monitor was in sleep mode when it happened. 050624-16046-01.dmp

Vivek-Intel commented 4 months ago

Thanks @el-psy-k. I am checking with the developers, but please know that this issue is inconsistent: I was able to reproduce it only once while using the system continuously for the past week. It might take time if the developers need live debugging or more information to root-cause it.

Vivek-Intel commented 4 months ago

I have opened an issue with the engineering team (bug ID 15016023487) for your reference. We cannot commit to a timeline for progress on this issue, but we will keep you all updated if there is any news.

Vivek-Intel commented 1 week ago

Hi @el-psy-k, can you please test the latest driver (101.6077) and see if the issue still happens?

el-psy-k commented 1 week ago


https://github.com/user-attachments/assets/48eec383-192a-442c-b00f-1a54f22ff773

092024-15328-01.dmp

@Vivek-Intel After I turned on ASPM L1, a BSOD occurred about a day later (clean install with DDU). This time my phone was right next to me, so I recorded the screen when the BSOD occurred; the video may be blurry.

bios: 7D76vAH(AGESA ComboPI 1.2.0.0a Patch A)