IGCIT / Intel-GPU-Community-Issue-Tracker-IGCIT

IGCIT is a Community-driven issue tracker for Intel GPUs.
GNU General Public License v3.0

Random BSODs after latest driver update #677

Closed debjit-das closed 8 months ago

debjit-das commented 8 months ago

Checklist [README]

Game [Required]

Witcher 3, GTA V, the Hunter COTW

Game Platform [Required]

Other game platform

No response

Processor / Processor Number [Required]

i5 11400f

Graphic Card [Required]

A750 LE

GPU Driver Version [Required]

31.0.101.5085

Other GPU Driver version

No response

Rendering API [Required]

Windows Build Number [Required]

Other Windows build number

No response

Intel System Support Utility report

ssulog_17Jan24.txt

Description and steps to reproduce [Required]

I decided to take the plunge and installed the latest 31.0.101.5081/31.0.101.5122 driver for my A750 LE. The installation failed midway and then I had to DDU and install it again. The wallpaper went black on the next reboot after the installation. Launched my game and it crashed with NO SIGNAL on my display after just 2 mins of gameplay.
I searched the Event Viewer and found that the bugcheck was 0x00000116 or 0x0000007e. (The bugcheck error code "0x00000116" refers to a VIDEO_TDR_ERROR, which is a Blue Screen of Death (BSOD) error in Windows. This error typically indicates a problem with the graphics hardware or graphics driver.) I tried rolling back to previous drivers (4972 and 5074), but to no avail. I cannot play any game for more than 5 mins now: every game ends in a BSOD at a random time, sometimes after 2 mins and sometimes after 5. I checked all temps on the overlay while playing and nothing goes above 60-65.
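For anyone comparing codes from their own Event Viewer, here is a minimal Python sketch that normalizes the two bugcheck values quoted above and prints a name for each. The names are the ones used in Microsoft's bug check code reference (which lists 0x116 as VIDEO_TDR_FAILURE, where the post above says VIDEO_TDR_ERROR); the table and the `describe_bugcheck` helper are illustrative and cover only these two codes.

```python
# Only the two codes mentioned in this report are listed here.
BUGCHECK_NAMES = {
    0x116: "VIDEO_TDR_FAILURE",
    0x7E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
}


def describe_bugcheck(code_text: str) -> str:
    """Turn an Event Viewer style code such as '0x00000116' into 'code: name'."""
    code = int(code_text, 16)  # base 16 accepts the '0x' prefix and either letter case
    name = BUGCHECK_NAMES.get(code, "unknown (not in this small table)")
    return f"0x{code:08X}: {name}"


for text in ("0x00000116", "0x0000007e"):
    print(describe_bugcheck(text))
```

Two different codes across crashes is worth mentioning in a report, since 0x116 points at the display driver timing out while 0x7E is a more general unhandled exception in a system thread.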

Game graphic quality [Required]

Game resolution [Required]

1920x1080

Game VSync [Required]

On

Game display mode [Required]

Detailed game settings [Required]

playing on Medium to High settings

Device / Platform name

No response

Crash dumps [Required, if applicable]

No response

Save game

It is happening across all games

debjit-das commented 8 months ago

Here is a video of the issue I just uploaded to youtube. Apologies for the poor video as I had to use the mobile for capturing.

https://youtu.be/OUBLI6RWYOc?si=koaKO9eFjCw9StCD

You may notice that the sound continues to play for some time even after the display has crashed. Then the system finally crashes and recovers with a reboot. No buttons were touched after the "continue" button in the game.

freak2fast4u commented 8 months ago

Aaahhhhh F. I was gonna install 5085 to check if another problem of mine was fixed, seems like a bad idea lmao. Well, I'll proceed anyway, I want answers lol. I have a 7800X3D and an A770 LE, running on Windows 10, so results might vary, which will be interesting. I happen to also own The Witcher 3 so I'll try that out.

I'm running off a TUF GAMING B650-PLUS WIFI mobo with BIOS 2214 released on 4th January 2024. I see you have a TUF GAMING B560-PLUS WIFI mobo with BIOS 2001 released 2nd March 2023 (latest available still today). Also slightly off-topic, I hope Asus rolls out an update for your mobo regarding the LogoFAIL vulnerability, it's wild.

Have you tried disabling XMP ? I'm not saying you should have to, I mean trying this as a comparison, DDR5 support is fragile on certain motherboards, though I suspect you might have noticed that already if that was the case.

Also, the fact that reinstalling older drivers doesn't fix the issue tells me it has to do either with a windows update that happened at the same time (maybe you can track that down and remove it, I have done this in the past for Windows 10 updates breaking Arc Control completely, surely Windows 11 will allow you to do that too), or that it's VBIOS related. Someone reported a way to downgrade the VBIOS down to an earlier version, but please be super careful about which files you pick, else you might brick your GPU : https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT/issues/430#issuecomment-1870046788

The project page is here : https://github.com/Solaris17/ARC-Firmware-Tool/
The full guide is here : https://github.com/Solaris17/ARC-Firmware-Tool/blob/master/docs/guide.md
The firmware files can be found here : https://github.com/Solaris17/Arc-Firmware

You can also find them inside the driver installer package from Intel, but they don't advertise older versions anymore, so unless you kept them over time, this repo will be the only place to get them. If the flashing log says the flashing timed out after 300s, reflash with the same files until you get an OK 100% and no timeouts, else you risk bricking your GPU, even if you selected the correct firmware and oprom files. You will get read errors for files you didn't pick in the UI, but that's OK.

I rolled my VBIOS back from 1068 to 1064 using firmware files from driver 3962 to fix some issues under Linux (artifacting, randomly unusable desktop environment), and other users had to roll back to fix issues with fan speed on their non-LE cards (notably Acer Predator branded), as you'll see in issue 430 I linked above. I was also going to try firmware files from 3975, which seems to have a more complete set of files for my card, but I'll keep that for another day.

freak2fast4u commented 8 months ago

@debjit-das : FF7 remake is a DX12 title by default, unless you specify -dx11 as a launch parameter. Do you happen to have this game too ? If so, could you compare with and without that launch parameter ? Otherwise no big deal, there's other games to test with.

I had a flawless experience playing FF7 remake with older drivers, but I haven't launched it in a while, so I'll give it a try with the latest drivers as well, for good measure (also I haven't played the intergrade chapter yet, so I suppose this'll be a good excuse to play through it :p).

Anyway, I just installed 5085 drivers, and they updated my VBIOS back to 1068 (as expected), and witcher 3 just finished installing, I will report back later.

Edit : well, it's not looking good ... I launched a fresh game of The Witcher 3 (my version is from GOG btw ...), went through 10 minutes of intro video, and just a few seconds after the first 3D-rendered scene appeared (Geralt sleeping on the ground), the game crashed (screenshot).
Here are the logs + dmp file produced by the game : Witcher3_20240117_172151993.zip

Edit 2 : the crash is related to XeSS. I switched to FSR and it's running smoothly. As soon as I enable XeSS from the menu, the game crashes immediately.

vitduck commented 8 months ago

@debjit-das

When the screen goes black during gaming, it could be related to an issue with power supply. So I would recommend you to rule this out first by:

If not, then it is likely that your firmware was corrupted during installation. However, the Intel driver installer only allows firmware upgrades, not downgrades. Since you are already on the latest firmware, the installer will not attempt to flash the VBIOS again when you reinstall the driver.

To this end, you can try flashing the VBIOS yourself. I'd recommend 4953. Hope it helps.

freak2fast4u commented 8 months ago

@debjit-das : Also, if by any chance you pushed past 220W power limit on the card, you WILL need a second cable, not just a Y/splitter cable. By default, a single cable should be enough though. The PCIe slot from the mobo gives you ~75W, and an 8-pin power cable from any of the PSU's "PCIe"-labeled ports should give you 150W (for a total of 225W available to the GPU).
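The arithmetic above can be sketched in a few lines of Python. This only uses the nominal figures quoted in this comment (~75W from the slot, 150W per 8-pin cable); the helper names are made up for illustration, and the assumption that the second cable is also an 8-pin is mine, so check your own card's connectors before relying on the numbers.

```python
# Nominal figures quoted in the comment above; real limits depend on PSU and cabling.
SLOT_WATTS = 75        # PCIe slot on the motherboard ("~75W")
EIGHT_PIN_WATTS = 150  # one 8-pin cable from a "PCIe"-labeled PSU port


def available_watts(eight_pin_cables: int) -> int:
    """Total nominal power available to the GPU for a given number of 8-pin cables."""
    return SLOT_WATTS + EIGHT_PIN_WATTS * eight_pin_cables


def headroom(power_limit_watts: int, eight_pin_cables: int = 1) -> int:
    """Positive means the nominal budget covers the power limit; negative means it does not."""
    return available_watts(eight_pin_cables) - power_limit_watts


print(available_watts(1))  # 225
print(headroom(220))       # 5: barely covered by a single cable
print(headroom(250))       # -25: a raised limit exceeds what one cable nominally supplies
```

A headroom of only 5W at the 220W limit is why raising the power limit is the point where a second cable matters.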

On my side, I managed to play 30 minutes of The Witcher 3 with no crashes (except for that XeSS quirk) and 2 hours of FF7 remake with zero bugs using the 5085 drivers in DX12 mode. As stated above, we have wildly different configurations though, so I hope someone else with a configuration closer to yours can pitch in with some feedback.

debjit-das commented 8 months ago

Thank you so much. I was looking for exactly this. Since there were no recent hardware changes and the system went unstable only after the 5085 driver update, I immediately guessed that the update might have messed up the firmware as rolling back to previous versions did not help. I have some experience in Bios updates for AMD 6970 cards and NVIDIA cards. But, back then things were different and probably too easy.

I also tried disabling XMP, but unfortunately that did not help. Also, my last Windows update was KB5034123 on 10th Jan; my wild guess is that it could not have caused this.

Gonna try to downgrade the vBIOS and hope it will work. I just want to be able to play again. At this moment none of my games work more than 5 mins at max. My current vBIOS version reads 1068, which one would you suggest I should go for?

debjit-das commented 8 months ago

I am not able to play Witcher 3 for more than 10-15 seconds now. GTA V also is the same. Tried playing a DX11 game and did not face any issue for the 2 hours I played. Launched another DX12 title (the Hunter: COTW) and it crashed too but I was able to play for 15 mins the 1st time. Subsequently, it crashed after every 2-3 mins. Unfortunately, I do not have FF7 else could have tried.

debjit-das commented 8 months ago

I guess I spoke too soon. The card finally gave up. I have a blank screen and the BIOS fails to detect the video card anymore. I will try to connect it to a different system later and see if it's the card or my system.

freak2fast4u commented 8 months ago

Ouch, good luck then ...

FWIW, in the event the GPU ends up being "OK" and you still want to proceed with flashing the VBIOS: to choose the right VBIOS files for your card, there are some crucial clues to collect in the firmware installation logs in this folder : C:\Intel\FWUpdateService\ (in particular the choice of SOC1, SOC2 or SOC3). Look for this : (screenshot)

In fact, there may even be some clues in those logs as to whether the flashing failed in the first place.

I used the fw files from driver 3962 at first, based on vitduck's findings, but I was going to test 3975 a little later since 3962 is kind of an odd one (its hash differs from the other versions, at least for SOC1 in my case; I realized this only later on) : (screenshot)

If the VBIOS flash went wrong, I suppose re-flashing the latest available version would suffice. If it's an issue with the VBIOS itself, then you'd need to rollback as described earlier.

debjit-das commented 8 months ago

Here is what I could extract from my logs for 18 Dec update-

[2023/12/18--21:30:5:33] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\FW\dg2_gfx_fwupdate_SOC1.bin
[2023/12/18--21:30:5:34] : File Version :FW Version: DG02->1->3257
[2023/12/18--21:30:5:34] : matched:C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\FW\dg2_gfx_fwupdate_SOC1.bin
[2023/12/18--21:30:5:34] : Firmware code version is equal

[2023/12/18--21:30:5:39] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\OPROM\dg2_c_oprom.rom
File Type : 3
Oprom Type : 2
Given image Version : 20.1068.00
[2023/12/18--21:30:5:63] : The image is compatible with the device
[2023/12/18--21:30:5:63] : Installed version is newer or equal

[2023/12/18--21:30:5:78] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\OPROM\dg2_d_intel_a750_oprom-data.rom
File Type : 4
Oprom Type : 1
Given image Version : 20.1068.00
[2023/12/18--21:30:5:79] : The image is compatible with the device
[2023/12/18--21:30:5:80] : Installed version is newer or equal

[2023/12/18--21:30:5:107] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\FWDATA\dg2_intel_a750_config-data.bin
File Type : 5
File Version : major_version->101-> oem_manuf_data_version->15-> major_vcn->1
[2023/12/18--21:30:5:107] : The image's 4-ID manifest extenstion matched with the device
[2023/12/18--21:30:5:113] : firmware data version is not compatible with the installed one (OEM version)

This means I should use the following in the FW tool, please correct me if I am wrong

FW : dg2_gfx_fwupdate_SOC1.bin
OPROM (Code) : dg2_c_oprom.rom
OPROM (Data) : dg2_d_intel_a750_oprom-data.rom

Now from what my log says, I am confused if I should be using the config data or not, i.e., the "dg2_intel_a750_config-data.bin" file since it also notes that "firmware data version is not compatible with the installed one (OEM version)".

Also, when using the tool to check files, both the dg2_c_oprom.rom and the dg2_d_intel_a750_oprom-data.rom files check out as Oprom-Code and Oprom-Data file. Again confused which should be used as what. I know the guide says to consider c as code file and d as data file.
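To make a log like the one above easier to read, here is a rough Python sketch that lists each file the updater looked at together with the last message logged for it. The sample text is copied from the log excerpt in this comment; `summarize` and the "last message per file is the verdict" rule are my own assumptions about the layout, not something documented by Intel or by the ARC Firmware Tool.

```python
import re
from pathlib import PureWindowsPath

# Timestamp prefix as it appears in the log, e.g. "[2023/12/18--21:30:5:33] : "
ENTRY = re.compile(r"\[(\d{4}/\d{2}/\d{2}--[\d:]+)\] : ")

SAMPLE = r"""
[2023/12/18--21:30:5:33] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\FW\dg2_gfx_fwupdate_SOC1.bin
[2023/12/18--21:30:5:34] : File Version :FW Version: DG02->1->3257
[2023/12/18--21:30:5:34] : matched:C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\FW\dg2_gfx_fwupdate_SOC1.bin
[2023/12/18--21:30:5:34] : Firmware code version is equal
[2023/12/18--21:30:5:39] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\OPROM\dg2_c_oprom.rom File Type : 3 Oprom Type : 2 Given image Version : 20.1068.00
[2023/12/18--21:30:5:63] : The image is compatible with the device
[2023/12/18--21:30:5:63] : Installed version is newer or equal
[2023/12/18--21:30:5:78] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\OPROM\dg2_d_intel_a750_oprom-data.rom File Type : 4 Oprom Type : 1 Given image Version : 20.1068.00
[2023/12/18--21:30:5:79] : The image is compatible with the device
[2023/12/18--21:30:5:80] : Installed version is newer or equal
[2023/12/18--21:30:5:107] : File Path : C:\Windows\System32\DriverStore\FileRepository\iigd_dch_d.inf_amd64_a89e49ec8f7d07fc\FWDATA\dg2_intel_a750_config-data.bin File Type : 5 File Version : major_version->101-> oem_manuf_data_version->15-> major_vcn->1
[2023/12/18--21:30:5:107] : The image's 4-ID manifest extenstion matched with the device
[2023/12/18--21:30:5:113] : firmware data version is not compatible with the installed one (OEM version)
"""


def summarize(log_text: str) -> dict[str, str]:
    """Map each file name to the last message logged after its 'File Path' entry."""
    # re.split with one capture group yields [prefix, stamp, msg, stamp, msg, ...],
    # so this works whether entries sit one per line or run together on one line.
    messages = [m.strip() for m in ENTRY.split(log_text)[2::2]]
    verdicts: dict[str, str] = {}
    current = None
    for msg in messages:
        found = re.match(r"File Path : (\S+)", msg)
        if found:
            current = PureWindowsPath(found.group(1)).name
            verdicts[current] = ""
        elif current is not None:
            verdicts[current] = msg  # assumption: the last message is the outcome
    return verdicts


for name, verdict in summarize(SAMPLE).items():
    print(f"{name}: {verdict}")
```

Run against this excerpt, the SOC1 firmware and both OPROM files end on an "equal" or "newer or equal" message, and only the config-data file ends on the "not compatible" message asked about above.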

Appreciate any help and sorry if I'm turning out dumber than you expected.

debjit-das commented 8 months ago

I am planning to use safe mode while flashing the vBIOS. Is that okay, or is a normal startup fine too?

freak2fast4u commented 8 months ago

@debjit-das : allow me ... "I had zero expectations yet you managed to disappoint me" ... /jk

Nah bro, you are showing a healthy amount of diligence, agency, willingness to communicate, courtesy and your analysis is spot on. If you think that's being dumb, then let's be dumb together ;)

Back on topic, about the fw data I have exactly the same message, so I don't think it should be a problem.

Using safe mode wasn't necessary for me (after all, the Intel Arc driver installer doesn't care and proceeds in normal mode anyway), but now that you mention it, maybe I was a little careless and got lucky. It might be a sensible precaution. Anyhow, double-checking the flash log inside the ARC Firmware Tool to make sure flashing hasn't timed out is crucial imho. I wouldn't want to bet on rebooting if it timed out, and I would just re-flash until I get a clean 100%.

Good luck !

debjit-das commented 8 months ago

@freak2fast4u Thank you so much bro. You have been really patient with me and have been really helpful.

I had been unable to get any signal from the card for the last 2 hrs or so. I don't know what came over me, but I did a BIOS reset in frustration. Lo and behold, the screen came back to life. I went straight to safe mode and tried to flash the vBIOS, but now the tool fails to acknowledge my Intel GPU: it says no supporting hardware found. I went into normal mode, and Device Manager says it's a Microsoft Basic Display Adapter. I went back to safe mode to use DDU to flush all drivers, then restarted to install the 4972 driver in normal mode. The driver installation was halfway through when the screen started to blink, and from there I again got No Signal on my display.

In safe mode or normal mode, strangely the flash tool fails to recognise the card. HW Scan in the tool says no supported adapters found and even if I select the fw files and start flash, it says the same.

freak2fast4u commented 8 months ago

Aww man, we're a couple users that went through this "installing driver leads to a black screen" shenanigan. See this thread : https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT/issues/641

Also you're very welcome ! I'm just a regular Intel Arc user that likes to fiddle around with computers in general, I'm happy to help :p

I somehow managed to work around this problem by re-enabling my iGPU (I suppose, I'm not even certain tbh). It seems Windows can't initialize the new driver sometimes, and needs a "fallback" display to continue (just interpreting based on system behavior, I can't do much more). I thought it was a 5xxx+ driver series issue, but now that you have it on 4972 it's starting to look random.

What happens if you plug the screen into the motherboard while installing the Arc driver ? Is that an option for you ? I noticed Windows will automatically pass-through whatever GPU's output to whatever GPU's port a screen is connected to, so it's worth a try.

Another idea I had, was to maybe install the driver alone without Arc Control, then reboot, and then install Arc Control (no need to reboot for Arc Control).

Oh, and some guy from YouTube (GraphicArc) mentioned on Discord that disabling XMP/DOCP would sometimes help in unlocking driver installation specifically; then you can turn it back on again.

freak2fast4u commented 8 months ago

Heh, I almost got myself into trouble here, this is what I was talking about (timeout) : (screenshot of the ARC Firmware Tool, 2024-01-18)

I just clicked on "Flash" a second time and it went through in less than a minute without errors (except the usual fwdata mismatch at the end).

debjit-das commented 8 months ago

Dang, meanwhile I got mine back and was somehow able to install the 5081 version on it. I don't know why the 4xxx versions didn't want to load: the screen went blank and there was no display.

But with 5081 the screen went blank too, and after a hard reset Windows failed to show the login screen twice. Third time was the charm, and right now I've been playing Witcher 3 for the last 15 mins without crashing. Fingers crossed.

Are you trying to downgrade back to vBIOS 1064 again?

debjit-das commented 8 months ago

I somehow managed to work around this problem by re-enabling my iGPU (I suppose, I'm not even certain tbh).

My biggest issue is that I do not have an iGPU. Don't ask me how many hairs I have lost trying to use the 1050Ti I have lying around. Sometimes the mobo does not detect the Arc card, sometimes it refuses the NVIDIA card. On top of that, my watercooling pipes are a mess when trying to fiddle around with 2 GPUs.

debjit-das commented 8 months ago

XMP is what helped me I guess. Resetting the bios turned off XMP and that might have been the solution to a successful driver install in the end. I'm yet to enable it though, more likely I'm afraid to do so. Haha

freak2fast4u commented 8 months ago

Funny how your CPU indeed doesn't have an iGPU but Intel will gladly direct you to download Intel Xe drivers : https://www.intel.fr/content/www/fr/fr/products/sku/212271/intel-core-i511400f-processor-12m-cache-up-to-4-40-ghz/downloads.html ...

TIL not all Intel CPUs have an iGPU, I thought that was an AMD-only thing (Ryzen 1xxx-5xxx non-G variants anyway).

On my end I will be sticking to VBIOS 1064 for now as I triple boot windows 10, linux mint, and manjaro, and on VBIOS 1068 I can't reboot into linux without horrible artifacting and having the desktop environment freeze up before I can even open a session. I'll have to report this to the mesa team at some point since Linux-only users won't ever have their VBIOS updated and will never notice this is a problem. It would take a multi-booter like me to notice. Surely I'm not alone but I'm amazed this has been going on for months with no fix in sight.

Glad you got it up and working again anyway ! (that sounded different in my mind I swear ... lol) That's a massive win, especially considering where you were standing a few hours ago :o

This thread will surely benefit others so thanks for reporting !

Happy hunting !

debjit-das commented 8 months ago

Thanks a lot mate, you were very helpful, kind and generous. Just standing together with a helping hand is what matters in coming out victorious. I'm glad I posted here, was heard, and was pointed in a very good direction.

I hope you will soon have a driver that won't mess up your multiboot. I trust Intel will answer our prayers soon. Kudos. Gracias. Thank you. Cheers