IGCIT / Intel-GPU-Community-Issue-Tracker-IGCIT

IGCIT is a Community-driven issue tracker for Intel GPUs.
GNU General Public License v3.0

RDP connections to Windows machine with Arc Graphics card occasionally crash the system #847

Open yetdragon opened 3 weeks ago

yetdragon commented 3 weeks ago

Checklist [README]

Application [Required]

Remote Desktop (RDP)

Processor / Processor Number [Required]

13500

Graphic Card [Required]

Arc A770 LE

GPU Driver Version [Required]

32.0.101.5972

Other GPU Driver version

No response

Rendering API [Required]

Windows Build [Required]

Windows 11 23H2

Other Windows build

No response

Intel System Support Utility report

igcit_ssu.txt

Description and steps to reproduce [Required]

Related to #577.

Connecting over RDP to a Windows computer with an Arc graphics card occasionally crashes the system and forces a reboot. I could not identify a definitive pattern, but it tends to happen when I connect after the computer has been running for some time since boot, though I'm not certain.

The issue seems to be an Intel graphics driver initialization error. I attached Windows logs below.

Since I'm not physically present with the computer, I can't update the driver, and attempting to do so through RDP has caused trouble in the past (I'm confident the issue would persist even after a driver update).

Device / Platform

No response

Crash dumps [Required, if applicable]

compressed_dumps.zip

Application / Windows logs

This is the error from the System log in Windows Event Viewer:

The computer has rebooted from a bugcheck. The bugcheck was: 0x00000116 (0xffff9784506bb050, 0xfffff80582541e60, 0x0000000000000000, 0x000000000000000d). A dump was saved in: C:\Windows\Minidump\092424-14390-01.dmp. Report Id: 0b6b3678-309d-4913-a200-3171cab318d
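For anyone collecting several of these entries, the bugcheck code and its four parameters can be pulled out of the event text mechanically. This is a minimal Python sketch, assuming the message keeps the `code (arg1, arg2, arg3, arg4)` shape shown above; `parse_bugcheck` and `KNOWN_BUGCHECKS` are illustrative names, and the lookup table only contains the one code discussed in this issue.

```python
import re

# Bugcheck text copied from the event log entry above.
EVENT_TEXT = (
    "0x00000116 (0xffff9784506bb050, 0xfffff80582541e60, "
    "0x0000000000000000, 0x000000000000000d)"
)

# Only the code seen in this issue; the name comes from the WinDbg output.
KNOWN_BUGCHECKS = {0x116: "VIDEO_TDR_FAILURE"}


def parse_bugcheck(text):
    """Return (code, [arg1..arg4]) parsed from a 'code (a, b, c, d)' string."""
    match = re.search(r"(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", text)
    if match is None:
        raise ValueError("no bugcheck pattern found")
    code = int(match.group(1), 16)
    args = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, args


code, args = parse_bugcheck(EVENT_TEXT)
print(KNOWN_BUGCHECKS.get(code, "unknown"), [hex(a) for a in args])
```

The second parameter is the one WinDbg later describes as the pointer into the responsible driver module.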

I opened the minidump file named in the event log with WinDbg. It reports VIDEO_TDR_FAILURE (116): "Attempt to reset the display driver and recover from timeout failed." It seems the Intel graphics driver sometimes fails to initialize on RDP connection for some reason.

************* Preparing the environment for Debugger Extensions Gallery repositories **************
   ExtensionRepository : Implicit
   UseExperimentalFeatureForNugetShare : true
   AllowNugetExeUpdate : true
   NonInteractiveNuget : true
   AllowNugetMSCredentialProviderInstall : true
   AllowParallelInitializationOfLocalRepositories : true
   EnableRedirectToChakraJsProvider : false

   -- Configuring repositories
      ----> Repository : LocalInstalled, Enabled: true
      ----> Repository : UserExtensions, Enabled: true

>>>>>>>>>>>>> Preparing the environment for Debugger Extensions Gallery repositories completed, duration 0.000 seconds

************* Waiting for Debugger Extensions Gallery to Initialize **************

>>>>>>>>>>>>> Waiting for Debugger Extensions Gallery to Initialize completed, duration 0.031 seconds
   ----> Repository : UserExtensions, Enabled: true, Packages count: 0
   ----> Repository : LocalInstalled, Enabled: true, Packages count: 42

Microsoft (R) Windows Debugger Version 10.0.27704.1001 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Users\<USERNAME>\Downloads\092424-14390-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is: 
Windows 10 Kernel Version 22621 MP (20 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Kernel base = 0xfffff805`55e00000 PsLoadedModuleList = 0xfffff805`56a134f0
Debug session time: Tue Sep 24 07:39:24.362 2024 (UTC + 9:00)
System Uptime: 0 days 1:37:34.297
Loading Kernel Symbols
...............................................................
................................................................
................................................................
......................................
Loading User Symbols

Loading unloaded module list
....................
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff805`56215cb0 48894c2408      mov     qword ptr [rsp+8],rcx ss:0018:fffffc81`0ad4f700=0000000000000116
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffff9784506bb050, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80582541e60, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 000000000000000d, Optional internal context dependent data.

Debugging Details:
------------------

Unable to load image igdkmdnd64.sys, Win32 error 0n2
*** WARNING: Unable to verify timestamp for igdkmdnd64.sys

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 1156

    Key  : Analysis.Elapsed.mSec
    Value: 9606

    Key  : Analysis.IO.Other.Mb
    Value: 3

    Key  : Analysis.IO.Read.Mb
    Value: 0

    Key  : Analysis.IO.Write.Mb
    Value: 24

    Key  : Analysis.Init.CPU.mSec
    Value: 281

    Key  : Analysis.Init.Elapsed.mSec
    Value: 39034

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 107

    Key  : Analysis.Version.DbgEng
    Value: 10.0.27704.1001

    Key  : Analysis.Version.Description
    Value: 10.2408.27.01 amd64fre

    Key  : Analysis.Version.Ext
    Value: 1.2408.27.1

    Key  : Bugcheck.Code.LegacyAPI
    Value: 0x116

    Key  : Bugcheck.Code.TargetModel
    Value: 0x116

    Key  : Dump.Attributes.AsUlong
    Value: 1808

    Key  : Dump.Attributes.DiagDataWrittenToHeader
    Value: 1

    Key  : Dump.Attributes.ErrorCode
    Value: 0

    Key  : Dump.Attributes.KernelGeneratedTriageDump
    Value: 1

    Key  : Dump.Attributes.LastLine
    Value: Dump completed successfully.

    Key  : Dump.Attributes.ProgressPercentage
    Value: 0

    Key  : Failure.Bucket
    Value: 0x116_IMAGE_igdkmdnd64.sys

    Key  : Failure.Hash
    Value: {7eb0cc99-c85b-4092-6430-8f1db059b7c1}

    Key  : Hypervisor.Enlightenments.ValueHex
    Value: 1417df84

    Key  : Hypervisor.Flags.AnyHypervisorPresent
    Value: 1

    Key  : Hypervisor.Flags.ApicEnlightened
    Value: 0

    Key  : Hypervisor.Flags.ApicVirtualizationAvailable
    Value: 1

    Key  : Hypervisor.Flags.AsyncMemoryHint
    Value: 0

    Key  : Hypervisor.Flags.CoreSchedulerRequested
    Value: 0

    Key  : Hypervisor.Flags.CpuManager
    Value: 1

    Key  : Hypervisor.Flags.DeprecateAutoEoi
    Value: 1

    Key  : Hypervisor.Flags.DynamicCpuDisabled
    Value: 1

    Key  : Hypervisor.Flags.Epf
    Value: 0

    Key  : Hypervisor.Flags.ExtendedProcessorMasks
    Value: 1

    Key  : Hypervisor.Flags.HardwareMbecAvailable
    Value: 1

    Key  : Hypervisor.Flags.MaxBankNumber
    Value: 0

    Key  : Hypervisor.Flags.MemoryZeroingControl
    Value: 0

    Key  : Hypervisor.Flags.NoExtendedRangeFlush
    Value: 0

    Key  : Hypervisor.Flags.NoNonArchCoreSharing
    Value: 1

    Key  : Hypervisor.Flags.Phase0InitDone
    Value: 1

    Key  : Hypervisor.Flags.PowerSchedulerQos
    Value: 0

    Key  : Hypervisor.Flags.RootScheduler
    Value: 0

    Key  : Hypervisor.Flags.SynicAvailable
    Value: 1

    Key  : Hypervisor.Flags.UseQpcBias
    Value: 0

    Key  : Hypervisor.Flags.Value
    Value: 21631230

    Key  : Hypervisor.Flags.ValueHex
    Value: 14a10fe

    Key  : Hypervisor.Flags.VpAssistPage
    Value: 1

    Key  : Hypervisor.Flags.VsmAvailable
    Value: 1

    Key  : Hypervisor.RootFlags.AccessStats
    Value: 1

    Key  : Hypervisor.RootFlags.CrashdumpEnlightened
    Value: 1

    Key  : Hypervisor.RootFlags.CreateVirtualProcessor
    Value: 1

    Key  : Hypervisor.RootFlags.DisableHyperthreading
    Value: 0

    Key  : Hypervisor.RootFlags.HostTimelineSync
    Value: 1

    Key  : Hypervisor.RootFlags.HypervisorDebuggingEnabled
    Value: 0

    Key  : Hypervisor.RootFlags.IsHyperV
    Value: 1

    Key  : Hypervisor.RootFlags.LivedumpEnlightened
    Value: 1

    Key  : Hypervisor.RootFlags.MapDeviceInterrupt
    Value: 1

    Key  : Hypervisor.RootFlags.MceEnlightened
    Value: 1

    Key  : Hypervisor.RootFlags.Nested
    Value: 0

    Key  : Hypervisor.RootFlags.StartLogicalProcessor
    Value: 1

    Key  : Hypervisor.RootFlags.Value
    Value: 1015

    Key  : Hypervisor.RootFlags.ValueHex
    Value: 3f7

BUGCHECK_CODE:  116

BUGCHECK_P1: ffff9784506bb050

BUGCHECK_P2: fffff80582541e60

BUGCHECK_P3: 0

BUGCHECK_P4: d

FILE_IN_CAB:  092424-14390-01.dmp

TAG_NOT_DEFINED_202b:  *** Unknown TAG in analysis list 202b

DUMP_FILE_ATTRIBUTES: 0x1808
  Kernel Generated Triage Dump

FAULTING_THREAD:  ffff978437b4f0c0

VIDEO_TDR_CONTEXT: dt dxgkrnl!_TDR_RECOVERY_CONTEXT ffff9784506bb050
Symbol dxgkrnl!_TDR_RECOVERY_CONTEXT not found.

PROCESS_OBJECT: 000000000000000d

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  System

STACK_TEXT:  
fffffc81`0ad4f6f8 fffff805`5b0bb00e     : 00000000`00000116 ffff9784`506bb050 fffff805`82541e60 00000000`00000000 : nt!KeBugCheckEx
fffffc81`0ad4f700 fffff805`5b0ba699     : fffff805`82541e60 ffff9784`506bb050 fffffc81`0ad4f819 00000000`0000050c : dxgkrnl!TdrBugcheckOnTimeout+0xfe
fffffc81`0ad4f740 fffff805`9db27bf6     : 00000000`0000050c ffff9784`2d5b65c8 ffff9784`2d5b65d0 ffff9784`2d5b65d8 : dxgkrnl!TdrIsRecoveryRequired+0x1b9
fffffc81`0ad4f770 fffff805`9dbb3d99     : ffff9784`3dee9000 00000000`00000000 ffff9784`3dee9000 00000000`00000000 : dxgmms2!VidSchiReportHwHang+0x5fe
fffffc81`0ad4f880 fffff805`9db85919     : 00000000`00000000 00000000`00000000 00000000`0005b619 00000000`00989680 : dxgmms2!VidSchiCheckHwProgress+0x2e459
fffffc81`0ad4f900 fffff805`9dae6ae1     : 00000000`00000000 ffff9784`3dee9000 fffffc81`0ad4fa39 00000000`00000000 : dxgmms2!VidSchiWaitForSchedulerEvents+0x389
fffffc81`0ad4f9d0 fffff805`9db9a405     : ffff9784`37d86000 ffff9784`3dee9000 ffff9784`37d86060 ffff9784`370098a0 : dxgmms2!VidSchiScheduleCommandToRun+0x291
fffffc81`0ad4faa0 fffff805`9db9a37a     : 00000000`00000000 fffff805`9db9a2b0 ffff9784`3dee9000 00000000`00050246 : dxgmms2!VidSchiRun_PriorityTable+0x35
fffffc81`0ad4faf0 fffff805`56154d07     : ffff9784`37b4f0c0 fffff805`00000001 ffff9784`3dee9000 006fe47f`b19bbdff : dxgmms2!VidSchiWorkerThread+0xca
fffffc81`0ad4fb30 fffff805`5621ae24     : ffffd580`86851180 ffff9784`37b4f0c0 fffff805`56154cb0 cccccccc`cccccccc : nt!PspSystemThreadStartup+0x57
fffffc81`0ad4fb80 00000000`00000000     : fffffc81`0ad50000 fffffc81`0ad49000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34

SYMBOL_NAME:  igdkmdnd64+11e60

MODULE_NAME: igdkmdnd64

IMAGE_NAME:  igdkmdnd64.sys

STACK_COMMAND:  .process /r /p 0xffff9784238f3040; .thread 0xffff978437b4f0c0 ; kb

FAILURE_BUCKET_ID:  0x116_IMAGE_igdkmdnd64.sys

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {7eb0cc99-c85b-4092-6430-8f1db059b7c1}

Followup:     MachineOwner
---------
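
As a rough cross-check of the analysis above, the offset in SYMBOL_NAME (igdkmdnd64+11e60) can be subtracted from Arg2 to estimate where igdkmdnd64.sys was loaded. The Python sketch below does that arithmetic. The resulting base address is inferred, not read from the dump (the module list is not shown here), and the variable names are illustrative.

```python
# Values copied from the !analyze -v output above.
ARG2 = 0xfffff80582541e60   # BUGCHECK_P2: pointer into the responsible driver
SYMBOL_OFFSET = 0x11e60     # from SYMBOL_NAME: igdkmdnd64+11e60

# Inferred load address of igdkmdnd64.sys (an assumption; confirm with `lm` in WinDbg).
inferred_base = ARG2 - SYMBOL_OFFSET

# Driver images are mapped on page boundaries, so the low 12 bits should be zero.
page_aligned = (inferred_base & 0xFFF) == 0

print(f"inferred base: {inferred_base:#x}, page aligned: {page_aligned}")
```

If the inferred base is page aligned, that is consistent with Arg2 pointing inside the Intel kernel-mode driver, which is what the failure bucket 0x116_IMAGE_igdkmdnd64.sys already suggests.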
Vivek-Intel commented 1 week ago

@yetdragon thanks for reporting the issue. I see that issue #577 is marked as related. We have tried to reproduce this issue multiple times on different Arc systems, and my peer (on issue #577) also tried multiple systems and configurations, but we could not reproduce it even once.

I would like to know whether any background applications are running on your system, or whether you are using any specific settings (multiple monitors, resolution, specific steps after reboot, or specific startup apps) in your remote setup, so we can try them and check whether the issue reproduces on our end.

yetdragon commented 1 week ago

I've tried to mitigate the issue in the past by unplugging every monitor and turning off every background application in the system tray, which made my machine virtually a headless server. However, it still crashed from time to time, so it must be some kind of application issue. I can't think of any graphics-related program that would cause this other than Intel ones, like Arc Control Panel or Intel Graphics Command Center.

Maybe it's some kind of idle mode power issue. I did enable ASPM and set the Link State Power Management to maximum. Other than that, I can't think of anything else.
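
To test the power-management theory, the Link State Power Management setting could be switched off and the machine left to idle before the next RDP attempt. The Python sketch below only prints the `powercfg` commands that would do this; it does not run them. It assumes the `SUB_PCIEXPRESS` and `ASPM` aliases and the 0/1/2 index values apply on this machine, which should be verified with `powercfg /aliases` and `powercfg /query` first. `aspm_commands` and `ASPM_LEVELS` are hypothetical helper names.

```python
# Index values assumed for the PCI Express "Link State Power Management" setting.
ASPM_LEVELS = {"off": 0, "moderate": 1, "maximum": 2}


def aspm_commands(level):
    """Return the powercfg command lines that would set ASPM for the active plan."""
    index = str(ASPM_LEVELS[level])
    return [
        # Plugged-in (AC) value.
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT", "SUB_PCIEXPRESS", "ASPM", index],
        # On-battery (DC) value; harmless on a desktop.
        ["powercfg", "/setdcvalueindex", "SCHEME_CURRENT", "SUB_PCIEXPRESS", "ASPM", index],
        # Re-apply the current plan so the new values take effect.
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


for cmd in aspm_commands("off"):
    print(" ".join(cmd))
```

Passing "maximum" instead of "off" prints the commands to restore the setting described above.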

I'll let you know if I find anything more.

Vivek-Intel commented 1 week ago

Thanks. Meanwhile, I will try a few more times with the headless configuration and with the power settings toggled.