Error GFX engine hang detected during world switch

dhscq commented 5 years ago

Hi everyone,

I have 2 VMs running on an S7150v GPU. When I destroy one VM, the GIM kernel module logs the warning 'GFX engine hang detected', followed by several errors, as shown below:

kernel: gim debug:(idle_vf:1781) submit IDLE_GPU command to ADP 1, VF 1,#011RPTR = 0x00005488, WPTR = 0x00000000
kernel: gim debug:(wait_cmd_complete:1646) Cmd/Status @ 4 = 0x20100507
kernel: gim debug:(wait_cmd_complete:1648) RLC_GPM_STAT = 0x00000017 before waiting
kernel: gim warning:(wait_cmd_complete:1671) GFX engine hang detected
kernel: gim error:(wait_cmd_complete:1681) wait_cmd_complete -- time out after 0.101896997 sec
kernel: gim error:(wait_cmd_complete:1688) Cmd = 0x1, Status = 0xe
kernel: gim error:(dump_gpu_status:1420) dump gpu status begin for struct adapter 4:00.00
kernel: gim info:(check_base_addrs:1408) CP_MQD_BASE_ADDR = 0xf4:0f9ff000
kernel: gim error:(dump_gpu_status:1427) CP Ring buffer is not empty,
kernel: gim error:(dump_gpu_status:1428) RPTR = 0x00005488, WPTR = 0x00000000
kernel: gim error:(dump_gpu_status:1430) When IDLE_GPU was sent RPTR = 0x00005488,#011WPTR = 0x00000000
kernel: gim warning:(ring_is_empty:1272) CP_RB_WPTR (0x00000000) != CP_RB_RPTR (0x00005488)
kernel: gim error:(dump_gpu_status:1434) At least one ring is active
kernel: gim error:(dump_gpu_status:1457) mmGRBM_STATUS = 0xa0003028
kernel: gim error:(dump_gpu_status:1460) mmGRBM_STATUS2 = 0x71000008
kernel: gim error:(dump_gpu_status:1463) mmSRBM_STATUS = 0x20020040
kernel: gim error:(dump_gpu_status:1466) mmSRBM_STATUS2 = 0x0
kernel: gim error:(dump_gpu_status:1469) mmSDMA0_STATUS_REG = 0x46deed57
kernel: gim error:(dump_gpu_status:1472) mmSDMA1_STATUS_REG = 0x46deed57
kernel: gim error:(dump_gpu_status:1486) CP busy
kernel: gim error:(dump_gpu_status:1491) RLC busy
kernel: gim error:(dump_gpu_status:1521) CP busy
kernel: gim error:(dump_gpu_status:1563) CP_CPF_STATUS = 0xb4000223
kernel: gim error:(dump_gpu_status:1565) The write pointer has been updated and
kernel: gim error:(dump_gpu_status:1566) the initiated work is still being processed
kernel: gim error:(dump_gpu_status:1567) by the GFX pipe
kernel: gim info:(check_me_cntl:1396) ME/PFP/CE running GPU dump
kernel: gim error:(dump_gpu_status:1583) CP_CPF_BUSY_STAT = 0x00000002
kernel: gim error:(dump_gpu_status:1588) dump gpu status end
kernel: gim error:(world_switch:3005) Schedule VF1 to VF1 failed;Failure reason is 6, try to reset
kernel: gim info:(gim_notify_reset_per_vf:4143) Notify reset to VF1
kernel: gim info:(mailbox_update_index:836) write mmMAILBOX_INDEX: 0x1
kernel: gim info:(mailbox_notify_flr:978) write mmMAILBOX_MSGBUF_TRN_DW0: 0x2
kernel: gim info:(mailbox_notify_flr:986) write mmMAILBOX_CONTROL: 0x1
kernel: gim info:(kcl_thread_sleep:135) wait 10.000ms

Any ideas on how to handle this error? :)

dhscq commented 5 years ago

I forgot to list the environment:
Host: CentOS 7.4 with kernel 4.14.37-4.el7.x86_64
Hypervisor: qemu-kvm-ev-2.6.0-28 :)

vigchand2705 commented 5 years ago

This is expected when a VM is destroyed, because neither the VF nor the host driver is notified that the VM no longer exists. The host driver will perform a Function Level Reset on this VF to recover from the hang.

These errors can be safely ignored if the reset is successful, i.e. the VM can be started again and is able to run graphics applications etc.
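
For anyone who wants to check their own logs for this pattern, here is a minimal Python sketch that pairs each 'GFX engine hang detected' warning with the 'Notify reset to VF' message that follows it. The regular expressions are based only on the log lines posted in this thread, the function names (`parse_gim_line`, `summarize_hangs`) are made up for illustration, and other GIM versions may format their messages differently. A reset notice only shows that the host driver started a reset. It does not prove the reset succeeded, so the real test is still whether the VM starts again and can run graphics applications.

```python
import re

# Matches GIM messages as they appear in this thread, for example:
#   kernel: gim warning:(wait_cmd_complete:1671) GFX engine hang detected
GIM_LINE = re.compile(
    r"gim (?P<level>\w+):\((?P<func>\w+):(?P<line>\d+)\)\s*(?P<msg>.*)"
)
SCHEDULE_FAIL = re.compile(r"Schedule VF(?P<src>\d+) to VF(?P<dst>\d+) failed")
RESET_NOTICE = re.compile(r"Notify reset to VF(?P<vf>\d+)")


def parse_gim_line(line):
    """Return (level, func, msg) for a GIM log line, or None for other lines."""
    match = GIM_LINE.search(line)
    if match is None:
        return None
    return match.group("level"), match.group("func"), match.group("msg").strip()


def summarize_hangs(lines):
    """Pair each hang warning with the world-switch failure and reset notice after it."""
    events = []
    pending = None  # the hang we are currently collecting details for
    for line in lines:
        parsed = parse_gim_line(line)
        if parsed is None:
            continue
        _level, _func, msg = parsed
        if "GFX engine hang detected" in msg:
            pending = {"failed_vf": None, "reset_vf": None}
            events.append(pending)
            continue
        if pending is None:
            continue
        failed = SCHEDULE_FAIL.search(msg)
        if failed:
            pending["failed_vf"] = int(failed.group("dst"))
        reset = RESET_NOTICE.search(msg)
        if reset:
            pending["reset_vf"] = int(reset.group("vf"))
            pending = None  # this hang has its reset notice
    return events


# A few lines copied from the log earlier in this thread.
SAMPLE = [
    "kernel: gim warning:(wait_cmd_complete:1671) GFX engine hang detected",
    "kernel: gim error:(wait_cmd_complete:1688) Cmd = 0x1, Status = 0xe",
    "kernel: gim error:(world_switch:3005) Schedule VF1 to VF1 failed;Failure reason is 6, try to reset",
    "kernel: gim info:(gim_notify_reset_per_vf:4143) Notify reset to VF1",
]

if __name__ == "__main__":
    for event in summarize_hangs(SAMPLE):
        print(event)
```

A hang whose entry still has `reset_vf` set to `None` had no reset notice after it in the lines you passed in, which would be worth a closer look.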

dhscq commented 5 years ago

I've tested this for several days and it has indeed worked safely, as you said. Thanks for your reply, I really appreciate it :)

GPUOpen-LibrariesAndSDKs / MxGPU-Virtualization
