Closed: csuji closed this issue 5 years ago
Hi. Is the hang a hard hang, or can the process be killed? Also, please provide the output of the following commands during the hang:

    dmesg | grep amdgpu
    dmesg | grep amdkfd
In the meantime I upgraded to ROCm 2.0 and rebuilt PyTorch 1.0.0a0+017503c. Same problem:

    amdgpu, 2.0-89, 4.15.0-43-generic, x86_64: installed

The GPU fan gets loud while the console shows only the output of the first iteration. The process can still be killed with kill PID (Ctrl-C does not work). So is some calculation still running?!
Output of dmesg|grep amdgpu:
    [ 1.131352] [drm] amdgpu kernel modesetting enabled.
    [ 1.131353] [drm] amdgpu version: 19.10.0.418
    [ 1.132585] fb: switching to amdgpudrmfb from VESA VGA
    [ 1.132922] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
    [ 1.132959] amdgpu 0000:03:00.0: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
    [ 1.132960] amdgpu 0000:03:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
    [ 1.132961] amdgpu 0000:03:00.0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
    [ 1.133056] [drm] amdgpu: 16368M of VRAM memory ready
    [ 1.133056] [drm] amdgpu: 16368M of GTT memory ready.
    [ 1.678594] fbcon: amdgpudrmfb (fb0) is primary device
    [ 1.678642] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
    [ 1.696185] amdgpu 0000:03:00.0: ring gfx uses VM inv eng 4 on hub 0
    [ 1.696186] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 5 on hub 0
    [ 1.696187] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 6 on hub 0
    [ 1.696187] amdgpu 0000:03:00.0: ring comp_1.2.0 uses VM inv eng 7 on hub 0
    [ 1.696188] amdgpu 0000:03:00.0: ring comp_1.3.0 uses VM inv eng 8 on hub 0
    [ 1.696189] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 9 on hub 0
    [ 1.696190] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 10 on hub 0
    [ 1.696190] amdgpu 0000:03:00.0: ring comp_1.2.1 uses VM inv eng 11 on hub 0
    [ 1.696191] amdgpu 0000:03:00.0: ring comp_1.3.1 uses VM inv eng 12 on hub 0
    [ 1.696192] amdgpu 0000:03:00.0: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
    [ 1.696192] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 4 on hub 1
    [ 1.696193] amdgpu 0000:03:00.0: ring page0 uses VM inv eng 5 on hub 1
    [ 1.696194] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 6 on hub 1
    [ 1.696194] amdgpu 0000:03:00.0: ring page1 uses VM inv eng 7 on hub 1
    [ 1.696195] amdgpu 0000:03:00.0: ring uvd<0> uses VM inv eng 8 on hub 1
    [ 1.696196] amdgpu 0000:03:00.0: ring uvd_enc0<0> uses VM inv eng 9 on hub 1
    [ 1.696197] amdgpu 0000:03:00.0: ring uvd_enc1<0> uses VM inv eng 10 on hub 1
    [ 1.696198] amdgpu 0000:03:00.0: ring vce0 uses VM inv eng 11 on hub 1
    [ 1.696198] amdgpu 0000:03:00.0: ring vce1 uses VM inv eng 12 on hub 1
    [ 1.696199] amdgpu 0000:03:00.0: ring vce2 uses VM inv eng 13 on hub 1
    [ 1.696739] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:03:00.0 on minor 0
No output for dmesg | grep amdkfd.
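For anyone collecting the same information, the two dmesg checks requested in this thread could be combined into one small Python helper. This is only a sketch: the names `filter_kernel_log` and `read_dmesg` are hypothetical and not part of ROCm or PyTorch, and reading the kernel log may need elevated privileges on some systems.

```python
import subprocess


def filter_kernel_log(log_text, keywords=("amdgpu", "amdkfd")):
    """Group kernel log lines by the driver keyword they mention.

    Equivalent in spirit to running `dmesg | grep <keyword>` once per keyword.
    A keyword with no matching lines maps to an empty list.
    """
    matches = {keyword: [] for keyword in keywords}
    for line in log_text.splitlines():
        for keyword in keywords:
            if keyword in line:
                matches[keyword].append(line)
    return matches


def read_dmesg():
    """Return the current kernel log as text (hypothetical helper)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `filter_kernel_log(read_dmesg())` while the process is hung returns both result sets at once, so an empty `amdkfd` list is visible next to the `amdgpu` lines.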
@csuji Can you test with our ROCm 2.1 docker images - we fixed a hang bug in that release. Thanks!
@iotamudelta Ok, thanks! Tested with docker image rocm/pytorch:rocm2.1_ubuntu16.04_pytorch_gfx900. Works now!
🐛 Bug
Executing these examples from the PyTorch tutorial (https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-tensors, https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions) hangs after the first iteration. Others, like this one, work: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-tensors-and-autograd
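When a script hangs like this but still responds to kill, one way to see which Python line it is stuck on is the standard-library `faulthandler` module. The following is a minimal sketch, not something used in this report: the helper names and the 60-second timeout are assumptions, and `SIGUSR1` is only available on Unix-like systems. If the hang is inside native GPU code, the dump shows only the Python frame that called into it.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=60.0, stream=sys.stderr):
    """Arm stack dumps for a script that may hang (hypothetical helper).

    `stream` must be a real file with a file descriptor, because
    faulthandler writes to it directly from a C-level handler.
    """
    # Dump tracebacks on fatal signals such as SIGSEGV.
    faulthandler.enable(file=stream)
    # `kill -USR1 <PID>` dumps all thread stacks without ending the process.
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    # Also dump the stacks every `timeout_s` seconds while the script runs.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_diagnostics():
    """Undo everything enable_hang_diagnostics() set up."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.unregister(signal.SIGUSR1)
    faulthandler.disable()
```

Calling `enable_hang_diagnostics()` at the top of the tutorial script and then running `kill -USR1 PID` during the hang prints the current Python stack. This works even when Ctrl-C does not, because Python-level signal handlers only run between bytecode instructions, while the `faulthandler` handler runs as soon as the signal arrives.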
Environment