ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

[Pytorch] Tensor tutorial examples hang #333

Closed. csuji closed this issue 5 years ago.

csuji commented 5 years ago

🐛 Bug

Trying to execute these examples from the PyTorch tutorial (https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-tensors, https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-defining-new-autograd-functions) hangs after the first iteration. Others, like https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-tensors-and-autograd, work.

# -*- coding: utf-8 -*-

import torch

dtype = torch.float
device = torch.device("cuda:0")  # run on the GPU (ROCm exposes the device through the cuda API)

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
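
Since HIP/CUDA kernel launches are asynchronous and the only synchronization point in this loop is the loss.item() call inside print(t, loss), a hang "after the first iteration" most likely means the next iteration is blocking on a kernel that never completes. Below is a minimal sketch (my own narrowing-down attempt, not part of the tutorial) that forces a device sync after each operation so the blocking kernel can be identified; it assumes the ROCm build routes HIP devices through the usual torch.cuda API, as the repro above already does.

import torch

dtype = torch.float
device = torch.device("cuda:0")

N, D_in, H = 64, 1000, 100
x = torch.randn(N, D_in, device=device, dtype=dtype)
w1 = torch.randn(D_in, H, device=device, dtype=dtype)

for t in range(5):
    h = x.mm(w1)
    torch.cuda.synchronize()   # blocks here -> the GEMM kernel never finishes
    h_relu = h.clamp(min=0)
    torch.cuda.synchronize()   # blocks here -> the clamp kernel never finishes
    print(t, "iteration finished")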

Environment

- PyTorch built on top of the docker image rocm/pytorch:rocm1.9.2
- PyTorch version: 1.0.0a0+ee1f7b8
- OS: Ubuntu 18.04.1 LTS
- GPU: Vega FE (amdgpu, 1.9-307, 4.15.0-42-generic, x86_64: installed)
- CMake version: 3.6.3
- Python version: 2.7

jedwards-AMD commented 5 years ago

Hi. Is the hang a hard hang, or can the process be killed? Also, please provide the output of the following commands during the hang:

dmesg | grep amdgpu
dmesg | grep amdkfd

csuji commented 5 years ago

In the meantime I upgraded to ROCm 2.0 and rebuilt PyTorch (1.0.0a0+017503c); same problem: amdgpu, 2.0-89, 4.15.0-43-generic, x86_64: installed. The GPU fan gets loud while only the first iteration's output appears in the console, and it is still possible to kill the process with kill PID (Ctrl-C does not work). So there is some calculation ongoing?!
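
To see exactly which Python line the process is blocked on while the fan spins up, one option (just a sketch, assuming the faulthandler backport package is installed on the Python 2.7 setup above; it is standard library from Python 3.3) is to have faulthandler dump the stack periodically:

import faulthandler
import torch

# Dump all thread stacks every 60 s so a stuck iteration shows where it blocks.
faulthandler.dump_traceback_later(60, repeat=True)

device = torch.device("cuda:0")
x = torch.randn(64, 1000, device=device)
w1 = torch.randn(1000, 100, device=device)

for t in range(500):
    loss = x.mm(w1).clamp(min=0).pow(2).sum().item()  # .item() waits for the GPU
    print(t, loss)

faulthandler.cancel_dump_traceback_later()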

Output of dmesg|grep amdgpu:

[ 1.131352] [drm] amdgpu kernel modesetting enabled.
[ 1.131353] [drm] amdgpu version: 19.10.0.418
[ 1.132585] fb: switching to amdgpudrmfb from VESA VGA
[ 1.132922] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 1.132959] amdgpu 0000:03:00.0: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
[ 1.132960] amdgpu 0000:03:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 1.132961] amdgpu 0000:03:00.0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[ 1.133056] [drm] amdgpu: 16368M of VRAM memory ready
[ 1.133056] [drm] amdgpu: 16368M of GTT memory ready.
[ 1.678594] fbcon: amdgpudrmfb (fb0) is primary device
[ 1.678642] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[ 1.696185] amdgpu 0000:03:00.0: ring gfx uses VM inv eng 4 on hub 0
[ 1.696186] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[ 1.696187] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[ 1.696187] amdgpu 0000:03:00.0: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[ 1.696188] amdgpu 0000:03:00.0: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[ 1.696189] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[ 1.696190] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[ 1.696190] amdgpu 0000:03:00.0: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[ 1.696191] amdgpu 0000:03:00.0: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[ 1.696192] amdgpu 0000:03:00.0: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
[ 1.696192] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 4 on hub 1
[ 1.696193] amdgpu 0000:03:00.0: ring page0 uses VM inv eng 5 on hub 1
[ 1.696194] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 6 on hub 1
[ 1.696194] amdgpu 0000:03:00.0: ring page1 uses VM inv eng 7 on hub 1
[ 1.696195] amdgpu 0000:03:00.0: ring uvd<0> uses VM inv eng 8 on hub 1
[ 1.696196] amdgpu 0000:03:00.0: ring uvd_enc0<0> uses VM inv eng 9 on hub 1
[ 1.696197] amdgpu 0000:03:00.0: ring uvd_enc1<0> uses VM inv eng 10 on hub 1
[ 1.696198] amdgpu 0000:03:00.0: ring vce0 uses VM inv eng 11 on hub 1
[ 1.696198] amdgpu 0000:03:00.0: ring vce1 uses VM inv eng 12 on hub 1
[ 1.696199] amdgpu 0000:03:00.0: ring vce2 uses VM inv eng 13 on hub 1
[ 1.696739] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:03:00.0 on minor 0

No output for dmesg | grep amdkfd.

iotamudelta commented 5 years ago

@csuji Can you test with our ROCm 2.1 docker images? We fixed a hang bug in that release. Thanks!
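
Before re-running the repro in the new container, a quick sanity check (just a sketch) to confirm which build and device are actually being picked up:

import torch

print(torch.__version__)              # e.g. 1.0.0a0+<commit>
print(torch.cuda.is_available())      # ROCm devices are exposed through torch.cuda
print(torch.cuda.get_device_name(0))  # should name the Vega FE card
# Some ROCm builds also report the HIP version, if the attribute is present:
print(getattr(torch.version, "hip", "n/a"))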

csuji commented 5 years ago

@iotamudelta Ok, thanks! Tested with docker image rocm/pytorch:rocm2.1_ubuntu16.04_pytorch_gfx900. Works now!