ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Degraded performance after first epoch #1147

Open YumingChang02 opened 1 year ago

YumingChang02 commented 1 year ago

🐛 Describe the bug

Running the ImageNet main.py from the pytorch examples (github link):

=> creating model 'mobilenet_v3_small'
Epoch: [0][   1/1424]   Time 60.000 (60.000)    Data 15.133 (15.133)    Loss 6.9078e+00 (6.9078e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.33 (  0.33)
Epoch: [0][  21/1424]   Time  1.175 ( 4.022)    Data  0.001 ( 0.722)    Loss 6.8997e+00 (6.9054e+00)    Acc@1   0.33 (  0.11)   Acc@5   0.78 (  0.60)
Epoch: [0][  41/1424]   Time  1.186 ( 2.652)    Data  0.001 ( 0.384)    Loss 6.8966e+00 (6.9001e+00)    Acc@1   0.00 (  0.17)   Acc@5   0.56 (  0.70)
Epoch: [0][  61/1424]   Time  1.205 ( 2.193)    Data  0.001 ( 0.269)    Loss 6.8423e+00 (6.8909e+00)    Acc@1   0.22 (  0.17)   Acc@5   1.33 (  0.75)
Epoch: [0][  81/1424]   Time  1.188 ( 1.951)    Data  0.001 ( 0.209)    Loss 6.8351e+00 (6.8794e+00)    Acc@1   0.00 (  0.16)   Acc@5   1.00 (  0.80)
Epoch: [0][ 101/1424]   Time  1.199 ( 1.805)    Data  0.001 ( 0.174)    Loss 6.8079e+00 (6.8674e+00)    Acc@1   0.44 (  0.17)   Acc@5   1.11 (  0.86)
Epoch: [0][ 121/1424]   Time  1.168 ( 1.706)    Data  0.001 ( 0.149)    Loss 6.7743e+00 (6.8548e+00)    Acc@1   0.22 (  0.18)   Acc@5   1.78 (  0.89)
Epoch: [0][ 141/1424]   Time  1.234 ( 1.636)    Data  0.001 ( 0.132)    Loss 6.7362e+00 (6.8401e+00)    Acc@1   0.00 (  0.20)   Acc@5   1.33 (  0.96)
Epoch: [0][ 161/1424]   Time  1.221 ( 1.583)    Data  0.001 ( 0.119)    Loss 6.7279e+00 (6.8240e+00)    Acc@1   0.33 (  0.23)   Acc@5   1.11 (  1.06)
Epoch: [0][ 181/1424]   Time  1.216 ( 1.546)    Data  0.001 ( 0.109)    Loss 6.6291e+00 (6.8059e+00)    Acc@1   0.89 (  0.27)   Acc@5   2.11 (  1.16)
Epoch: [0][ 201/1424]   Time  1.201 ( 1.513)    Data  0.001 ( 0.101)    Loss 6.5352e+00 (6.7850e+00)    Acc@1   0.78 (  0.29)   Acc@5   2.89 (  1.27)
Epoch: [0][ 221/1424]   Time  1.207 ( 1.485)    Data  0.001 ( 0.095)    Loss 6.4816e+00 (6.7620e+00)    Acc@1   0.89 (  0.32)   Acc@5   3.00 (  1.38)
Epoch: [0][ 241/1424]   Time  1.219 ( 1.462)    Data  0.001 ( 0.089)    Loss 6.3946e+00 (6.7378e+00)    Acc@1   1.11 (  0.35)   Acc@5   3.56 (  1.48)
Epoch: [0][ 261/1424]   Time  1.209 ( 1.442)    Data  0.001 ( 0.084)    Loss 6.3972e+00 (6.7125e+00)    Acc@1   0.89 (  0.38)   Acc@5   2.78 (  1.61)
Epoch: [0][ 281/1424]   Time  1.234 ( 1.425)    Data  0.001 ( 0.080)    Loss 6.3541e+00 (6.6868e+00)    Acc@1   0.67 (  0.41)   Acc@5   3.56 (  1.78)
Epoch: [0][ 301/1424]   Time  1.174 ( 1.411)    Data  0.001 ( 0.077)    Loss 6.2194e+00 (6.6605e+00)    Acc@1   1.33 (  0.46)   Acc@5   4.89 (  1.94)
Epoch: [0][ 321/1424]   Time  1.230 ( 1.399)    Data  0.001 ( 0.074)    Loss 6.2427e+00 (6.6346e+00)    Acc@1   1.00 (  0.49)   Acc@5   4.00 (  2.11)
Epoch: [0][ 341/1424]   Time  1.202 ( 1.387)    Data  0.001 ( 0.071)    Loss 6.1785e+00 (6.6091e+00)    Acc@1   1.00 (  0.54)   Acc@5   4.56 (  2.28)
Epoch: [0][ 361/1424]   Time  1.215 ( 1.377)    Data  0.001 ( 0.069)    Loss 6.1027e+00 (6.5836e+00)    Acc@1   1.78 (  0.58)   Acc@5   6.44 (  2.45)
Epoch: [0][ 381/1424]   Time  1.175 ( 1.369)    Data  0.001 ( 0.067)    Loss 6.1400e+00 (6.5580e+00)    Acc@1   1.22 (  0.62)   Acc@5   5.22 (  2.63)
Epoch: [0][ 401/1424]   Time  1.216 ( 1.361)    Data  0.001 ( 0.065)    Loss 6.0648e+00 (6.5339e+00)    Acc@1   1.89 (  0.67)   Acc@5   7.22 (  2.82)
Epoch: [0][ 421/1424]   Time  1.192 ( 1.353)    Data  0.001 ( 0.063)    Loss 6.0373e+00 (6.5096e+00)    Acc@1   2.33 (  0.73)   Acc@5   7.56 (  2.99)
Epoch: [0][ 441/1424]   Time  1.213 ( 1.347)    Data  0.001 ( 0.061)    Loss 5.9490e+00 (6.4868e+00)    Acc@1   2.11 (  0.78)   Acc@5   7.00 (  3.17)
Epoch: [0][ 461/1424]   Time  1.209 ( 1.341)    Data  0.001 ( 0.060)    Loss 5.8554e+00 (6.4637e+00)    Acc@1   2.89 (  0.83)   Acc@5   8.67 (  3.36)
Epoch: [0][ 481/1424]   Time  1.181 ( 1.335)    Data  0.001 ( 0.059)    Loss 5.8994e+00 (6.4411e+00)    Acc@1   2.00 (  0.89)   Acc@5   7.89 (  3.55)
Epoch: [0][ 501/1424]   Time  1.195 ( 1.331)    Data  0.001 ( 0.058)    Loss 5.8446e+00 (6.4185e+00)    Acc@1   2.67 (  0.94)   Acc@5   8.89 (  3.75)
Epoch: [0][ 521/1424]   Time  1.124 ( 1.326)    Data  0.001 ( 0.057)    Loss 5.8407e+00 (6.3959e+00)    Acc@1   2.33 (  1.00)   Acc@5   7.33 (  3.93)
Epoch: [0][ 541/1424]   Time  1.214 ( 1.322)    Data  0.001 ( 0.055)    Loss 5.7237e+00 (6.3733e+00)    Acc@1   2.78 (  1.06)   Acc@5  11.22 (  4.15)
Epoch: [0][ 561/1424]   Time  1.205 ( 1.318)    Data  0.001 ( 0.054)    Loss 5.7182e+00 (6.3512e+00)    Acc@1   3.11 (  1.13)   Acc@5  10.11 (  4.37)
Epoch: [0][ 581/1424]   Time  1.190 ( 1.314)    Data  0.001 ( 0.054)    Loss 5.6628e+00 (6.3281e+00)    Acc@1   3.78 (  1.20)   Acc@5  11.89 (  4.60)
Epoch: [0][ 601/1424]   Time  1.224 ( 1.310)    Data  0.001 ( 0.053)    Loss 5.6361e+00 (6.3064e+00)    Acc@1   3.78 (  1.27)   Acc@5  11.33 (  4.82)
Epoch: [0][ 621/1424]   Time  1.188 ( 1.307)    Data  0.001 ( 0.052)    Loss 5.6024e+00 (6.2845e+00)    Acc@1   3.11 (  1.33)   Acc@5  10.78 (  5.04)
Epoch: [0][ 641/1424]   Time  1.190 ( 1.304)    Data  0.001 ( 0.051)    Loss 5.4781e+00 (6.2628e+00)    Acc@1   4.11 (  1.40)   Acc@5  14.00 (  5.27)
Epoch: [0][ 661/1424]   Time  1.168 ( 1.301)    Data  0.001 ( 0.050)    Loss 5.5298e+00 (6.2410e+00)    Acc@1   3.67 (  1.48)   Acc@5  11.78 (  5.51)
Epoch: [0][ 681/1424]   Time  1.179 ( 1.298)    Data  0.001 ( 0.050)    Loss 5.5839e+00 (6.2199e+00)    Acc@1   3.33 (  1.56)   Acc@5  12.89 (  5.75)
Epoch: [0][ 701/1424]   Time  1.175 ( 1.296)    Data  0.001 ( 0.049)    Loss 5.4905e+00 (6.2003e+00)    Acc@1   3.78 (  1.63)   Acc@5  13.44 (  5.97)
Epoch: [0][ 721/1424]   Time  1.209 ( 1.294)    Data  0.001 ( 0.049)    Loss 5.4298e+00 (6.1797e+00)    Acc@1   4.11 (  1.71)   Acc@5  13.78 (  6.20)
Epoch: [0][ 741/1424]   Time  1.187 ( 1.292)    Data  0.001 ( 0.048)    Loss 5.3948e+00 (6.1603e+00)    Acc@1   4.56 (  1.79)   Acc@5  14.22 (  6.43)
Epoch: [0][ 761/1424]   Time  1.204 ( 1.289)    Data  0.001 ( 0.048)    Loss 5.3875e+00 (6.1405e+00)    Acc@1   5.56 (  1.87)   Acc@5  14.67 (  6.67)
Epoch: [0][ 781/1424]   Time  1.208 ( 1.287)    Data  0.001 ( 0.047)    Loss 5.3482e+00 (6.1205e+00)    Acc@1   5.78 (  1.96)   Acc@5  17.44 (  6.91)
Epoch: [0][ 801/1424]   Time  1.221 ( 1.285)    Data  0.001 ( 0.047)    Loss 5.2558e+00 (6.1018e+00)    Acc@1   4.67 (  2.04)   Acc@5  16.33 (  7.14)
Epoch: [0][ 821/1424]   Time  1.197 ( 1.283)    Data  0.001 ( 0.046)    Loss 5.3466e+00 (6.0833e+00)    Acc@1   6.00 (  2.12)   Acc@5  16.89 (  7.37)
Epoch: [0][ 841/1424]   Time  1.194 ( 1.282)    Data  0.001 ( 0.046)    Loss 5.2687e+00 (6.0649e+00)    Acc@1   5.33 (  2.20)   Acc@5  17.44 (  7.59)
Epoch: [0][ 861/1424]   Time  1.217 ( 1.280)    Data  0.001 ( 0.045)    Loss 5.3550e+00 (6.0473e+00)    Acc@1   5.78 (  2.28)   Acc@5  14.56 (  7.81)
Epoch: [0][ 881/1424]   Time  1.235 ( 1.278)    Data  0.001 ( 0.045)    Loss 5.3109e+00 (6.0297e+00)    Acc@1   6.00 (  2.37)   Acc@5  18.00 (  8.04)
Epoch: [0][ 901/1424]   Time  1.212 ( 1.277)    Data  0.001 ( 0.045)    Loss 5.3358e+00 (6.0127e+00)    Acc@1   6.44 (  2.45)   Acc@5  18.33 (  8.26)
Epoch: [0][ 921/1424]   Time  1.180 ( 1.275)    Data  0.001 ( 0.044)    Loss 5.2877e+00 (5.9953e+00)    Acc@1   5.89 (  2.53)   Acc@5  18.11 (  8.50)
Epoch: [0][ 941/1424]   Time  1.211 ( 1.274)    Data  0.001 ( 0.044)    Loss 5.2182e+00 (5.9784e+00)    Acc@1   6.44 (  2.62)   Acc@5  18.33 (  8.72)
Epoch: [0][ 961/1424]   Time  1.231 ( 1.272)    Data  0.001 ( 0.044)    Loss 5.1222e+00 (5.9617e+00)    Acc@1   7.89 (  2.70)   Acc@5  20.89 (  8.95)
Epoch: [0][ 981/1424]   Time  1.198 ( 1.271)    Data  0.001 ( 0.043)    Loss 5.1630e+00 (5.9452e+00)    Acc@1   7.67 (  2.79)   Acc@5  20.33 (  9.17)
Epoch: [0][1001/1424]   Time  1.230 ( 1.270)    Data  0.001 ( 0.043)    Loss 5.0712e+00 (5.9293e+00)    Acc@1   7.33 (  2.87)   Acc@5  19.56 (  9.38)
Epoch: [0][1021/1424]   Time  1.192 ( 1.269)    Data  0.001 ( 0.043)    Loss 5.0792e+00 (5.9138e+00)    Acc@1   7.00 (  2.95)   Acc@5  20.00 (  9.61)
Epoch: [0][1041/1424]   Time  1.183 ( 1.268)    Data  0.001 ( 0.042)    Loss 5.1826e+00 (5.8981e+00)    Acc@1   6.22 (  3.03)   Acc@5  19.33 (  9.82)
Epoch: [0][1061/1424]   Time  1.208 ( 1.267)    Data  0.001 ( 0.042)    Loss 5.1046e+00 (5.8823e+00)    Acc@1   7.56 (  3.12)   Acc@5  22.67 ( 10.05)
Epoch: [0][1081/1424]   Time  1.224 ( 1.266)    Data  0.001 ( 0.042)    Loss 5.0414e+00 (5.8669e+00)    Acc@1   9.00 (  3.20)   Acc@5  23.22 ( 10.26)
Epoch: [0][1101/1424]   Time  1.188 ( 1.265)    Data  0.001 ( 0.042)    Loss 5.0559e+00 (5.8515e+00)    Acc@1   8.33 (  3.29)   Acc@5  20.67 ( 10.48)
Epoch: [0][1121/1424]   Time  1.216 ( 1.265)    Data  0.001 ( 0.041)    Loss 5.0840e+00 (5.8369e+00)    Acc@1   7.44 (  3.38)   Acc@5  22.22 ( 10.70)
Epoch: [0][1141/1424]   Time  1.223 ( 1.263)    Data  0.001 ( 0.041)    Loss 4.9830e+00 (5.8229e+00)    Acc@1   8.78 (  3.46)   Acc@5  21.56 ( 10.89)
Epoch: [0][1161/1424]   Time  1.173 ( 1.262)    Data  0.001 ( 0.041)    Loss 4.9043e+00 (5.8086e+00)    Acc@1   8.11 (  3.54)   Acc@5  22.78 ( 11.11)
Epoch: [0][1181/1424]   Time  1.217 ( 1.262)    Data  0.001 ( 0.041)    Loss 4.9920e+00 (5.7944e+00)    Acc@1   8.44 (  3.62)   Acc@5  23.44 ( 11.32)
Epoch: [0][1201/1424]   Time  1.208 ( 1.261)    Data  0.001 ( 0.041)    Loss 4.9523e+00 (5.7801e+00)    Acc@1   9.00 (  3.71)   Acc@5  21.78 ( 11.53)
Epoch: [0][1221/1424]   Time  1.286 ( 1.260)    Data  0.001 ( 0.040)    Loss 4.9660e+00 (5.7661e+00)    Acc@1   8.33 (  3.80)   Acc@5  23.56 ( 11.73)
Epoch: [0][1241/1424]   Time  1.171 ( 1.259)    Data  0.001 ( 0.040)    Loss 4.9641e+00 (5.7521e+00)    Acc@1   8.67 (  3.89)   Acc@5  23.00 ( 11.94)
Epoch: [0][1261/1424]   Time  1.212 ( 1.259)    Data  0.001 ( 0.040)    Loss 5.0178e+00 (5.7388e+00)    Acc@1   8.78 (  3.97)   Acc@5  24.33 ( 12.14)
Epoch: [0][1281/1424]   Time  1.203 ( 1.258)    Data  0.001 ( 0.040)    Loss 4.9352e+00 (5.7259e+00)    Acc@1   8.78 (  4.06)   Acc@5  24.67 ( 12.34)
Epoch: [0][1301/1424]   Time  1.186 ( 1.257)    Data  0.001 ( 0.040)    Loss 4.8461e+00 (5.7127e+00)    Acc@1  11.67 (  4.14)   Acc@5  27.22 ( 12.54)
Epoch: [0][1321/1424]   Time  1.209 ( 1.256)    Data  0.001 ( 0.039)    Loss 4.9172e+00 (5.6998e+00)    Acc@1  10.11 (  4.23)   Acc@5  23.44 ( 12.73)
Epoch: [0][1341/1424]   Time  1.186 ( 1.256)    Data  0.001 ( 0.039)    Loss 4.7432e+00 (5.6869e+00)    Acc@1  12.00 (  4.31)   Acc@5  28.56 ( 12.93)
Epoch: [0][1361/1424]   Time  1.217 ( 1.255)    Data  0.001 ( 0.039)    Loss 4.8297e+00 (5.6744e+00)    Acc@1  10.89 (  4.40)   Acc@5  24.89 ( 13.12)
Epoch: [0][1381/1424]   Time  1.203 ( 1.255)    Data  0.001 ( 0.039)    Loss 4.9124e+00 (5.6622e+00)    Acc@1   9.00 (  4.48)   Acc@5  25.22 ( 13.31)
Epoch: [0][1401/1424]   Time  1.215 ( 1.254)    Data  0.000 ( 0.039)    Loss 4.8593e+00 (5.6498e+00)    Acc@1  10.56 (  4.57)   Acc@5  25.22 ( 13.51)
Epoch: [0][1421/1424]   Time  1.271 ( 1.253)    Data  0.000 ( 0.039)    Loss 4.7977e+00 (5.6378e+00)    Acc@1  10.33 (  4.65)   Acc@5  27.56 ( 13.70)
Test: [ 1/56]   Time 19.546 (19.546)    Loss 4.2155e+00 (4.2155e+00)    Acc@1  16.11 ( 16.11)   Acc@5  44.11 ( 44.11)
Test: [21/56]   Time  0.352 ( 1.719)    Loss 5.6829e+00 (4.9246e+00)    Acc@1   2.56 (  8.41)   Acc@5  10.33 ( 24.26)
Test: [41/56]   Time  0.345 ( 1.552)    Loss 4.9785e+00 (4.9935e+00)    Acc@1   8.44 (  8.10)   Acc@5  22.11 ( 23.13)
 *   Acc@1 8.258 Acc@5 23.380
Epoch: [1][   1/1424]   Time 17.457 (17.457)    Data 15.719 (15.719)    Loss 4.8509e+00 (4.8509e+00)    Acc@1  10.00 ( 10.00)   Acc@5  25.00 ( 25.00)
Epoch: [1][  21/1424]   Time  1.705 ( 2.485)    Data  0.001 ( 0.770)    Loss 4.9256e+00 (4.7646e+00)    Acc@1   9.78 ( 10.74)   Acc@5  24.33 ( 27.21)
Epoch: [1][  41/1424]   Time  1.741 ( 2.127)    Data  0.001 ( 0.419)    Loss 4.7073e+00 (4.7602e+00)    Acc@1  11.22 ( 10.82)   Acc@5  27.89 ( 27.44)
Epoch: [1][  61/1424]   Time  1.709 ( 2.004)    Data  0.001 ( 0.297)    Loss 4.7274e+00 (4.7496e+00)    Acc@1  10.67 ( 11.00)   Acc@5  29.33 ( 27.64)
Epoch: [1][  81/1424]   Time  1.739 ( 1.940)    Data  0.001 ( 0.236)    Loss 4.6097e+00 (4.7386e+00)    Acc@1  11.56 ( 11.05)   Acc@5  30.00 ( 27.81)
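Each `x.xxx ( y.yyy)` pair in the log above is a per-iteration value followed by a running average, the pattern produced by the `AverageMeter` helper in the examples' main.py. A minimal sketch of that pattern (a simplified reconstruction, not the exact upstream code):

```python
class AverageMeter:
    """Track a current value and its running average, as printed in the log."""

    def __init__(self, name, fmt=":f"):
        self.name = name
        self.fmt = fmt
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = "{name} {val" + self.fmt + "} ({avg" + self.fmt + "})"
        return fmtstr.format(name=self.name, val=self.val, avg=self.avg)


batch_time = AverageMeter("Time", ":6.3f")
batch_time.update(60.0)  # first iteration pays data-loading/warm-up cost
batch_time.update(1.0)
print(batch_time)  # → Time  1.000 (30.500)
```

This is why the averaged `Time` column starts high (60.000) and decays toward the steady-state per-iteration time: the warm-up cost of iteration 1 is amortized into the running mean.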

During the run, there is a kernel message:

[Wed Dec  7 14:44:07 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[Wed Dec  7 14:44:09 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.

System info with inxi (host machine, running rocm/pytorch:latest):

markchang@X99-TF-8 ~> inxi -Fnx
System:
  Host: X99-TF-8 Kernel: 6.0.11-arch1-1 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
    Console: pty pts/5 Distro: Arch Linux
Machine:
  Type: Desktop System: HUANANZHI product: N/A v: N/A serial: <superuser required>
  Mobo: HUANANZHI model: X99-TF-Q GAMING v: V1.2 serial: <superuser required>
    UEFI: American Megatrends v: 5.11 date: 07/06/2022
CPU:
  Info: 12-core model: Intel Xeon E5-2673 v3 bits: 64 type: MT MCP arch: Haswell rev: 2 cache:
    L1: 768 KiB L2: 3 MiB L3: 30 MiB
  Speed (MHz): avg: 2136 high: 3100 min/max: 1200/3100 cores: 1: 2351 2: 2694 3: 1200 4: 2295
    5: 2399 6: 2694 7: 2394 8: 2395 9: 1197 10: 3100 11: 2195 12: 2694 13: 2300 14: 2694 15: 1199
    16: 2294 17: 2394 18: 2000 19: 2394 20: 2394 21: 1200 22: 1200 23: 1200 24: 2409
    bogomips: 114965
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul / PowerColor driver: amdgpu
    v: kernel arch: RDNA-2 bus-ID: 06:00.0
  Device-2: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul / PowerColor driver: amdgpu
    v: kernel arch: RDNA-2 bus-ID: 09:00.0

Versions

Note: this is running in Docker (rocm/pytorch:latest).

Collecting environment information...
PyTorch version: 1.13.0a0+git941769a
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.4.22801-aaa1e3d8

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.4.0 22465 d6f0fe8b22e3d8ce0f2cbd657ea14b16043018a5)
CMake version: version 3.22.1
Libc version: glibc-2.31

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-6.0.11-arch1-1-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon RX 6600
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.4.22801
MIOpen runtime version: 2.19.0
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] torch==1.13.0a0+git941769a
[pip3] torchvision==0.14.0a0+bd70a78
[conda] mkl                       2022.0.1           h06a4308_117
[conda] mkl-include               2022.0.1           h06a4308_117
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.13.0a0+git941769a          pypi_0    pypi
[conda] torchvision               0.14.0a0+bd70a78          pypi_0    pypi
sunway513 commented 1 year ago

Hi @YumingChang02, can you watch your GPU temperature after the first epoch? If it's getting too high, you might experience slower performance.

watch -n 0.1 rocm-smi
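Besides watch, the readings can be polled programmatically by shelling out to rocm-smi. A rough sketch; the table layout varies between rocm-smi versions, so the sample line in the comment and the regex below are assumptions to adapt, not a guaranteed format:

```python
import re
import subprocess

# rocm-smi --showtemp prints lines along the lines of
#   "GPU[0]    : Temperature (Sensor edge) (C): 67.0"
# (exact layout varies by rocm-smi version; adjust the pattern if needed).
TEMP_RE = re.compile(r"\(C\):\s*([0-9]+(?:\.[0-9]+)?)")


def parse_temps(smi_output):
    """Extract every Celsius reading from rocm-smi text output."""
    return [float(v) for v in TEMP_RE.findall(smi_output)]


def read_gpu_temps():
    """Run rocm-smi --showtemp and return the reported temperatures.

    Assumes rocm-smi is on PATH (it is inside the rocm/pytorch container).
    """
    out = subprocess.run(["rocm-smi", "--showtemp"],
                         capture_output=True, text=True, check=True)
    return parse_temps(out.stdout)
```

Logging read_gpu_temps() once per epoch next to the Time column would show whether the epoch-1 slowdown tracks temperature (thermal throttling) or correlates with the runlist-oversubscription warning instead.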