YumingChang02 opened this issue 1 year ago
I am running the ImageNet `main.py` from the PyTorch examples repo (github link) and get the following output:
```
=> creating model 'mobilenet_v3_small'
Epoch: [0][   1/1424] Time 60.000 (60.000) Data 15.133 (15.133) Loss 6.9078e+00 (6.9078e+00) Acc@1 0.00 ( 0.00) Acc@5 0.33 ( 0.33)
Epoch: [0][  21/1424] Time 1.175 ( 4.022) Data 0.001 ( 0.722) Loss 6.8997e+00 (6.9054e+00) Acc@1 0.33 ( 0.11) Acc@5 0.78 ( 0.60)
Epoch: [0][  41/1424] Time 1.186 ( 2.652) Data 0.001 ( 0.384) Loss 6.8966e+00 (6.9001e+00) Acc@1 0.00 ( 0.17) Acc@5 0.56 ( 0.70)
Epoch: [0][  61/1424] Time 1.205 ( 2.193) Data 0.001 ( 0.269) Loss 6.8423e+00 (6.8909e+00) Acc@1 0.22 ( 0.17) Acc@5 1.33 ( 0.75)
Epoch: [0][  81/1424] Time 1.188 ( 1.951) Data 0.001 ( 0.209) Loss 6.8351e+00 (6.8794e+00) Acc@1 0.00 ( 0.16) Acc@5 1.00 ( 0.80)
Epoch: [0][ 101/1424] Time 1.199 ( 1.805) Data 0.001 ( 0.174) Loss 6.8079e+00 (6.8674e+00) Acc@1 0.44 ( 0.17) Acc@5 1.11 ( 0.86)
Epoch: [0][ 121/1424] Time 1.168 ( 1.706) Data 0.001 ( 0.149) Loss 6.7743e+00 (6.8548e+00) Acc@1 0.22 ( 0.18) Acc@5 1.78 ( 0.89)
Epoch: [0][ 141/1424] Time 1.234 ( 1.636) Data 0.001 ( 0.132) Loss 6.7362e+00 (6.8401e+00) Acc@1 0.00 ( 0.20) Acc@5 1.33 ( 0.96)
Epoch: [0][ 161/1424] Time 1.221 ( 1.583) Data 0.001 ( 0.119) Loss 6.7279e+00 (6.8240e+00) Acc@1 0.33 ( 0.23) Acc@5 1.11 ( 1.06)
Epoch: [0][ 181/1424] Time 1.216 ( 1.546) Data 0.001 ( 0.109) Loss 6.6291e+00 (6.8059e+00) Acc@1 0.89 ( 0.27) Acc@5 2.11 ( 1.16)
Epoch: [0][ 201/1424] Time 1.201 ( 1.513) Data 0.001 ( 0.101) Loss 6.5352e+00 (6.7850e+00) Acc@1 0.78 ( 0.29) Acc@5 2.89 ( 1.27)
Epoch: [0][ 221/1424] Time 1.207 ( 1.485) Data 0.001 ( 0.095) Loss 6.4816e+00 (6.7620e+00) Acc@1 0.89 ( 0.32) Acc@5 3.00 ( 1.38)
Epoch: [0][ 241/1424] Time 1.219 ( 1.462) Data 0.001 ( 0.089) Loss 6.3946e+00 (6.7378e+00) Acc@1 1.11 ( 0.35) Acc@5 3.56 ( 1.48)
Epoch: [0][ 261/1424] Time 1.209 ( 1.442) Data 0.001 ( 0.084) Loss 6.3972e+00 (6.7125e+00) Acc@1 0.89 ( 0.38) Acc@5 2.78 ( 1.61)
Epoch: [0][ 281/1424] Time 1.234 ( 1.425) Data 0.001 ( 0.080) Loss 6.3541e+00 (6.6868e+00) Acc@1 0.67 ( 0.41) Acc@5 3.56 ( 1.78)
Epoch: [0][ 301/1424] Time 1.174 ( 1.411) Data 0.001 ( 0.077) Loss 6.2194e+00 (6.6605e+00) Acc@1 1.33 ( 0.46) Acc@5 4.89 ( 1.94)
Epoch: [0][ 321/1424] Time 1.230 ( 1.399) Data 0.001 ( 0.074) Loss 6.2427e+00 (6.6346e+00) Acc@1 1.00 ( 0.49) Acc@5 4.00 ( 2.11)
Epoch: [0][ 341/1424] Time 1.202 ( 1.387) Data 0.001 ( 0.071) Loss 6.1785e+00 (6.6091e+00) Acc@1 1.00 ( 0.54) Acc@5 4.56 ( 2.28)
Epoch: [0][ 361/1424] Time 1.215 ( 1.377) Data 0.001 ( 0.069) Loss 6.1027e+00 (6.5836e+00) Acc@1 1.78 ( 0.58) Acc@5 6.44 ( 2.45)
Epoch: [0][ 381/1424] Time 1.175 ( 1.369) Data 0.001 ( 0.067) Loss 6.1400e+00 (6.5580e+00) Acc@1 1.22 ( 0.62) Acc@5 5.22 ( 2.63)
Epoch: [0][ 401/1424] Time 1.216 ( 1.361) Data 0.001 ( 0.065) Loss 6.0648e+00 (6.5339e+00) Acc@1 1.89 ( 0.67) Acc@5 7.22 ( 2.82)
Epoch: [0][ 421/1424] Time 1.192 ( 1.353) Data 0.001 ( 0.063) Loss 6.0373e+00 (6.5096e+00) Acc@1 2.33 ( 0.73) Acc@5 7.56 ( 2.99)
Epoch: [0][ 441/1424] Time 1.213 ( 1.347) Data 0.001 ( 0.061) Loss 5.9490e+00 (6.4868e+00) Acc@1 2.11 ( 0.78) Acc@5 7.00 ( 3.17)
Epoch: [0][ 461/1424] Time 1.209 ( 1.341) Data 0.001 ( 0.060) Loss 5.8554e+00 (6.4637e+00) Acc@1 2.89 ( 0.83) Acc@5 8.67 ( 3.36)
Epoch: [0][ 481/1424] Time 1.181 ( 1.335) Data 0.001 ( 0.059) Loss 5.8994e+00 (6.4411e+00) Acc@1 2.00 ( 0.89) Acc@5 7.89 ( 3.55)
Epoch: [0][ 501/1424] Time 1.195 ( 1.331) Data 0.001 ( 0.058) Loss 5.8446e+00 (6.4185e+00) Acc@1 2.67 ( 0.94) Acc@5 8.89 ( 3.75)
Epoch: [0][ 521/1424] Time 1.124 ( 1.326) Data 0.001 ( 0.057) Loss 5.8407e+00 (6.3959e+00) Acc@1 2.33 ( 1.00) Acc@5 7.33 ( 3.93)
Epoch: [0][ 541/1424] Time 1.214 ( 1.322) Data 0.001 ( 0.055) Loss 5.7237e+00 (6.3733e+00) Acc@1 2.78 ( 1.06) Acc@5 11.22 ( 4.15)
Epoch: [0][ 561/1424] Time 1.205 ( 1.318) Data 0.001 ( 0.054) Loss 5.7182e+00 (6.3512e+00) Acc@1 3.11 ( 1.13) Acc@5 10.11 ( 4.37)
Epoch: [0][ 581/1424] Time 1.190 ( 1.314) Data 0.001 ( 0.054) Loss 5.6628e+00 (6.3281e+00) Acc@1 3.78 ( 1.20) Acc@5 11.89 ( 4.60)
Epoch: [0][ 601/1424] Time 1.224 ( 1.310) Data 0.001 ( 0.053) Loss 5.6361e+00 (6.3064e+00) Acc@1 3.78 ( 1.27) Acc@5 11.33 ( 4.82)
Epoch: [0][ 621/1424] Time 1.188 ( 1.307) Data 0.001 ( 0.052) Loss 5.6024e+00 (6.2845e+00) Acc@1 3.11 ( 1.33) Acc@5 10.78 ( 5.04)
Epoch: [0][ 641/1424] Time 1.190 ( 1.304) Data 0.001 ( 0.051) Loss 5.4781e+00 (6.2628e+00) Acc@1 4.11 ( 1.40) Acc@5 14.00 ( 5.27)
Epoch: [0][ 661/1424] Time 1.168 ( 1.301) Data 0.001 ( 0.050) Loss 5.5298e+00 (6.2410e+00) Acc@1 3.67 ( 1.48) Acc@5 11.78 ( 5.51)
Epoch: [0][ 681/1424] Time 1.179 ( 1.298) Data 0.001 ( 0.050) Loss 5.5839e+00 (6.2199e+00) Acc@1 3.33 ( 1.56) Acc@5 12.89 ( 5.75)
Epoch: [0][ 701/1424] Time 1.175 ( 1.296) Data 0.001 ( 0.049) Loss 5.4905e+00 (6.2003e+00) Acc@1 3.78 ( 1.63) Acc@5 13.44 ( 5.97)
Epoch: [0][ 721/1424] Time 1.209 ( 1.294) Data 0.001 ( 0.049) Loss 5.4298e+00 (6.1797e+00) Acc@1 4.11 ( 1.71) Acc@5 13.78 ( 6.20)
Epoch: [0][ 741/1424] Time 1.187 ( 1.292) Data 0.001 ( 0.048) Loss 5.3948e+00 (6.1603e+00) Acc@1 4.56 ( 1.79) Acc@5 14.22 ( 6.43)
Epoch: [0][ 761/1424] Time 1.204 ( 1.289) Data 0.001 ( 0.048) Loss 5.3875e+00 (6.1405e+00) Acc@1 5.56 ( 1.87) Acc@5 14.67 ( 6.67)
Epoch: [0][ 781/1424] Time 1.208 ( 1.287) Data 0.001 ( 0.047) Loss 5.3482e+00 (6.1205e+00) Acc@1 5.78 ( 1.96) Acc@5 17.44 ( 6.91)
Epoch: [0][ 801/1424] Time 1.221 ( 1.285) Data 0.001 ( 0.047) Loss 5.2558e+00 (6.1018e+00) Acc@1 4.67 ( 2.04) Acc@5 16.33 ( 7.14)
Epoch: [0][ 821/1424] Time 1.197 ( 1.283) Data 0.001 ( 0.046) Loss 5.3466e+00 (6.0833e+00) Acc@1 6.00 ( 2.12) Acc@5 16.89 ( 7.37)
Epoch: [0][ 841/1424] Time 1.194 ( 1.282) Data 0.001 ( 0.046) Loss 5.2687e+00 (6.0649e+00) Acc@1 5.33 ( 2.20) Acc@5 17.44 ( 7.59)
Epoch: [0][ 861/1424] Time 1.217 ( 1.280) Data 0.001 ( 0.045) Loss 5.3550e+00 (6.0473e+00) Acc@1 5.78 ( 2.28) Acc@5 14.56 ( 7.81)
Epoch: [0][ 881/1424] Time 1.235 ( 1.278) Data 0.001 ( 0.045) Loss 5.3109e+00 (6.0297e+00) Acc@1 6.00 ( 2.37) Acc@5 18.00 ( 8.04)
Epoch: [0][ 901/1424] Time 1.212 ( 1.277) Data 0.001 ( 0.045) Loss 5.3358e+00 (6.0127e+00) Acc@1 6.44 ( 2.45) Acc@5 18.33 ( 8.26)
Epoch: [0][ 921/1424] Time 1.180 ( 1.275) Data 0.001 ( 0.044) Loss 5.2877e+00 (5.9953e+00) Acc@1 5.89 ( 2.53) Acc@5 18.11 ( 8.50)
Epoch: [0][ 941/1424] Time 1.211 ( 1.274) Data 0.001 ( 0.044) Loss 5.2182e+00 (5.9784e+00) Acc@1 6.44 ( 2.62) Acc@5 18.33 ( 8.72)
Epoch: [0][ 961/1424] Time 1.231 ( 1.272) Data 0.001 ( 0.044) Loss 5.1222e+00 (5.9617e+00) Acc@1 7.89 ( 2.70) Acc@5 20.89 ( 8.95)
Epoch: [0][ 981/1424] Time 1.198 ( 1.271) Data 0.001 ( 0.043) Loss 5.1630e+00 (5.9452e+00) Acc@1 7.67 ( 2.79) Acc@5 20.33 ( 9.17)
Epoch: [0][1001/1424] Time 1.230 ( 1.270) Data 0.001 ( 0.043) Loss 5.0712e+00 (5.9293e+00) Acc@1 7.33 ( 2.87) Acc@5 19.56 ( 9.38)
Epoch: [0][1021/1424] Time 1.192 ( 1.269) Data 0.001 ( 0.043) Loss 5.0792e+00 (5.9138e+00) Acc@1 7.00 ( 2.95) Acc@5 20.00 ( 9.61)
Epoch: [0][1041/1424] Time 1.183 ( 1.268) Data 0.001 ( 0.042) Loss 5.1826e+00 (5.8981e+00) Acc@1 6.22 ( 3.03) Acc@5 19.33 ( 9.82)
Epoch: [0][1061/1424] Time 1.208 ( 1.267) Data 0.001 ( 0.042) Loss 5.1046e+00 (5.8823e+00) Acc@1 7.56 ( 3.12) Acc@5 22.67 ( 10.05)
Epoch: [0][1081/1424] Time 1.224 ( 1.266) Data 0.001 ( 0.042) Loss 5.0414e+00 (5.8669e+00) Acc@1 9.00 ( 3.20) Acc@5 23.22 ( 10.26)
Epoch: [0][1101/1424] Time 1.188 ( 1.265) Data 0.001 ( 0.042) Loss 5.0559e+00 (5.8515e+00) Acc@1 8.33 ( 3.29) Acc@5 20.67 ( 10.48)
Epoch: [0][1121/1424] Time 1.216 ( 1.265) Data 0.001 ( 0.041) Loss 5.0840e+00 (5.8369e+00) Acc@1 7.44 ( 3.38) Acc@5 22.22 ( 10.70)
Epoch: [0][1141/1424] Time 1.223 ( 1.263) Data 0.001 ( 0.041) Loss 4.9830e+00 (5.8229e+00) Acc@1 8.78 ( 3.46) Acc@5 21.56 ( 10.89)
Epoch: [0][1161/1424] Time 1.173 ( 1.262) Data 0.001 ( 0.041) Loss 4.9043e+00 (5.8086e+00) Acc@1 8.11 ( 3.54) Acc@5 22.78 ( 11.11)
Epoch: [0][1181/1424] Time 1.217 ( 1.262) Data 0.001 ( 0.041) Loss 4.9920e+00 (5.7944e+00) Acc@1 8.44 ( 3.62) Acc@5 23.44 ( 11.32)
Epoch: [0][1201/1424] Time 1.208 ( 1.261) Data 0.001 ( 0.041) Loss 4.9523e+00 (5.7801e+00) Acc@1 9.00 ( 3.71) Acc@5 21.78 ( 11.53)
Epoch: [0][1221/1424] Time 1.286 ( 1.260) Data 0.001 ( 0.040) Loss 4.9660e+00 (5.7661e+00) Acc@1 8.33 ( 3.80) Acc@5 23.56 ( 11.73)
Epoch: [0][1241/1424] Time 1.171 ( 1.259) Data 0.001 ( 0.040) Loss 4.9641e+00 (5.7521e+00) Acc@1 8.67 ( 3.89) Acc@5 23.00 ( 11.94)
Epoch: [0][1261/1424] Time 1.212 ( 1.259) Data 0.001 ( 0.040) Loss 5.0178e+00 (5.7388e+00) Acc@1 8.78 ( 3.97) Acc@5 24.33 ( 12.14)
Epoch: [0][1281/1424] Time 1.203 ( 1.258) Data 0.001 ( 0.040) Loss 4.9352e+00 (5.7259e+00) Acc@1 8.78 ( 4.06) Acc@5 24.67 ( 12.34)
Epoch: [0][1301/1424] Time 1.186 ( 1.257) Data 0.001 ( 0.040) Loss 4.8461e+00 (5.7127e+00) Acc@1 11.67 ( 4.14) Acc@5 27.22 ( 12.54)
Epoch: [0][1321/1424] Time 1.209 ( 1.256) Data 0.001 ( 0.039) Loss 4.9172e+00 (5.6998e+00) Acc@1 10.11 ( 4.23) Acc@5 23.44 ( 12.73)
Epoch: [0][1341/1424] Time 1.186 ( 1.256) Data 0.001 ( 0.039) Loss 4.7432e+00 (5.6869e+00) Acc@1 12.00 ( 4.31) Acc@5 28.56 ( 12.93)
Epoch: [0][1361/1424] Time 1.217 ( 1.255) Data 0.001 ( 0.039) Loss 4.8297e+00 (5.6744e+00) Acc@1 10.89 ( 4.40) Acc@5 24.89 ( 13.12)
Epoch: [0][1381/1424] Time 1.203 ( 1.255) Data 0.001 ( 0.039) Loss 4.9124e+00 (5.6622e+00) Acc@1 9.00 ( 4.48) Acc@5 25.22 ( 13.31)
Epoch: [0][1401/1424] Time 1.215 ( 1.254) Data 0.000 ( 0.039) Loss 4.8593e+00 (5.6498e+00) Acc@1 10.56 ( 4.57) Acc@5 25.22 ( 13.51)
Epoch: [0][1421/1424] Time 1.271 ( 1.253) Data 0.000 ( 0.039) Loss 4.7977e+00 (5.6378e+00) Acc@1 10.33 ( 4.65) Acc@5 27.56 ( 13.70)
Test: [ 1/56] Time 19.546 (19.546) Loss 4.2155e+00 (4.2155e+00) Acc@1 16.11 ( 16.11) Acc@5 44.11 ( 44.11)
Test: [21/56] Time 0.352 ( 1.719) Loss 5.6829e+00 (4.9246e+00) Acc@1 2.56 ( 8.41) Acc@5 10.33 ( 24.26)
Test: [41/56] Time 0.345 ( 1.552) Loss 4.9785e+00 (4.9935e+00) Acc@1 8.44 ( 8.10) Acc@5 22.11 ( 23.13)
 * Acc@1 8.258 Acc@5 23.380
Epoch: [1][   1/1424] Time 17.457 (17.457) Data 15.719 (15.719) Loss 4.8509e+00 (4.8509e+00) Acc@1 10.00 ( 10.00) Acc@5 25.00 ( 25.00)
Epoch: [1][  21/1424] Time 1.705 ( 2.485) Data 0.001 ( 0.770) Loss 4.9256e+00 (4.7646e+00) Acc@1 9.78 ( 10.74) Acc@5 24.33 ( 27.21)
Epoch: [1][  41/1424] Time 1.741 ( 2.127) Data 0.001 ( 0.419) Loss 4.7073e+00 (4.7602e+00) Acc@1 11.22 ( 10.82) Acc@5 27.89 ( 27.44)
Epoch: [1][  61/1424] Time 1.709 ( 2.004) Data 0.001 ( 0.297) Loss 4.7274e+00 (4.7496e+00) Acc@1 10.67 ( 11.00) Acc@5 29.33 ( 27.64)
Epoch: [1][  81/1424] Time 1.739 ( 1.940) Data 0.001 ( 0.236) Loss 4.6097e+00 (4.7386e+00) Acc@1 11.56 ( 11.05) Acc@5 30.00 ( 27.81)
```
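As a rough sanity check on the numbers above, here is a small sketch (my own, not part of `main.py`) that parses one of these progress lines and estimates epoch time and throughput. The batch size of 900 is an assumption inferred from ~1.28M ImageNet training images divided by the 1424 steps shown in the log.

```python
import re

# One progress line copied from the log above.
LINE = ("Epoch: [0][1421/1424] Time 1.271 ( 1.253) Data 0.000 ( 0.039) "
        "Loss 4.7977e+00 (5.6378e+00) Acc@1 10.33 ( 4.65) Acc@5 27.56 ( 13.70)")

def parse_progress(line):
    """Return (step, total_steps, avg_batch_seconds) from a main.py progress line."""
    m = re.search(r"\[\d+\]\[\s*(\d+)/(\d+)\] Time \S+ \(\s*([\d.]+)\)", line)
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

step, total, avg = parse_progress(LINE)
batch_size = 900  # assumed: ~1.28M images / 1424 steps per epoch
print(f"~{total * avg / 60:.0f} min/epoch, ~{batch_size / avg:.0f} images/s")
# -> ~30 min/epoch, ~718 images/s
```

So a full epoch on this setup takes roughly half an hour, which gives a baseline against which any slowdown from the runlist warning below could be measured.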
During the run, the kernel repeatedly logs this message:
```
[Wed Dec 7 14:44:07 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[Wed Dec 7 14:44:09 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
```
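As I understand it, the amdgpu driver prints this when more compute queues are active than the hardware scheduler can map at once, so how often it fires is a useful data point. A quick sketch for counting occurrences, run here against the two lines pasted above (on the real machine you would feed it the full `dmesg` output instead):

```python
# Count occurrences of the amdgpu runlist warning in a kernel log excerpt.
# The two sample lines are copied verbatim from the dmesg output above.
dmesg_excerpt = """\
[Wed Dec 7 14:44:07 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[Wed Dec 7 14:44:09 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
"""

hits = [line for line in dmesg_excerpt.splitlines()
        if "Runlist is getting oversubscribed" in line]
print(f"{len(hits)} oversubscription warnings")  # -> 2 oversubscription warnings
```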
System info from inxi (host machine; the training itself runs in the rocm/pytorch:latest container):
```
markchang@X99-TF-8 ~> inxi -Fnx
System:
  Host: X99-TF-8 Kernel: 6.0.11-arch1-1 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
  Console: pty pts/5 Distro: Arch Linux
Machine:
  Type: Desktop System: HUANANZHI product: N/A v: N/A serial: <superuser required>
  Mobo: HUANANZHI model: X99-TF-Q GAMING v: V1.2 serial: <superuser required>
  UEFI: American Megatrends v: 5.11 date: 07/06/2022
CPU:
  Info: 12-core model: Intel Xeon E5-2673 v3 bits: 64 type: MT MCP arch: Haswell rev: 2
  cache: L1: 768 KiB L2: 3 MiB L3: 30 MiB
  Speed (MHz): avg: 2136 high: 3100 min/max: 1200/3100 cores: 1: 2351 2: 2694 3: 1200
  4: 2295 5: 2399 6: 2694 7: 2394 8: 2395 9: 1197 10: 3100 11: 2195 12: 2694 13: 2300
  14: 2694 15: 1199 16: 2294 17: 2394 18: 2000 19: 2394 20: 2394 21: 1200 22: 1200
  23: 1200 24: 2409 bogomips: 114965
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul / PowerColor
    driver: amdgpu v: kernel arch: RDNA-2 bus-ID: 06:00.0
  Device-2: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul / PowerColor
    driver: amdgpu v: kernel arch: RDNA-2 bus-ID: 09:00.0
```
```
Collecting environment information...
PyTorch version: 1.13.0a0+git941769a
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.4.22801-aaa1e3d8

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.4.0 22465 d6f0fe8b22e3d8ce0f2cbd657ea14b16043018a5)
CMake version: version 3.22.1
Libc version: glibc-2.31

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-6.0.11-arch1-1-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon RX 6600
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.4.22801
MIOpen runtime version: 2.19.0
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] torch==1.13.0a0+git941769a
[pip3] torchvision==0.14.0a0+bd70a78
[conda] mkl 2022.0.1 h06a4308_117
[conda] mkl-include 2022.0.1 h06a4308_117
[conda] numpy 1.22.4 pypi_0 pypi
[conda] torch 1.13.0a0+git941769a pypi_0 pypi
[conda] torchvision 0.14.0a0+bd70a78 pypi_0 pypi
```
Hi @YumingChang02, can you watch your GPU operating temperature after the first epoch? If it is getting too high, you might experience slower performance.

```
watch -n 0.1 rocm-smi
```
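If you'd rather log temperatures over the whole run than eyeball `watch`, a sketch like the following could poll `rocm-smi -t` and extract the edge temperature per GPU. The line format in the sample below is an assumption based on ROCm 5.x output and may differ between versions, so the regex may need adjusting:

```python
import re
import subprocess

def gpu_temps(sample=None):
    """Parse edge temperatures from `rocm-smi -t` output.

    Assumed line format (may vary by ROCm version):
      GPU[0] : Temperature (Sensor edge) (C): 64.0
    """
    text = sample if sample is not None else subprocess.run(
        ["rocm-smi", "-t"], capture_output=True, text=True).stdout
    return {int(m.group(1)): float(m.group(2))
            for m in re.finditer(
                r"GPU\[(\d+)\].*?\(Sensor edge\) \(C\):\s*([\d.]+)", text)}

# Hypothetical sample output for the two RX 6600 cards in this system:
sample = """\
GPU[0] : Temperature (Sensor edge) (C): 64.0
GPU[1] : Temperature (Sensor edge) (C): 58.0
"""
print(gpu_temps(sample))  # -> {0: 64.0, 1: 58.0}
```

Calling `gpu_temps()` with no argument shells out to `rocm-smi` itself; logging the result once per epoch would show whether performance drops correlate with thermals.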
Note: this is running inside Docker (rocm/pytorch:latest).