ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

CNN model locks up PC #239

Closed: briansp2020 closed this issue 5 years ago

briansp2020 commented 5 years ago

Please make sure that this is a bug. As per our GitHub policy, we only address code/doc bugs, performance issues, feature requests, and build/installation issues on GitHub.

System information

You can collect some of this information using our environment capture script:

$ cat tf_env.txt

== cat /etc/issue ===============================================
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="18.04.1 LTS (Bionic Beaver)"
VERSION_ID="18.04"
VERSION_CODENAME=bionic

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================

== check for virtualenv =========================================
False

== tensorflow import ============================================
Traceback (most recent call last):
  File "", line 1, in
ImportError: No module named tensorflow

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

== cuda libs ===================================================

== cat /etc/issue ===============================================
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="18.04.1 LTS (Bionic Beaver)"
VERSION_ID="18.04"
VERSION_CODENAME=bionic

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = merge-180917-prev-912-g174d355
tf.COMPILER_VERSION = merge-180917-prev-912-g174d355

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

== cuda libs ===================================================

== cat /etc/issue ===============================================
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="18.04.1 LTS (Bionic Beaver)"
VERSION_ID="18.04"
VERSION_CODENAME=bionic

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux Ryzen1800X 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = merge-180917-prev-912-g174d355
tf.COMPILER_VERSION = merge-180917-prev-912-g174d355
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

== cuda libs ===================================================

Describe the current behavior
The performance of the CNN model is very low, and it locks up the PC when using a larger batch size. The Jupyter notebook at https://github.com/briansp2020/kaggle_quora/blob/master/kernel-cnn.ipynb locks up my PC when using a batch size of 256. Reducing the batch size to 16 allows it to run on my local machine, but it then takes 3+ hours.

Describe the expected behavior
The same code in the Kaggle environment takes about 25 minutes per training epoch with a batch size of 256.
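
(Not part of the original report: by default, TensorFlow 1.x pre-allocates nearly all GPU memory, which can starve a GPU that is also rendering the desktop. A minimal sketch of capping that allocation, assuming the notebook uses tf.keras on TF 1.11; the 0.7 fraction is an arbitrary example value, not a tested setting for this notebook.)

import tensorflow as tf

# Keep TF 1.x from grabbing the whole GPU so the display compositor
# still has VRAM to work with. Both options are standard GPUOptions
# fields; the fraction below is only an illustrative value.
gpu_options = tf.GPUOptions(allow_growth=True,
                            per_process_gpu_memory_fraction=0.7)
config = tf.ConfigProto(gpu_options=gpu_options)
tf.keras.backend.set_session(tf.Session(config=config))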

pricebenjamin commented 5 years ago

I don't believe this is an issue with tensorflow-rocm. Is your GPU responsible for rendering your desktop environment? Does your desktop become completely unresponsive or just very slow to respond? Are you able to interrupt the Python kernel to regain responsiveness?

briansp2020 commented 5 years ago

Is your GPU responsible for rendering your desktop environment?

Yes.

Does your desktop become completely unresponsive or just very slow to respond?

Completely unresponsive. I admit, though, that I only waited a few minutes before hitting a reset button to reboot the machine.

Are you able to interrupt the Python kernel to regain responsiveness?

No.

Ben, do you have access to a Vega FE? If you don't mind, could you try running the code on your setup and see if you can duplicate the problem? I got my Vega FE from eBay, and I've noticed some weird issues when using it under Windows. But since it has worked OK so far under Linux, I just assumed it was a Windows driver problem. Do you think there may be a hardware problem?

BTW, do you work for AMD? I appreciate your help and guidance.

Thanks!

pricebenjamin commented 5 years ago

I actually already ran your code on my Vega FE, and didn't have any issues. I don't use the Vega to render my desktop, however. (I recently switched from using a GTX 960 as my display driver to using my CPU's integrated graphics, specifically because I wanted to use the 960 for ML / compute purposes.)

One thing you might try: run the notebook and then try switching to TTY3 (CTRL + ALT + F3) and logging in to your user. If you're able to get there, you can try running rocm-smi or htop to see GPU or CPU usage, respectively, and kill the python process if you need to.

If you're unable to switch to TTY3, the last option would be to try a "soft reboot" by holding ALT + Print Screen while typing s, u, b. (Your computer should start rebooting once you press b.)

If your computer doesn't respond to anything, then perhaps this is an issue that should be considered by the dev team. I imagine some students / developers / researchers might only have access to their local GPU, and would like to train large networks without the system freezing...

Also, no, I don't work for AMD. I'm just particularly excited about this repository, and hope its community will keep growing. :+1:

P.S. Have you looked at this Keras tutorial? (Specifically the model at the bottom of the page.) It looks a lot like what you're working on. I'm not sure your current implementation is very efficient (e.g., why Conv2D? Why such large convolution kernels?)
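
(For comparison, a minimal sketch of the Conv1D pattern that tutorial uses; the vocabulary size, sequence length, and filter counts below are placeholders, not values taken from the notebook in this issue.)

from tensorflow import keras

# Conv1D slides a small kernel along the token axis, so each filter
# covers only a few embedded tokens rather than the full embedding width.
model = keras.Sequential([
    keras.layers.Embedding(input_dim=50000, output_dim=128, input_length=150),
    keras.layers.Conv1D(128, 5, activation='relu'),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

A Conv2D whose kernel spans the whole embedding dimension computes something similar, but it forces the backend to build one enormous 2-D filter, which is much harder on the GPU.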

sunway513 commented 5 years ago

Thanks @pricebenjamin for the comments :-) Hi @briansp2020, could you try using the GPU for compute only? Also, the ROCm 2.0 kernel driver brings stability improvements, so please try that as well. I'm closing this issue for now; please feel free to reopen it if you still see the problem with ROCm 2.0.

briansp2020 commented 5 years ago

It looks like it was just running out of resources. I now have the Vega FE installed without it driving the monitor. I'm not sure whether that made the difference or whether it was ROCm 2.0, but I no longer get a lockup. I get the following error message instead.

[I 23:16:57.366 NotebookApp] Saving file at /kaggle_quora/kernel-cnn.ipynb
2019-01-15 23:17:42.926174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1530] Found device 0 with properties:
name: Device 6863
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.6
pciBusID 0000:0a:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-01-15 23:17:42.926308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Adding visible gpu devices: 0
2019-01-15 23:17:42.926472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-15 23:17:42.926493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1057] 0
2019-01-15 23:17:42.926503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] 0: N
2019-01-15 23:17:42.926587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 6863, pci bus id: 0000:0a:00.0)
2019-01-15 23:18:13.852309: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
error: local memory limit exceeded (307204) in MIOpenCvBwdWrW
MIOpen Error: /home/dlowell/MIOpenPrivate/src/tmp_dir.cpp:18: Can't execute cd /tmp/miopen-MIOpenConvBwdWrW_LxG_P53.cl-b2bc-9b5b-e769-9c68; /opt/rocm/bin/clang-ocl -DMLO_DIR_FORWARD=0 -DMLO_GRP_SZ=64 -DMLO_GRP_SZ0=64 -DMLO_GRP_SZ1=1 -DMLO_GRP_SZ2=1 -DMLO_FILTER_SIZE0=1200 -DMLO_FILTER_SIZE1=3 -DMLO_FILTER_PAD0=0 -DMLO_FILTER_PAD1=0 -DMLO_FILTER_STRIDE0=1 -DMLO_FILTER_STRIDE1=1 -DSTRIDE_W=1 -DSTRIDE_H=1 -DMLO_N_OUTPUTS=42 -DMLO_N_INPUTS=1 -DMLO_GROUP_COUNTS=1 -DMLO_N_INPUTS_PER_GROUP=1 -DMLO_N_OUTPUTS_PER_GROUP=42 -DMLO_BATCH_SZ=256 -DMLO_N_BATCH_LOOPS=1 -DMLO_OUT_BATCH_STRIDE=6216 -DMLO_OUT_CHANNEL_STRIDE=148 -DMLO_OUT_STRIDE=1 -DMLO_IN_BATCH_STRIDE=180000 -DMLO_IN_CHANNEL_STRIDE=180000 -DMLO_IN_STRIDE=1200 -DMLO_WEI_BATCH_STRIDE=3600 -DMLO_WEI_CHANNEL_STRIDE=3600 -DMLO_IN_WIDTH=1200 -DMLO_IN_HEIGHT=150 -DMLO_OUT_WIDTH=1 -DMLO_OUT_HEIGHT=148 -DMLO_IN_TILE1=1 -DMLO_IN_TILE0=5 -DMLO_N_LCL_BATCHS=1 -DMLO_N_LCL_OUT_MAPS=1 -DMLO_N_LCL_IN_MAPS=1 -DMLO_OUT_TILE0=1200 -DMLO_OUT_TILE1=3 -DMLO_OUT_STACKS=1 -DMLO_N_WAVES=1 -DMLO_READ_TYPE=_FLOAT4 -DMLO_READ_UNIT=4 -DMLO_HW_WAVE_SZ=64 -DMLO_LG2_PHYS_WAVE_SZ=6 -DMLO_IN_EXTENT1=5 -DMLO_IN_N_VERT_LOOPS=30 -DMLO_IN_WIDTH_CHUNK=1499 -DMLO_IN_WIDTH_N_LOOPS=-2147483647 -DMLO_IN_WIDTH_LAST_CHUNK_VALID_READ_UNITS=304 -DMLO_IN_WIDTH_LAST_CHUNK_VALID_PIXELS_IN_LAST_READ_UNIT=4 -DMLO_OUT_WIDTH_CHUNK=300 -DMLO_OUT_WIDTH_N_LOOPS=-2147483647 -DMLO_OUT_WIDTH_LAST_CHUNK_VALID_SPANS=1 -DMLO_OUT_WIDTH_LAST_CHUNK_VALID_PIXELS_IN_LAST_SPAN=1 -DMLO_CONV_BIAS=0 -DMLO_UT_READ_TYPE=_FLOAT4 -DMLO_UT_READ_UNIT=4 -DMLO_UT_GRP_SZ0=256 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_FP16=0 -mcpu=gfx900 -Wno-everything MIOpenConvBwdWrW_LxG_P53.cl -o /tmp/miopen-MIOpenConvBwdWrW_LxG_P53.cl-b2bc-9b5b-e769-9c68/MIOpenConvBwdWrW_LxG_P53.cl.o
2019-01-15 23:18:27.871325: F tensorflow/stream_executor/rocm/rocm_dnn.cc:3421] Check failed: status == miopenStatusSuccess (7 vs. 0)Unable to find a suitable algorithm for doing backward filter convolution

sunway513 commented 5 years ago

@briansp2020 could you try smaller batch sizes? If the issue remains, please provide the logs with the following environment variable set: export MIOPEN_ENABLE_LOGGING=1
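
(One way to do that from inside the notebook, assuming the variable only needs to be present in the process environment before the first convolution runs, is a sketch like this:)

import os

# Set the MIOpen logging variable for this process; run it at the top of
# the notebook, before the model executes, so MIOpen sees it when it
# builds its kernels.
os.environ["MIOPEN_ENABLE_LOGGING"] = "1"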

briansp2020 commented 5 years ago

I think there are some issues with the code. I tried a small batch size and it still runs out of memory. The code has issues on the NVIDIA platform as well. Since it no longer hangs the machine (though I don't know whether that's because the Vega no longer has a display connected or because I upgraded to ROCm 2.0), I consider the issue closed.

sunway513 commented 5 years ago

Thanks for the update @briansp2020 .