Closed Cvikli closed 5 years ago
Hi @Cvikli, we are finalizing the 2.0-alpha docker image and it will be available soon. Please stay tuned.
Hi @Cvikli, we've pushed out the preview build docker image for TF2.0-alpha0: rocm/tensorflow:tf2.0-alpha0-preview. Please help review it and let us know your feedback :-) Here's the link to our dockerhub repo: https://cloud.docker.com/u/rocm/repository/docker/rocm/tensorflow/general
Great! Just ordered our first card for testing. :) If the delivery and tests go well, then I will be back with results by April 2.
Thank you for the fast work! I am really excited about it!
Please open a new issue if bugs are found with the 2.0 docker.
Sorry for reopening the thread, but I owe you guys a lot!
The Radeon VII's performance is crazy with TensorFlow 2.0a. In our tests, it came close to the speed of our 2080 Ti (about 10-15% slower)! But the Radeon VII has more memory, which was the bottleneck in our case. At this price, we think this card offers the best value for machine learning in our company!
We are glad to have opened our eyes to AMD products. We are buying our first configuration, which is 40% cheaper and, by our measurements, capable of performing better in our scenario than our well-optimised server configuration.
Thank you for all the work!
@Cvikli
We are glad to have opened our eyes to AMD products. We are buying our first configuration, which is 40% cheaper and, by our measurements, capable of performing better in our scenario than our well-optimised server configuration.
Could you give a bit more detail? How much faster is the Radeon VII for your application? What type of model are you running (CNN/RNN/GAN/etc.)? What processor are you running?
Just curious.
Thank you @Cvikli, great to hear that your experiment went well and that you are going to invest more in ROCm and AMD GPUs!
The system is something like this:
The results with RNN networks on 1 Radeon VII and 1 1080ti were nearly the same.
Now that we have switched over to 4 Radeon VIIs, we face two big scaling issues on convolutional networks.
2019-05-12 15:28:04.632396: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
2019-05-12 15:28:04.632456: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 13.45G (14444931072 bytes) from device: hipError_t(1002)
2019-05-12 15:28:04.632475: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 12.11G (13000437760 bytes) from device: hipError_t(1002)
... many lines like this
2019-05-12 15:36:58.756188: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 310.35M (325421568 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756226: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 279.31M (292879616 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756252: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 251.38M (263591680 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756279: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 226.24M (237232640 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756304: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 203.62M (213509376 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756323: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 183.26M (192158464 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756343: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 164.93M (172942848 bytes) from device: hipError_t(1002)
2019-05-12 15:37:01.337949: E tensorflow/stream_executor/rocm/rocm_driver.cc:493] failed to memset memory: HIP_ERROR_InvalidValue
Segmentation fault (core dumped)
We are pretty sure things should work, because it was working with the Nvidia 1080ti. However, even though it reports that it failed to allocate the memory, the program still starts and seems to run normally, I think.
Could this be happening because of the docker image, so that we can't use separate GPUs for different runs?
What do you guys think about this? Is it normal that we get 10x slower speed when it comes to cuDNN? (To me, cuDNN sounds like purely software with better arithmetic operations, I guess. Is it possible to improve on this?)
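The failed requests in the log above shrink from one line to the next, which suggests the process keeps retrying with smaller sizes. As a rough aid for reading such logs, here is a minimal Python sketch (not part of TensorFlow or ROCm; the function names are made up for illustration) that pulls the byte counts out of the "failed to allocate" lines and reports the ratio between consecutive requests. The sample text reuses three lines from the log above with the timestamp prefix shortened.

```python
import re

# Matches the byte count in lines such as:
#   "... failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)"
ALLOC_RE = re.compile(r"failed to allocate \S+ \((\d+) bytes\)")


def failed_allocation_sizes(log_text):
    """Return the requested byte counts of every failed allocation, in log order."""
    return [int(m.group(1)) for m in ALLOC_RE.finditer(log_text)]


def retry_ratios(sizes):
    """Ratio of each failed request to the one before it, rounded to 3 decimals."""
    return [round(b / a, 3) for a, b in zip(sizes, sizes[1:])]


# Three lines copied from the log in this thread (prefix shortened to "...").
sample_log = """\
... failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
... failed to allocate 13.45G (14444931072 bytes) from device: hipError_t(1002)
... failed to allocate 12.11G (13000437760 bytes) from device: hipError_t(1002)
"""

sizes = failed_allocation_sizes(sample_log)
print(sizes)
print(retry_ratios(sizes))
```

On these three lines each request is about 0.9 times the previous one. That only describes the pattern in the log; it says nothing about why the first allocation failed.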
Hi @Cvikli , let's step back a bit and look at your system configuration:
- 4x SAPPHIRE Radeon VII
- 2x G.SKILL FlareX 64GB
- 1x Thermaltake Toughpower 1500W Gold
The typical gold workstation power supply would run at 87% efficiency at full load, therefore it can supposedly power up to 1307W.
TR 2950x TDP is measured at 180W, Radeon VII TDP is 300W, but the peak power consumption can go up to 321.8W (according to third-party measurement here).
Considering the other components in your workstation, the current 1500W PSU is not sufficient for your system at full load. We'd recommend an 1800W PSU or dual 1000W PSUs to provide sufficient power for 4 Radeon VII GPUs.
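To make the power budget argument concrete, here is a small Python sketch that only adds up the figures quoted in this comment (1500W PSU, 87% assumed efficiency, 180W CPU TDP, 321.8W peak per GPU). Nothing in it is measured, and the draw of RAM, drives, fans and the motherboard is deliberately left out because the thread gives no numbers for them.

```python
# Figures quoted in the comment above; none of them are measured here.
PSU_RATED_W = 1500        # Thermaltake Toughpower 1500W
ASSUMED_EFFICIENCY = 0.87  # "typical gold" efficiency assumed in the comment
CPU_TDP_W = 180           # TR 2950X TDP
GPU_PEAK_W = 321.8        # quoted third-party peak for one Radeon VII
NUM_GPUS = 4


def peak_draw_w(num_gpus, gpu_peak_w, cpu_tdp_w):
    """Peak draw of the GPUs plus the CPU only (other components excluded)."""
    return num_gpus * gpu_peak_w + cpu_tdp_w


draw = peak_draw_w(NUM_GPUS, GPU_PEAK_W, CPU_TDP_W)
derated = PSU_RATED_W * ASSUMED_EFFICIENCY

print(f"GPU + CPU peak draw: {draw:.1f} W")
print(f"PSU rating times assumed efficiency: {derated:.1f} W")
print(f"Headroom against the rated 1500 W: {PSU_RATED_W - draw:.1f} W")
```

Even before counting the remaining components, the GPUs and CPU alone come to about 1467W at peak, which leaves very little margin against the 1500W rating and exceeds the derated figure used in the comment.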
2019-05-12 15:28:04.632396: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
The above error message indicates the target GPU device memory has already been allocated by the other processes. There're a couple of solutions to expose only selected GPUs to the user process:
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri/renderD128 --group-add video
Note that you should see the following four interfaces for your 4x Radeon VII system:
$ ls /dev/dri/render*
/dev/dri/renderD128 /dev/dri/renderD129 /dev/dri/renderD130 /dev/dri/renderD131
We recommend approach #3, as that would isolate the GPUs at a relatively lower level of the ROCm stack.
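As an illustration of the device-passing approach, here is a hedged Python sketch that assembles a docker command line from a list of render nodes. It uses only the flags shown in the command above; the helper name is hypothetical, and the image tag is simply the preview image mentioned earlier in this thread. Which render node belongs to which physical GPU is not stated in the thread, so check that on your own system.

```python
import shlex


def docker_run_command(render_nodes, image):
    """Build a docker command exposing /dev/kfd plus only the given render nodes.

    render_nodes: device paths such as "/dev/dri/renderD128".
    image: docker image to run (an illustrative choice, not prescribed by the thread).
    """
    args = ["docker", "run", "-it", "--network=host", "--device=/dev/kfd"]
    args += [f"--device={node}" for node in render_nodes]
    args += ["--group-add", "video", image]
    return args


# Expose the first two of the four render nodes listed above.
cmd = docker_run_command(
    ["/dev/dri/renderD128", "/dev/dri/renderD129"],
    "rocm/tensorflow:tf2.0-alpha0-preview",
)
print(shlex.join(cmd))
```

The function returns a list rather than a string so it can be passed directly to `subprocess.run` without shell quoting issues.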
For your concern on mGPU performance, could you provide the exact commands to reproduce your observations?
Just FYI, we have been actively running regression tests for single-node multi-GPU performance, and no mGPU performance regression has been reported for TF1.13 on the ROCm2.4 release.
Once you have resolved the power supply concern, taking tf_cnn_benchmarks resnet50 as an example, you should be able to see near-linear scalability on FP32 using the following command with 4 GPUs:
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --model=resnet50 --optimizer=sgd --num_batches=100 --variable_update=replicated --nodistortions --gpu_thread_mode=gpu_shared --num_gpus=4 --all_reduce_spec=pscpu --print_training_accuracy=True --display_every=10
Thank you for the 3 different ways to manage visible devices. The second solution (with export ROCR_VISIBLE_DEVICES=0) WORKED like a charm for us! Interestingly, the third solution didn't restrict the available GPU devices in the docker container.
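For launching several runs from one script, the same environment variable can be set per child process instead of with `export`. The sketch below is a minimal Python example under two assumptions: the helper name is made up, and joining several indices with commas is assumed to be the accepted format for more than one device. The child process here only echoes the variable back; a real run would start the training script instead.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_indices, base_env=None):
    """Copy the environment and set ROCR_VISIBLE_DEVICES to the given indices.

    Assumption: multiple indices are passed as a comma-separated list.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROCR_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


# Stand-in for a training job: the child just prints what it was given.
child = [sys.executable, "-c", "import os; print(os.environ['ROCR_VISIBLE_DEVICES'])"]
result = subprocess.run(
    child,
    env=env_with_visible_gpus([0]),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Because the variable is set on a copy of the environment, the parent process and any other child keep their own device visibility.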
We ran some tests with TF2.0 on ROCm2.4, and performance is still a lot lower than what an Nvidia 1080Ti delivers when benchmarking MobileNetV2, which still bothers us a little. To give some direction for TF2.0 on ROCm2.4, I thought I would share these logs, printed before the calculations start for a MobileNetV2:
2019-05-13 18:48:40.653042: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library librocblas.so
2019-05-13 18:48:40.683726: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libMIOpen.so
2019-05-13 18:48:44.998231: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:45.094061: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
... 2x14 lines like this with Backward-Data and Backward-Filter
2019-05-13 18:48:48.854030: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:48.945517: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-05-13 18:48:49.207930: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:49.295100: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-05-13 18:48:50.639570: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
So I pretty much feel like we are running some operations 19 times, which leads to 10-15x speed loss, but it is only a guess. If I can help in any other way let me know.
PS: on TF2.0 ROCm2.4, I couldn't run tf_cnn_benchmarks.py because tensorflow.contrib is missing.
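The `tensorflow.contrib` package was removed in TensorFlow 2.0, so scripts that import it fail there. A quick way to check what a given environment offers before running such a script is sketched below in Python; the helper name is hypothetical, and note that probing a submodule imports its parent package as a side effect.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the module can be located, False otherwise.

    For a dotted name, importlib imports the parent package first, so a
    missing parent shows up as an ImportError, which is treated as "absent".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        return False


for name in ("tensorflow", "tensorflow.contrib"):
    print(name, "available:", module_available(name))
```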
Hi @Cvikli, glad the ROCr env var worked for you! For approach #3, if you run ROCr-level utils you should see the restricted access (e.g. /opt/rocm/bin/rocminfo); however, since rocm_smi uses a different approach to query the GPU status, you can still see all the GPUs using rocm_smi even if you pass only a limited set of GPU device interfaces to the docker container. Adding @jlgreathouse @y2kenny for awareness.
2019-05-13 18:48:44.998231: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:45.094061: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
The above logs indicate that the time there was actually spent by MIOpen compiling kernels; please refer to my previous comment here for reference. That is a one-time effort: on later runs MIOpen will just pick up the cached kernels under ~/.cache/miopen instead of compiling them again. If you have been using docker containers for your dev work, consider committing the docker container with the compiled MIOpen cache so you can reuse it later.
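To see whether a container already carries a populated kernel cache before timing a run, a small check like the following Python sketch can help. It only counts files under the cache path named above; the function name is made up, and the internal layout of the cache directory is not assumed.

```python
from pathlib import Path


def cache_summary(cache_dir):
    """Return (file_count, total_bytes) for everything under cache_dir.

    A missing directory is reported as (0, 0), i.e. nothing cached yet.
    """
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


count, total = cache_summary("~/.cache/miopen")
print(f"{count} cached files, {total} bytes under ~/.cache/miopen")
```

A count of zero before the first run and a non-zero count afterwards would be consistent with the one-time compilation described above.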
Besides, if your application is built on TF1.x api, you might use the following TF1.13 release instead of using TF2.0 branch built with --config=v1: rocm/tensorflow:rocm2.4-tf1.13-python3
We ported our code from tf2.0 to tf1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on ROCm2.4 release), and we still see NO improvement in speed. The Nvidia 1080Ti still performs 5-10x faster. I don't know if it is because cuDNN or CUDA is not available for Radeon cards, but this performance difference is pretty high.
Hi @Cvikli , could you provide the exact steps to repro your observation? FYI, Tensorflow-ROCm deploys the ROCm MIOpen library to accelerate the DL workloads, the repo is here: https://github.com/ROCmSoftwarePlatform/MIOpen
Anyone tested with the latest Macbook pros?
I ran into the error "failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)" as above. System info:
- Intel® Xeon(R) CPU E5-2630 v2 @ 2.60GHz × 12
- Radeon VII
- 1500 W PSU
- ROCm installed with Tensorflow-rocm 1.13.1 (through pip3)
I have not tried installing tensorflow-rocm through docker.
Any help?
Hi @quocdat32461997 , can you try to set the following environment variables:
export HIP_HIDDEN_FREE_MEM=500
If it still fails, please create a new issue and provide more complete logs.
Problem solved by re-installing ROCm and Tensorflow-rocm. Probably I did not install ROCm properly. Thanks a lot.
Hey there! I would like to know if there will be a new docker image with tensorflow==2.0.0b installed, because so far only the alpha version is available for tf2.0. By the way, we ran the https://github.com/lambdal/lambda-tensorflow-benchmark tests, and the difference between the Nvidia and the Radeon cards is smaller than stated above. If you are interested, I can share the test results here.
Hi @Cvikli , we are preparing the TF2.0 beta release, it's currently under QA test coverage. We'll update here after the new docker image is available.
You guys, you are crazy! I love it! :) Thank you for this speed!
Looks like the link at the beginning of the thread redirects to https://hub.docker.com, here's the link I'm using to track releases: https://hub.docker.com/r/rocm/tensorflow/tags
Hi @Cvikli , we have published the docker container for TF-ROCm 2.0 Beta1. Please kindly check it and let us know if you have any questions: rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2
Hi everyone,
when I run the rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2 docker container or any other container with tensorflow 2.0, trying to import tensorflow results in the following error:
>>> import tensorflow as tf
Illegal instruction (core dumped)
I am using a rx 480 with rocm 2.5 and rocm with tensorflow 1.13 works fine.
Hi @moonshine502 , I've tried a couple of samples using the rocm2.5-tf2.0-beta1-config-v2 docker image on my GFX803 node, those are working fine. Could you provide the steps to reproduce your issue?
Hi @sunway513, thank you for your response.
Hardware: Intel Celeron G3900 (Skylake), AMD Radeon RX 480 (gfx803) Software:
Issue:
Executing python3 -c "import tensorflow as tf"
inside the docker results in
python3 -c "import tensorflow as tf"
Illegal instruction (core dumped)
I am guessing that this error is caused by the cpu not being compatible with the new tensorflow version. Could this be the case?
@moonshine502 I'm running almost the exact same system setup and it's able to load and train for me.
The only difference appears to be the CPU, or possibly the card. I'm using a Ryzen 5 2400G; everything else looks nearly the same. I'm using an RX560 14cu, which registers in Linux as an RX480 (gfx803), ROCm 2.5.27.
I ran through all the steps for training on the MNIST dataset at the link below to confirm tf2.0 was actually working. The evaluation accuracy wasn't the best (~87.7% vs. 98%), but it was able to compute.
https://www.tensorflow.org/beta/tutorials/quickstart/beginner
Edit: included more info.
Hi @dundir, @sunway513,
I am now pretty sure that the cause of the problem is my cpu which does not support avx instructions. It seems that previous versions of tensorflow with rocm were compiled without avx, because they work on my machine. So I may try to build tensorflow 2.0 without avx or get a new cpu.
Thank you for your help.
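An "Illegal instruction" crash on import is consistent with a binary that uses CPU instructions the processor lacks. On Linux, one way to check for AVX before pulling a container is to read the `flags` line of `/proc/cpuinfo`. The Python sketch below does that; the function names are made up, and the approach assumes an x86 Linux host where that file and field exist.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx(cpuinfo_text):
    """True if the 'avx' flag is listed for the first CPU entry."""
    return "avx" in cpu_flags(cpuinfo_text)


try:
    with open("/proc/cpuinfo") as f:
        print("AVX supported:", supports_avx(f.read()))
except OSError:
    # Not a Linux host, or /proc is not mounted.
    print("/proc/cpuinfo is not available on this system")
```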
@sunway513 It looks like there may be an rocm related issue with the accuracy for training a basic mnist model.
Running this code: here GPU passthru stdout: here
The docker container was set up with the same passthru options as 1.13. The resulting accuracy dropped to 87% from the baseline of 97%, and the overall computation time rose to 44s for 5 epochs of training, from the baseline of 20s (no passthru).
No dev passthru stdout: here
@sunway513 Looks like the accuracy issue I previously mentioned regarding mnist was resolved with the latest tf2.0 docker image (rocm/tensorflow:rocm2.6-tf2.0-config-v2-dev).
Thanks, and much appreciated. You guys are doing an awesome job.
Memory being the bottleneck, can we do bfloat16 and int8, float8, float16? Just curious
We ported our code from tf2.0 to tf1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on ROCm2.4 release), and we still see NO improvement in speed. The Nvidia 1080Ti still performs 5-10x faster. I don't know if it is because cuDNN or CUDA is not available for Radeon cards, but this performance difference is pretty high.
cuDNN is not a purely software play; it is backed by actual silicon (dedicated tensor cores for MAD ops), which boosts half-precision performance. I'll need to check if the Radeon VII has dedicated tensor cores as well. Also, Nvidia won't automatically optimize code to make use of tensor cores; that has to be done using cuDNN extensions.
@salmanulhaq 1080Ti has no tensor cores.
We ported our code from tf2.0 to tf1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on ROCm2.4 release), and we still see NO improvement in speed. The Nvidia 1080Ti still performs 5-10x faster. I don't know if it is because cuDNN or CUDA is not available for Radeon cards, but this performance difference is pretty high.
cuDNN is not a purely software play; it is backed by actual silicon (dedicated tensor cores for MAD ops), which boosts half-precision performance. I'll need to check if the Radeon VII has dedicated tensor cores as well. Also, Nvidia won't automatically optimize code to make use of tensor cores; that has to be done using cuDNN extensions.
Do you have a reference for hardware being involved in cuDNN?
cuDNN, afaik, is a pure software play with optimizations and whatnot. What you may be referring to is the Tensor cores, which were added with Volta and carried over to Turing silicon.
Anybody tried TF 2.0 with a Radeon RX 580, with 8GB RAM? Does it work? If it does, has anybody tried running multiple cards in parallel?
I have one of the first-generation Nvidia Titan X cards (pre-Pascal). I'm finally giving up on it. It can only run CUDA drivers from a long time ago, from the year the card was first produced. With anything newer (I've tried them all), the card won't initialize (i.e. the O/S rejects it at the device level). Very sad about this since I paid a ton for it, but it's time to move on.
It ought to work but I'm not convinced that there's a point in running multiple 580s on a single training task. I don't think they'd be fast enough to gain a meaningful speedup (I didn't test rocm, but in a rendering task between a VII and a 580, it was faster to just use the VII than to have them both work together).
Anyone tested with the latest Macbook pros?
Can anyone reply to @QuantumInformation's question, please?
I've now upgraded to the new MBP 16, but not used TFJS for a while, might get into py soon.
Hi @QuantumInformation @kuabhish , please refer to the following doc for ROCm support coverage over OSes: https://github.com/RadeonOpenCompute/ROCm#deploying-rocm There's another thread discussing the Mac support on main ROCm repo: https://github.com/RadeonOpenCompute/ROCm/issues/262
Hi Cvikli,
I have a Radeon VII but am not able to configure it with TensorFlow. Please guide me. I have been struggling to configure this for more than 15 days. Can I use my GPU without docker? Can I use TensorFlow 1.x with the GPU? I have installed ROCm, but the GPU is still not responding while training my model.
My system config: OS: Ubuntu 18.04
Thanks,
Suman
Hi @sumannelli , did you follow the following instructions to install TF? https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-install-basic.md
And certainly, you can use your GPU without docker; that's just a matter of deployment approach. Using docker would likely save you some time configuring the user-space environment for ROCm.
Hi Sunway513, thanks for the reply. I am able to use the AMD Radeon VII with TensorFlow 2.1, but while my model is training, it is using only 3% of memory.
- OS: Ubuntu 18.04
- kernel: 5.3
- rocm: 3.1.3
- tensorflow: 2.1
If I am using any incompatible version, please let me know. Once again, thanks for the quick reply.
Thanks,
Suman Nelli
Hi guys, my CPU is a Ryzen 5 3600 and my GPU is an AMD Radeon RX 5500 XT. Is there any way I could enable TensorFlow GPU using ROCm or other platforms? Please help me out.
Hi @Sifatul22, your configuration should work. Please follow the document here to install ROCm and Tensorflow-rocm: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-install-basic.md Let us know if you have questions, thanks.
@sunway513 Is Navi now supported? Radeon RX 5500 XT is Navi, isn't it?
Hi @briansp2020, Navi is not supported by ROCm yet. Please refer to the following document for the list of GPUs supported by ROCm: https://github.com/RadeonOpenCompute/ROCm#supported-gpus
Hi sunway513, I followed the link you provided to install ROCm, and it installs with Python 2.7, but I want to install with Python 3.6. https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-install-basic.md Please advise. Thanks
Hi @sumannelli, in the same document, if you follow the steps to install the python3 dependencies, then depending on the default python3 version in your environment, you should be able to configure it correctly.
Hi @sunway513,
Thanks for the reply. Now I can run TensorFlow 2 on the AMD Radeon VII.
But now I am using the Object Detection API, which supports tensorflow 1.15.0. When I installed tensorflow-rocm==1.15.0, I got the following error:
Traceback (most recent call last):
File "/home/ideabytes/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow.py", line 58, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "a.ipynb", line 1, in
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace above this error message when asking for help.
Thanks Suman Nelli
Hi sunway513, ROCm 3.1 is not working with Tensorflow-rocm=1.15.0. Please provide a link or reference to download ROCm 2.10. Note: the command below downloads ROCm 3.1, but I need 2.10.
sudo apt install rocm-dkms
My work has stopped because of this. Kindly reply.
I would be curious if Tensorflow 2.0 works with AMD Radeon VII?
Also, if it is available, are there any benchmark comparisons with the 2080Ti on some standard network, to see if we should invest in Radeon VII clusters?