Closed tedliosu closed 5 months ago
Please let me know if there's anything else I may be able to contribute in order to resolve this issue.
Ok after doing some programming which refreshed my knowledge of how executables look for missing files on Linux in general, I discovered a pretty hacky work-around for this issue at hand:
(tensorflow_rocm) bkupuntu@opencl-os:~/github_repo_installs/benchmarks/scripts/tf_cnn_benchmarks$ ls -l *.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 32 Oct 1 04:12 ockl.bc -> /opt/rocm/amdgcn/bitcode/ockl.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 58 Oct 1 04:15 oclc_correctly_rounded_sqrt_on.bc -> /opt/rocm/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 44 Oct 1 04:14 oclc_daz_opt_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_daz_opt_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 48 Oct 1 04:13 oclc_finite_only_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_finite_only_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 49 Oct 1 04:17 oclc_isa_version_1030.bc -> /opt/rocm/amdgcn/bitcode/oclc_isa_version_1030.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 48 Oct 1 04:15 oclc_unsafe_math_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_unsafe_math_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 51 Oct 1 04:16 oclc_wavefrontsize64_on.bc -> /opt/rocm/amdgcn/bitcode/oclc_wavefrontsize64_on.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 32 Oct 1 04:12 ocml.bc -> /opt/rocm/amdgcn/bitcode/ocml.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 34 Oct 1 04:11 opencl.bc -> /opt/rocm/amdgcn/bitcode/opencl.bc
So the tensorflow rocm build is simply NOT looking in the correct directory for the bitcode files, which are under the /opt/rocm/amdgcn/bitcode directory... Still not sure what kind of patches are needed in the tensorflow source code in order for tensorflow to look in the correct directory for those bitcode files :confused:
Btw for others running into the same problem as I am, YMMV on which exact bitcode files should be linked to the current working directory based on what GPU you have (I have a gfx1030-based RX 6800).
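For anyone who would rather script the workaround than create each link by hand, here is a minimal sketch. It assumes the bitcode files live in /opt/rocm/amdgcn/bitcode as in the listing above; `link_bitcode` and `BITCODE_DIR` are names made up for this example, not part of ROCm or TensorFlow.

```python
from pathlib import Path

# Assumption: same location as in the `ls -l` listing above; adjust for your install.
BITCODE_DIR = Path("/opt/rocm/amdgcn/bitcode")


def link_bitcode(names, src_dir=BITCODE_DIR, dest_dir="."):
    """Symlink each named .bc file from src_dir into dest_dir.

    Returns the list of links that were actually created.
    """
    src_dir = Path(src_dir).resolve()
    dest_dir = Path(dest_dir)
    created = []
    for name in names:
        target = src_dir / name
        link = dest_dir / name
        if not target.is_file():
            raise FileNotFoundError(f"no such bitcode file: {target}")
        if link.is_symlink() or link.exists():
            continue  # leave anything that is already there untouched
        link.symlink_to(target)
        created.append(link)
    return created
```

Calling `link_bitcode(["opencl.bc", "ocml.bc", "ockl.bc", "oclc_isa_version_1030.bc"])` from the benchmark directory would recreate part of the listing above; which files you need depends on your GPU.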
I am getting the exact same error message, however, that happens even without any environment variables. I am unable to run tensorflow-rocm on a 6900xt under rocm 5.4.0. It used to work just fine previously.
There are similar reports here: https://github.com/RadeonOpenCompute/ROCm/issues/1796
Any idea on how to get it to work again? Are you symlinking the .bc files? Or what exactly are you proposing as a hacky solution? Right now, tensorflow is unusable on RDNA3 cards as far as I can tell.
@Mushoz yes I am simply symlinking the appropriate files into the current working directory as shown in my previous comment; ymmv as to which exact files to symlink (I just kept symlinking each file each error told me it was looking for until all errors went away) bc it appears to be architecture dependent. Sorry to hear that you're running into even worse issues and hopefully my solution helps to fix them 🥺
Cheers, that worked wonderfully! I really wonder why this isn't reported by more people. A simple model with just one dense layer with some randomly generated features and targets refuses to run on my 6900XT, so even in the most simple of cases it's completely broken without symlinking. I did not have to do that previously, so that's a big regression. This is all without any switches, just a purely stock tensorflow-rocm installation and execution.
hey, so I was having this same issue with a Radeon Pro VII (gfx906) on Ubuntu 22.04 using ROCm 5.4.1, and it turns out that if I set ROCM_PATH to /opt/rocm (which is where all the library and bitcode goodies are), XLA could compile and run.
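If you launch from Python rather than a shell, a rough sketch of the same idea is to set the variable before TensorFlow is imported. This thread does not confirm exactly when TensorFlow reads ROCM_PATH, and it did not help everyone here, so treat this as something to try. `ensure_rocm_path` and `bitcode_dir` are hypothetical helper names, and the amdgcn/bitcode layout is assumed from the listing earlier in the thread.

```python
import os
from pathlib import Path


def ensure_rocm_path(default="/opt/rocm"):
    """Set ROCM_PATH if it is unset and return the value now in effect."""
    # setdefault keeps any value the user has already exported.
    return os.environ.setdefault("ROCM_PATH", default)


def bitcode_dir(rocm_path):
    """Directory where the .bc files were found earlier in this thread (assumed layout)."""
    return Path(rocm_path) / "amdgcn" / "bitcode"


if __name__ == "__main__":
    rocm = ensure_rocm_path()
    print("ROCM_PATH =", rocm)
    print("bitcode dir present:", bitcode_dir(rocm).is_dir())
    # import tensorflow as tf  # import only after the variable is set
```

The shell equivalent is simply exporting ROCM_PATH=/opt/rocm before running the script.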
@tedliosu I can also confirm this issue with my 6800XT and your solution working for me as well. Seems like there should be an environment variable that should resolve what is essentially a path problem. Updating the ROCM_PATH didn't help for me, though.
FYI, I am observing this problem with ROCm 5.4.1.
I used to take care of this issue by setting ROCM_HOME in the past, but this time it needed ROCM_PATH. As far as my environment is concerned, it is identical to yours.
@jasondrusso Unfortunately, as maintenance for the original code-base used to reproduce this issue has long been abandoned (see this comment for more info), I am no longer able to test whether or not setting ROCM_PATH solves the issue as presented here, at least not without using a completely different code-base as a reproducer. Attempting to run the benchmark that led me to this issue in the first place resulted in this error, by the way:
Traceback (most recent call last):
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 68, in <module>
    app.run(main) # Raises error on invalid flags, unlike tf.app.run()
  File "/home/bkupuntu/tensorflow_rocm/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/bkupuntu/tensorflow_rocm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 59, in main
    tfversion = cnn_util.tensorflow_version_tuple()
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/cnn_util.py", line 27, in tensorflow_version_tuple
    major, minor, patch = v.split('.')
ValueError: too many values to unpack (expected 3)
So if you don't mind, could you please provide a minimal working example of the code that you were working with that led you to the same error that I originally arrived at as well? Otherwise I unfortunately can't help confirm whether or not this issue is purely a user configuration issue :slightly_frowning_face:
Thanks in advance :smiley:
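For anyone who wants to get past the ValueError in the traceback above: the unpacking in cnn_util.py fails as soon as the version string has more than three dot-separated parts. A minimal sketch of a more tolerant parse follows. `version_tuple` is a stand-in written for this example, not the benchmark's actual function, and the four-part string in the usage note is a made-up illustration.

```python
def version_tuple(version):
    """Return (major, minor, patch) from a dotted version string.

    Extra components after the third are ignored. The patch part is kept
    as a string because it may carry a suffix such as "0rc1".
    """
    parts = version.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least major.minor.patch, got {version!r}")
    major, minor, patch = parts[:3]
    return int(major), int(minor), patch
```

With this, a hypothetical version string like "2.11.0.540" parses to (2, 11, "0") instead of raising.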
Since I broke the system containing my RX 6800 while attempting to upgrade its system memory, and no longer have the time nor energy to maintain my own desktop system, I just sold my RX 6800 (my only AMD GPU). Therefore, since I will not be able to repro any potential fix of this issue anymore, I am closing this issue for the time being. Will be more than willing to reopen this if anyone else runs into the same issue as me.
sorry pressed wrong button closing now
System information
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
git clone https://github.com/tensorflow/benchmarks.git
cd ./benchmarks/scripts/tf_cnn_benchmarks/
Running:
TF_XLA_FLAGS=--tf_xla_auto_jit=2 TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
results in the following error:
2 root error(s) found.
  (0) INTERNAL: bitcode module not found at ./opencl.bc
     [[{{node cluster_3_1/xla_compile}}]]
     [[cluster_1_1/merge_oidx_0/_567]]
  (1) INTERNAL: bitcode module not found at ./opencl.bc
     [[{{node cluster_3_1/xla_compile}}]]
0 successful operations.
0 derived errors ignored.
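Since the error names the file it could not find, the missing bitcode paths can be pulled out of a captured log mechanically. The sketch below assumes the message wording is exactly as in the error above; `missing_bitcode` is a hypothetical helper name.

```python
import re

# Assumption: the message format matches the error text quoted above.
_MISSING_BC = re.compile(r"bitcode module not found at (\S+\.bc)")


def missing_bitcode(error_text):
    """Return the distinct .bc paths named in the error text, in first-seen order."""
    seen = []
    for path in _MISSING_BC.findall(error_text):
        if path not in seen:
            seen.append(path)
    return seen
```

Fed the error above, this returns a single entry for ./opencl.bc, which tells you the runtime is looking in the current working directory.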
Describe the expected behavior
tf_cnn_benchmarks.py should run without the errors above; the errors did not appear with ROCm version 4.5.2 and tensorflow-rocm version 2.7.0. (I've tried using tensorflow-rocm version 2.7.0 and version 2.7.1 with ROCm version 5.0.1, but tensorflow complained that it couldn't find "libamdhip64.so.4".)
Standalone code to reproduce the issue
Please refer to the steps above for reproducing the issue to git clone the code from GitHub.