ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0
685 stars, 94 forks

INTERNAL: bitcode module not found at ./opencl.bc when running with "TF_XLA_FLAGS=--tf_xla_auto_jit=2" #1591

Closed (tedliosu closed 5 months ago)

tedliosu commented 2 years ago

Describe the current behavior

  1. git clone https://github.com/tensorflow/benchmarks.git
  2. cd ./benchmarks/scripts/tf_cnn_benchmarks/
  3. Running:

    • TF_XLA_FLAGS=--tf_xla_auto_jit=2 TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 results in the following error:
      
      2022-02-27 16:36:42.512830: E tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:292] bitcode module is required by this HLO module but was not found at ./opencl.bc
      2022-02-27 16:36:42.513381: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:436 : INTERNAL: bitcode module not found at ./opencl.bc
      INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, Graph execution error:

      2 root error(s) found.
        (0) INTERNAL: bitcode module not found at ./opencl.bc
             [[{{node cluster_3_1/xla_compile}}]]
             [[cluster_1_1/merge_oidx_0/_567]]
        (1) INTERNAL: bitcode module not found at ./opencl.bc
             [[{{node cluster_3_1/xla_compile}}]]
      0 successful operations.
      0 derived errors ignored.

    
    I've attached the full output of the command at this step in [this file](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/files/8149498/output_leading_to_error.txt). 

Describe the expected behavior

Standalone code to reproduce the issue

Please refer to the steps above for reproducing the issue; they git clone the code from GitHub.

tedliosu commented 2 years ago

Please let me know if there's anything else I may be able to contribute in order to resolve this issue.

tedliosu commented 2 years ago

Ok after doing some programming which refreshed my knowledge of how executables look for missing files on Linux in general, I discovered a pretty hacky work-around for this issue at hand:

(tensorflow_rocm) bkupuntu@opencl-os:~/github_repo_installs/benchmarks/scripts/tf_cnn_benchmarks$ ls -l *.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 32 Oct  1 04:12 ockl.bc -> /opt/rocm/amdgcn/bitcode/ockl.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 58 Oct  1 04:15 oclc_correctly_rounded_sqrt_on.bc -> /opt/rocm/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 44 Oct  1 04:14 oclc_daz_opt_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_daz_opt_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 48 Oct  1 04:13 oclc_finite_only_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_finite_only_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 49 Oct  1 04:17 oclc_isa_version_1030.bc -> /opt/rocm/amdgcn/bitcode/oclc_isa_version_1030.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 48 Oct  1 04:15 oclc_unsafe_math_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_unsafe_math_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 51 Oct  1 04:16 oclc_wavefrontsize64_on.bc -> /opt/rocm/amdgcn/bitcode/oclc_wavefrontsize64_on.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 32 Oct  1 04:12 ocml.bc -> /opt/rocm/amdgcn/bitcode/ocml.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 34 Oct  1 04:11 opencl.bc -> /opt/rocm/amdgcn/bitcode/opencl.bc

So the tensorflow-rocm build is simply NOT looking in the correct directory for the bitcode files, which are under the /opt/rocm/amdgcn/bitcode directory... I'm still not sure what kind of patches the tensorflow source code needs in order for tensorflow to look in the correct directory for those bitcode files :confused:

Btw for others running into the same problem as I am, YMMV on which exact bitcode files should be linked to the current working directory based on what GPU you have (I have a gfx1030-based RX 6800).
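For anyone who'd rather script the symlinking than do it by hand, here's a minimal sketch. It assumes the bitcode files live in /opt/rocm/amdgcn/bitcode (as in the listing above) and simply links every `.bc` file it finds, rather than only the ones a particular GPU needs. The `link_bitcode` helper is a made-up name for illustration and not part of TensorFlow or ROCm.

```python
import os
from pathlib import Path


def link_bitcode(src_dir="/opt/rocm/amdgcn/bitcode", dest_dir="."):
    """Symlink every *.bc file from src_dir into dest_dir; return the names linked."""
    src, dest = Path(src_dir), Path(dest_dir)
    linked = []
    for bc_file in sorted(src.glob("*.bc")):
        target = dest / bc_file.name
        # Leave existing files or links alone so the helper is safe to re-run.
        if target.exists() or target.is_symlink():
            continue
        os.symlink(bc_file.resolve(), target)
        linked.append(bc_file.name)
    return linked
```

Calling `link_bitcode()` from the directory you launch the script in should recreate the kind of layout shown above. It is still a workaround, since it only papers over the path lookup problem.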

Mushoz commented 1 year ago

I am getting the exact same error message, however, that happens even without any environment variables. I am unable to run tensorflow-rocm on a 6900xt under rocm 5.4.0. It used to work just fine previously.

There are similar reports here: https://github.com/RadeonOpenCompute/ROCm/issues/1796

Any idea on how to get it to work again? Are you symlinking the .bc files? Or what exactly are you proposing as a hacky solution? Right now, tensorflow is unusable on RDNA2 cards as far as I can tell.

tedliosu commented 1 year ago

> I am getting the exact same error message, however, that happens even without any environment variables. I am unable to run tensorflow-rocm on a 6900xt under rocm 5.4.0. It used to work just fine previously.
>
> There are similar reports here: RadeonOpenCompute/ROCm#1796
>
> Any idea on how to get it to work again? Are you symlinking the .bc files? Or what exactly are you proposing as a hacky solution? Right now, tensorflow is unusable on RDNA2 cards as far as I can tell.

@Mushoz yes, I am simply symlinking the appropriate files into the current working directory as shown in my previous comment. YMMV as to which exact files to symlink, because it appears to be architecture-dependent (I just kept symlinking whichever file each error said it was looking for until all the errors went away). Sorry to hear that you're running into even worse issues; hopefully my solution helps to fix them 🥺
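The "keep symlinking whatever the error asks for" loop can be made a bit less tedious by pulling the missing file names out of the log. A rough sketch, assuming the error text has the same wording as the `INTERNAL: bitcode module not found at ...` lines in the original report; `missing_bitcode_files` is a hypothetical helper name:

```python
import re

# Matches the path in messages such as
# "INTERNAL: bitcode module not found at ./opencl.bc".
_MISSING_BC = re.compile(r"bitcode module not found at (\S+\.bc)")


def missing_bitcode_files(log_text):
    """Return the distinct bitcode paths reported missing, in first-seen order."""
    seen = []
    for match in _MISSING_BC.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen
```

Feeding it the captured stderr of a failed run gives the list of files to link next. Since each run may only report the first file it fails on, several rounds may still be needed.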

Mushoz commented 1 year ago

Cheers, that worked wonderfully! I really wonder why this isn't reported by more people. A simple model with just one dense layer with some randomly generated features and targets refuses to run on my 6900XT, so even in the most simple of cases it's completely broken without symlinking. I did not have to do that previously, so that's a big regression. This is all without any switches, just a purely stock tensorflow-rocm installation and execution.
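A reproducer along those lines might look roughly like the sketch below. This is a reconstruction for illustration and not the exact script that was run: the sample count, feature count, optimizer and loss are all assumptions, and the helper names are made up.

```python
import random


def make_random_data(n_samples=256, n_features=8, seed=0):
    """Generate random features and targets as plain Python lists."""
    rng = random.Random(seed)
    features = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
    targets = [[rng.random()] for _ in range(n_samples)]
    return features, targets


def run_repro():
    """Fit a single Dense layer for one epoch on random data."""
    # Imported lazily so the data helper can be used without TensorFlow installed.
    import tensorflow as tf

    features, targets = make_random_data()
    x = tf.constant(features, dtype=tf.float32)
    y = tf.constant(targets, dtype=tf.float32)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(len(features[0]),)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    model.fit(x, y, epochs=1, batch_size=32, verbose=2)

# On a machine with tensorflow-rocm installed, call run_repro() to try it.
```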

xupit3r commented 1 year ago

hey, so i was having this same issue with a Radeon Pro VII (gfx906) on ubuntu 22.04 using rocm 5.4.1, and it turns out that if i set ROCM_PATH to /opt/rocm (which is where all the library and bitcode goodies are), XLA could compile and run.

jasondrusso commented 1 year ago

@tedliosu I can also confirm this issue with my 6800XT and your solution working for me as well. Seems like there should be an environment variable that should resolve what is essentially a path problem. Updating the ROCM_PATH didn't help for me, though.

FYI, I am observing this problem with ROCM 5.4.1.

vsrikarunyan commented 1 year ago

> hey, so i was having this same issue with a Radeon Pro VII (gfx906) on ubuntu 22.04 using rocm 5.4.1, and it turns out that if i set ROCM_PATH to /opt/rocm (which is where all the library and bitcode goodies are), XLA could compile and run.

I used to have this issue taken care of by setting ROCM_HOME in the past, but this time it needed ROCM_PATH. As far as my environment is concerned, it is identical to yours.
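If you'd rather set the variables from inside the Python script than in the shell, something like the sketch below should be equivalent. It assumes ROCm is installed at /opt/rocm and sets both ROCM_PATH and ROCM_HOME, since different versions seem to want different ones. Setting them before importing tensorflow is a precaution on my part; the thread doesn't establish when they are actually read, and at least one person above reports that ROCM_PATH alone didn't help. `ensure_rocm_env` is a made-up helper name.

```python
import os


def ensure_rocm_env(rocm_root="/opt/rocm"):
    """Point ROCM_PATH and ROCM_HOME at rocm_root unless they are already set."""
    names = ("ROCM_PATH", "ROCM_HOME")
    for name in names:
        # setdefault keeps any value the user has already exported.
        os.environ.setdefault(name, rocm_root)
    return {name: os.environ[name] for name in names}


# Intended usage, with the variables in place before TensorFlow loads:
#   ensure_rocm_env()
#   import tensorflow as tf
```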

tedliosu commented 1 year ago

> @tedliosu I can also confirm this issue with my 6800XT and your solution working for me as well. Seems like there should be an environment variable that should resolve what is essentially a path problem. Updating the ROCM_PATH didn't help for me, though.
>
> FYI, I am observing this problem with ROCM 5.4.1.

@jasondrusso Unfortunately, as maintenance for the original code-base used to reproduce this issue has long been abandoned (see this comment for more info), I am no longer able to test whether or not setting ROCM_PATH solves the issue as presented here, at least not without using a completely different code-base as a reproducer. Attempting to run the benchmark that led me to this issue in the first place resulted in this error by the way:

Traceback (most recent call last):
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 68, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/home/bkupuntu/tensorflow_rocm/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/bkupuntu/tensorflow_rocm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 59, in main
    tfversion = cnn_util.tensorflow_version_tuple()
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/cnn_util.py", line 27, in tensorflow_version_tuple
    major, minor, patch = v.split('.')
ValueError: too many values to unpack (expected 3)

So if you don't mind, could you please provide a minimal working example of the code that you were working with that led you to the same error that I originally arrived at as well? Otherwise I unfortunately can't help confirm whether or not this issue is purely a user configuration issue :slightly_frowning_face:

Thanks in advance :smiley:
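For reference, the ValueError in the traceback above comes from `v.split('.')` yielding more than three parts, presumably because the installed build reports a version string with more than three dot-separated components (the thread doesn't show the actual string). A more tolerant version of that parsing step could look like the sketch below. It only illustrates the idea and is not a tested patch to tf_cnn_benchmarks, which may well need other changes too; the four-part string in the comment is a hypothetical example.

```python
def tensorflow_version_tuple(version):
    """Split a version string into (major, minor, rest), tolerating extra parts."""
    # maxsplit=2 keeps everything after the second dot in one piece, so a
    # hypothetical four-part string such as "2.11.0.540" no longer raises.
    major, minor, rest = version.split(".", 2)
    return int(major), int(minor), rest
```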

tedliosu commented 5 months ago

Since I broke the system containing my RX 6800 while attempting to upgrade its system memory, and no longer have the time or energy to maintain my own desktop system, I just sold my RX 6800 (my only AMD GPU). Since I can no longer test any potential fix for this issue, I am closing it for the time being. I'll be more than willing to reopen it if anyone else runs into the same issue.

tedliosu commented 5 months ago

sorry, pressed the wrong button, closing now