ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Inconsistent results with ROCm/Radeon VII #622

Open sebpuetz opened 5 years ago

sebpuetz commented 5 years ago

System information

Describe the current behavior

Identical code leads to different results with ROCm and CUDA, while taking up twice the memory and running at a third of the speed compared to a Tesla P100. The program is a TensorFlow port of a tool originally written in numpy/cupy. The port produces the same results on CUDA as the original numpy/cupy version.

On ROCm, eager mode and the graph-mode port differ massively in performance and also produce different results. On the Tesla P100, both versions produce the same results and take roughly the same time.

Describe the expected behavior

ROCm should produce the same results as CUDA.

Code to reproduce the issue

The port can be found at https://github.com/sebpuetz/cross_lingual_embeddings. The README describes the dependencies. The training procedure in the README is not fit to reproduce this issue since it includes a random component. To disable this random component, set the flag --init_keep_prob 1.
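For intuition, a minimal sketch of why a keep probability of 1 disables the randomness, assuming the flag gates a dropout-style random mask (keep_mask below is hypothetical, not the repo's actual code):

import numpy as np

rng = np.random.RandomState()

def keep_mask(shape, keep_prob):
    # Each entry survives with probability keep_prob; with keep_prob == 1
    # every draw from [0, 1) is below the threshold, so all entries survive
    # and the run becomes deterministic.
    return (rng.uniform(size=shape) < keep_prob).astype(np.float32)

print(keep_mask((5,), 1.0))  # always [1. 1. 1. 1. 1.]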

After installing the dependencies the following should set up reproduction:

git clone https://github.com/sebpuetz/cross_lingual_embeddings -b graph
cd cross_lingual_embeddings
wget http://ixa2.si.ehu.es/martetxe/vecmap/en.emb.txt.gz
wget http://ixa2.si.ehu.es/martetxe/vecmap/it.emb.txt.gz
gunzip en.emb.txt.gz
gunzip it.emb.txt.gz
ff-convert -f textdims --lossy en.emb.txt en.emb.fifu
ff-convert -f textdims --lossy it.emb.txt it.emb.fifu
python src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
python src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1

On the Tesla P100, both eager and graph mode execution finish after 122 epochs with objective = 50.5494%. In both modes, a single epoch takes ~0.4 seconds. Both the number of epochs and the score match the original implementation.

On ROCm/Radeon VII, eager execution finishes after 123 epochs with objective = 50.5478%, taking about 1.4 seconds per epoch. Graph execution finishes after 102 epochs with objective = 50.5154% with ~0.5 seconds per epoch.

When evaluated, all three models produced on the Tesla P100 (original, eager, graph) receive the same scores, while the two models trained on the Radeon VII end up with lower, diverging scores. The evaluation is not described in detail here since it is not necessary to reproduce the issue.

Bengt commented 5 years ago

Hi @sebpuetz,

thanks for the bug report. I tried to reproduce this issue but had to jump through a few more hoops, so I am documenting my procedure:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/tensorflow:rocm2.7-tf1.14-dev
curl https://sh.rustup.rs -sSf | sh -s -- -y
source $HOME/.cargo/env
cargo install finalfusion-utils
python3 -m pip install finalfusion

I could then use the test case setup as you documented:

git clone https://github.com/sebpuetz/cross_lingual_embeddings -b graph
cd cross_lingual_embeddings
wget http://ixa2.si.ehu.es/martetxe/vecmap/en.emb.txt.gz
wget http://ixa2.si.ehu.es/martetxe/vecmap/it.emb.txt.gz
gunzip en.emb.txt.gz
gunzip it.emb.txt.gz
ff-convert -f textdims --lossy en.emb.txt en.emb.fifu
ff-convert -f textdims --lossy it.emb.txt it.emb.fifu

I ran the eager test on my Vega 64 (gfx900, Vega 10):

HIP_VISIBLE_DEVICES=1 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
[...]
Finished after 107 iterations and 4.958332769076029 minutes.
    - Objective:          50.5279%

After the results are printed, I get an error. Could you maybe fix this in your script?

Traceback (most recent call last):
  File "src/map.py", line 337, in <module>
    main()
  File "src/map.py", line 332, in main
    write_embeddings(args.src_output, src_matrix, src_vocab)
  File "src/map.py", line 261, in write_embeddings
    print(word + ' ' + ' '.join(['%.6g' % x for x in matrix[i]]), file=f)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 0: ordinal not in range(128)
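A likely fix, assuming write_embeddings opens the output file itself: pass an explicit encoding to open() so the write does not depend on the container's locale, roughly:

with open(path, 'w', encoding='utf-8') as f:
    # Same print as in the traceback, but the file is now always UTF-8.
    print(word + ' ' + ' '.join(['%.6g' % x for x in matrix[i]]), file=f)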

I tried running the eager test on my Fury X (gfx803, Fiji XT), but I cannot run the script because it needs to allocate more than the available memory. Could you maybe tune this for 4 GB cards?

HIP_VISIBLE_DEVICES=0 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
[...]
Traceback (most recent call last):
  File "src/map.py", line 337, in <module>
    main()
  File "src/map.py", line 297, in main
    args.csls_k, "fwd")
  File "src/map.py", line 223, in build_dict
    simfwd = tf.matmul(a[i:j], b[:b_size], transpose_b=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 2647, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5915, in mat_mul
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20000,20000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:MatMul] name: MatMul/
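For context, the failing tensor alone is already large; a quick back-of-the-envelope check on the shape from the error message:

size_bytes = 20000 * 20000 * 4   # float32 similarity matrix
print(size_bytes / 2**30)        # ~1.49 GiB per live copy

A few live blocks of that size plus the embedding matrices easily exceed 4 GB.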

I ran the eager mode test on my Threadripper 1950X (2 x Zeppelin); the result is almost identical in objective and time to the Vega 64:

HIP_VISIBLE_DEVICES=-1 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
[...]
Finished after 112 iterations and 4.960359720389048 minutes.
    - Objective:          50.5500%

I ran the graph mode test on my Vega 64, which runs far more quickly and reaches a similar objective:

HIP_VISIBLE_DEVICES=1 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
[...]
Finished after 115 iterations and 2.1389734148979187 minutes.
    - Objective:          50.5447%

Again, the graph mode test does not work on my Fury X due to insufficient memory:

HIP_VISIBLE_DEVICES=0 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
[...]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.

Running the graph mode test on my Threadripper 1950X, I see largely consistent results:

HIP_VISIBLE_DEVICES=-1 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1
[...]
Finished after 112 iterations and 5.159188250700633 minutes.
    - Objective:          50.5500%

So yes, I can confirm this issue. Eager mode takes fewer iterations but more time than graph mode on the Vega 64. The consistent iteration counts and objective results on my Threadripper CPU suggest that the implementations of the two modes are indeed equivalent.

sebpuetz commented 5 years ago

Hi, thanks for looking into this!

After result printing, I get an error. Could you maybe fix this in your script?

Sure, although this is not related to the issue. It seems like the default encoding is wrong in the Docker container. I pushed an updated version to the graph branch.
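As a container-side workaround, independent of the script fix: in Python 3, open() without an explicit encoding falls back to locale.getpreferredencoding(), so setting a UTF-8 locale in the container should also avoid the crash (untested in this image):

export LC_ALL=C.UTF-8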

I tried running the eager test on my Fury X (gfx803, Fiji XT), but I cannot run the script because it needs to allocate more than the available memory. Could you maybe tune this for 4 GB cards?

For the eager implementation there is a --batch_size flag. Given that 20k fits into 16 GB, --batch_size 5000 might do the trick, although the required memory may not scale linearly with batch size, so smaller values might be worth trying, too.

I don't have a batched implementation for graph mode, so to compare the implementations on the small GPU, the model size needs to be decreased by setting --vocab_cutoff to a smaller value; e.g. --vocab_cutoff 5000 takes up about 13% of my 16 GB VRAM in both modes.
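For reference, a minimal sketch of the batching idea, assuming the lookup works roughly like the a[i:j] slicing in the eager traceback above (chunked_nn and its arguments are illustrative, not the repo's API):

import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x

def chunked_nn(a, b, batch_size):
    # Nearest neighbour of each row of a among the rows of b, computed in
    # row chunks: only a (batch_size, n) similarity block is live at any
    # time instead of the full (m, n) matrix.
    ids = []
    for i in range(0, int(a.shape[0]), batch_size):
        sim = tf.matmul(a[i:i + batch_size], b, transpose_b=True)
        ids.append(tf.argmax(sim, axis=1))
    return tf.concat(ids, axis=0)

a = tf.random.uniform((100, 16))
b = tf.random.uniform((200, 16))
print(chunked_nn(a, b, 25).shape)  # (100,)

Peak memory for the similarity block then scales with batch_size, which is why smaller values may be needed on 4 GB cards.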

With the small model size, I'm getting the following results on the GPU in eager mode:

python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
Finished after 117 iterations and 0.5067857027053833 minutes.
    - Objective:          56.3423%

and for graph mode:

python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
Finished after 118 iterations and 0.8943199594815572 minutes.
    - Objective:          56.3227%

compared to results on CPU (2700X) in eager mode:

HIP_VISIBLE_DEVICES=-1 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
Finished after 117 iterations and 0.8430814941724142 minutes.
    - Objective:          56.3423%

and graph mode:

HIP_VISIBLE_DEVICES=-1 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
Finished after 117 iterations and 0.8002505819002788 minutes.
    - Objective:          56.3423%

So it looks like the differences persist with smaller models: on the CPU the implementations match and there are only small differences in time; on the GPU, the results are slightly different and there is a sizable mismatch in runtime.
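To narrow this down, one idea (not from the repo) would be to check whether a single op already diverges between devices, e.g. comparing one float32 matmul on GPU against the CPU result:

import numpy as np
import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x

# A large max difference here would point at the GPU kernel rather than
# at the training loop.
x = np.random.RandomState(0).randn(2048, 300).astype(np.float32)
with tf.device('/device:GPU:0'):
    gpu = tf.matmul(x, x, transpose_b=True)
with tf.device('/device:CPU:0'):
    cpu = tf.matmul(x, x, transpose_b=True)
print(np.abs(gpu.numpy() - cpu.numpy()).max())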

Bengt commented 5 years ago

Hi,

thanks for the fix. I do not see the unicode error anymore.

Results for the reduced model size on the Vega 64 are similar to yours on the Radeon VII in eager mode:

HIP_VISIBLE_DEVICES=1 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
[..]
Finished after 120 iterations and 0.5807252566019694 minutes.
    - Objective:          56.3387%

... and graph mode:

HIP_VISIBLE_DEVICES=1 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
[...]
Finished after 120 iterations and 1.0126696745554606 minutes.
    - Objective:          56.3387%

The Fury X yields a better objective result, which might hint at some bug or optimization only present in its code paths in eager mode:

HIP_VISIBLE_DEVICES=0 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
[...]
Finished after 124 iterations and 0.608821705977122 minutes.
    - Objective:          64.1413%

... but not in graph mode:

HIP_VISIBLE_DEVICES=0 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
[...]
Finished after 127 iterations and 1.1048914949099222 minutes.
    - Objective:          56.3503%

The Threadripper 1950X matches the 2700X's objective and, as expected, shows a bit more runtime performance in eager mode:

HIP_VISIBLE_DEVICES=-1 python3 src/map.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
[...]
Finished after 117 iterations and 0.5955867727597555 minutes.
    - Objective:          56.3423%

... and graph mode:

HIP_VISIBLE_DEVICES=-1 python3 src/map_graph.py en.emb.fifu it.emb.fifu en.map.txt it.map.txt --init_keep_prob 1 --vocab_cutoff 5000
[...]
Finished after 117 iterations and 0.5778146624565125 minutes.
    - Objective:          56.3423%

So, yes, I can confirm this issue affects the Fury X, too. It shows the same behavior as the Vega 64 and the Radeon VII, being much faster in eager mode than in graph mode.