ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0
EfficientNet inference yields incorrect results on GPU #519

Closed liamnr2 closed 3 years ago

liamnr2 commented 5 years ago

I'm using ROCm 2.5, tensorflow-rocm 1.13.3 and Python 3.6 with an RX 470.

When running the simple EfficientNet-B0 inference example here: https://github.com/qubvel/efficientnet/blob/master/examples/inference_example.ipynb

the inference of the example image yields incorrect and non-deterministic results. Some examples: [[('n01773549', 'barn_spider', 0.4544877), ('n01776313', 'tick', 0.14279026), ('n03271574', 'electric_fan', 0.06995272), ('n01774750', 'tarantula', 0.059890375), ('n01531178', 'goldfinch', 0.04341215)]]

[[('n01776313', 'tick', 0.52216125), ('n01773549', 'barn_spider', 0.24521892), ('n03271574', 'electric_fan', 0.17396057), ('n01774750', 'tarantula', 0.015509106), ('n03982430', 'pool_table', 0.008740869)]]

[[('n02497673', 'Madagascar_cat', 0.24683656), ('n03976657', 'pole', 0.20120004), ('n03710721', 'maillot', 0.078447856), ('n01773549', 'barn_spider', 0.046732053), ('n01774750', 'tarantula', 0.04341184)]]

When I force it to run on the CPU via CUDA_VISIBLE_DEVICES=, it yields the expected result: [[('n02510455', 'giant_panda', 0.8347932), ('n02134084', 'ice_bear', 0.015602067), ('n02509815', 'lesser_panda', 0.0045535103), ('n02133161', 'American_black_bear', 0.0024719117), ('n02132136', 'brown_bear', 0.0020707578)]]
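
One way to turn this observation into a repeatable check is to compare only the ranked labels of two runs and ignore the scores. The sketch below is an illustration, not part of the linked notebook: the helper names `top_labels` and `same_ranking` are hypothetical, and the input shape is assumed to be the nested list printed above (one image per batch).

```python
def top_labels(decoded, k=5):
    """Return the first k class names from one decoded prediction.

    `decoded` is assumed to look like [[(wnid, label, score), ...]],
    i.e. a batch containing a single image.
    """
    return [label for _wnid, label, _score in decoded[0][:k]]


def same_ranking(run_a, run_b, k=5):
    """True if both runs rank the same k labels in the same order."""
    return top_labels(run_a, k) == top_labels(run_b, k)


# Results copied from the report above.
cpu_run = [[('n02510455', 'giant_panda', 0.8347932),
            ('n02134084', 'ice_bear', 0.015602067),
            ('n02509815', 'lesser_panda', 0.0045535103),
            ('n02133161', 'American_black_bear', 0.0024719117),
            ('n02132136', 'brown_bear', 0.0020707578)]]
gpu_run = [[('n01773549', 'barn_spider', 0.4544877),
            ('n01776313', 'tick', 0.14279026),
            ('n03271574', 'electric_fan', 0.06995272),
            ('n01774750', 'tarantula', 0.059890375),
            ('n01531178', 'goldfinch', 0.04341215)]]

print(same_ranking(cpu_run, gpu_run))
```

Comparing labels rather than scores keeps the check robust against small floating-point differences between devices while still catching the kind of completely different rankings shown here.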

Bengt commented 5 years ago

Hi, @liamnr2!

Welcome to GitHub and thanks for reporting this issue. Seemingly random results are hard to test for, so it is very valuable that you found some.

Unfortunately, the RX 470 uses a Polaris 10 chip, which shares the gfx803 compile target with a number of other popular GPUs. For a list of the affected GPUs, see #479.

There have been quite a number of issues with this compile target, only some of which have been resolved so far. For a full list, see the gfx803 tag.

To find the cause of this behavior, we need to reproduce the issue with various combinations of hardware and software. I can try to help by creating a reproduction procedure.

A wild guess would be to try downgrading rocm-opencl, which has helped with gfx803 in some cases:

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/300#issuecomment-459020227
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/302#issuecomment-459019797

Bengt commented 5 years ago

Procedure for reproduction:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/tensorflow:rocm2.5-tf1.13-python3
python3 -m pip install scikit-image numpy keras efficientnet pytest
wget https://upload.wikimedia.org/wikipedia/commons/f/fe/Giant_Panda_in_Beijing_Zoo_1.JPG
wget https://gist.githubusercontent.com/Bengt/308c7d05dc755f1bfe0aeda9220e4eed/raw//test_efficientnet_gfx803.py
HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
HIP_VISIBLE_DEVICES=-1 python3 -m pytest -s test_efficientnet_gfx803.py

Bengt commented 5 years ago

I can reproduce this issue.

Using GPU 0 fails:

# HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['bow_tie', '... 'guinea_pig'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'bow_tie' != 'giant_panda'

Using GPU 1 fails:

# HIP_VISIBLE_DEVICES=1 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['crutch', 't...ra', 'sorrel'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'crutch' != 'giant_panda'

Using GPU 2 fails:

# HIP_VISIBLE_DEVICES=2 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['jersey', 'w...an_coonhound'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'jersey' != 'giant_panda'

Using GPU 3 fails:

# HIP_VISIBLE_DEVICES=3 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['bolo_tie', ...analog_clock'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'bolo_tie' != 'giant_panda'

These results indeed seem random or non-deterministic:

# HIP_VISIBLE_DEVICES=3 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['oxygen_mask...er', 'maraca'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'oxygen_mask' != 'giant_panda'

Using CPU works fine:

# HIP_VISIBLE_DEVICES=-1 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
===================== 1 passed, 2 warnings in 9.32 seconds =====================

I am using R9 Fury X and R9 Nano GPUs, the latest Ubuntu kernel and ROCm 2.5.27:

$ lspci -v | grep VGA
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca) (prog-if 00 [VGA controller])
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca) (prog-if 00 [VGA controller])
42:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca) (prog-if 00 [VGA controller])
43:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c8) (prog-if 00 [VGA controller])
$ uname -r
4.15.0-54-generic
$ dpkg -l | grep rocm | grep stack
ii  rocm-dev                                      2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-dkms                                     2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-libs                                     2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-utils                                    2.5.27                                       amd64        Radeon Open Compute (ROCm) Runtime software stack

Downgrading rocm-opencl does not help in my case:

cd ~ && mkdir rocm1.9.2-opencl && cd rocm1.9.2-opencl &&
wget https://www.dropbox.com/s/rtwe1zrpuphbyqm/rocm-opencl-1.2.0-2018111340_amd64.deb && 
wget https://www.dropbox.com/s/6gp2g5zju66i4e9/rocm-opencl-dev-1.2.0-2018111340_amd64.deb && 
dpkg -i rocm-opencl*.deb &&
rm -rf ~/.cache &&
cd  -
# HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['artichoke',... 'sea_urchin'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'artichoke' != 'giant_panda'

Bengt commented 5 years ago

This issue persists with rocm/tensorflow:rocm1.9.2-tf1.12-python3:

# HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
[...]
>       assert actual == expected
E       AssertionError: assert ['garter_snak...r', 'echidna'] == ['giant_panda'... 'brown_bear']
E         At index 0 diff: 'garter_snake' != 'giant_panda'

liamnr2 commented 5 years ago

Still a problem with ROCm 2.6.

As an observation, setting MIOPEN_DEBUG_GCN_ASM_KERNELS=0 improves the results: there is still jitter, but far less of it. With EfficientNet-B7 the jitter is minimal, but still present.

EfficientNet-B0, MIOPEN_DEBUG_GCN_ASM_KERNELS=1:

[[('n02510455', 'giant_panda', 0.77773875), ('n02132136', 'brown_bear', 0.01460326), ('n02134084', 'ice_bear', 0.009905247), ('n02133161', 'American_black_bear', 0.009050588), ('n02096585', 'Boston_bull', 0.0070677395)]]
[[('n06359193', 'web_site', 0.058782034), ('n03291819', 'envelope', 0.05005152), ('n04118776', 'rule', 0.044108856), ('n03998194', 'prayer_rug', 0.04122383), ('n04409515', 'tennis_ball', 0.034883693)]]
[[('n03706229', 'magnetic_compass', 0.14726971), ('n04238763', 'slide_rule', 0.11163757), ('n04118776', 'rule', 0.094091), ('n02708093', 'analog_clock', 0.027822705), ('n02794156', 'barometer', 0.02083086)]]
[[('n04118776', 'rule', 0.1692506), ('n03706229', 'magnetic_compass', 0.0850252), ('n04238763', 'slide_rule', 0.06351954), ('n02708093', 'analog_clock', 0.024399932), ('n03857828', 'oscilloscope', 0.019325882)]]
[[('n04238763', 'slide_rule', 0.06601893), ('n04118776', 'rule', 0.057043314), ('n03706229', 'magnetic_compass', 0.043703355), ('n04357314', 'sunscreen', 0.04076335), ('n03929660', 'pick', 0.035940796)]]
[[('n06359193', 'web_site', 0.049492065), ('n04118776', 'rule', 0.049231295), ('n03998194', 'prayer_rug', 0.048374362), ('n03291819', 'envelope', 0.035772696), ('n07248320', 'book_jacket', 0.033176217)]]
[[('n04238763', 'slide_rule', 0.12657635), ('n03706229', 'magnetic_compass', 0.10579053), ('n04118776', 'rule', 0.054984488), ('n04357314', 'sunscreen', 0.04215321), ('n03047690', 'clog', 0.031784806)]]
[[('n04118776', 'rule', 0.32308587), ('n04238763', 'slide_rule', 0.14665197), ('n03706229', 'magnetic_compass', 0.044921804), ('n04357314', 'sunscreen', 0.026250241), ('n02708093', 'analog_clock', 0.023987856)]]
[[('n04118776', 'rule', 0.08757503), ('n03706229', 'magnetic_compass', 0.06836976), ('n04238763', 'slide_rule', 0.06297214), ('n02708093', 'analog_clock', 0.03041393), ('n04039381', 'racket', 0.02478665)]]
[[('n04238763', 'slide_rule', 0.073604986), ('n04357314', 'sunscreen', 0.057176016), ('n04118776', 'rule', 0.055984076), ('n03706229', 'magnetic_compass', 0.05034835), ('n03929660', 'pick', 0.029202135)]]

EfficientNet-B0, MIOPEN_DEBUG_GCN_ASM_KERNELS=0:

[[('n02510455', 'giant_panda', 0.80664486), ('n02134084', 'ice_bear', 0.006699027), ('n02132136', 'brown_bear', 0.0057221507), ('n02509815', 'lesser_panda', 0.004147317), ('n02120079', 'Arctic_fox', 0.0035862043)]]
[[('n02510455', 'giant_panda', 0.75878745), ('n02134084', 'ice_bear', 0.008354737), ('n02132136', 'brown_bear', 0.007207209), ('n02509815', 'lesser_panda', 0.004130219), ('n02120079', 'Arctic_fox', 0.0040210793)]]
[[('n02510455', 'giant_panda', 0.7587877), ('n02134084', 'ice_bear', 0.008354739), ('n02132136', 'brown_bear', 0.0072072037), ('n02509815', 'lesser_panda', 0.0041302163), ('n02120079', 'Arctic_fox', 0.0040210765)]]
[[('n02510455', 'giant_panda', 0.76415765), ('n02134084', 'ice_bear', 0.008157566), ('n02132136', 'brown_bear', 0.0061342083), ('n02509815', 'lesser_panda', 0.0036074982), ('n02120079', 'Arctic_fox', 0.0035751157)]]
[[('n02510455', 'giant_panda', 0.75936085), ('n02134084', 'ice_bear', 0.008365493), ('n02132136', 'brown_bear', 0.007142773), ('n02509815', 'lesser_panda', 0.004107962), ('n02120079', 'Arctic_fox', 0.0040129614)]]
[[('n02510455', 'giant_panda', 0.75878924), ('n02134084', 'ice_bear', 0.00835698), ('n02132136', 'brown_bear', 0.0072079534), ('n02509815', 'lesser_panda', 0.004130396), ('n02120079', 'Arctic_fox', 0.0040213186)]]
[[('n02510455', 'giant_panda', 0.7603499), ('n02134084', 'ice_bear', 0.009082864), ('n02132136', 'brown_bear', 0.006688087), ('n02120079', 'Arctic_fox', 0.0040302738), ('n02509815', 'lesser_panda', 0.0038609721)]]
[[('n02510455', 'giant_panda', 0.7493819), ('n02132136', 'brown_bear', 0.008669576), ('n02134084', 'ice_bear', 0.008599169), ('n02509815', 'lesser_panda', 0.0042907814), ('n02120079', 'Arctic_fox', 0.0039218697)]]
[[('n02510455', 'giant_panda', 0.73992616), ('n02134084', 'ice_bear', 0.008566578), ('n02132136', 'brown_bear', 0.0071503706), ('n02120079', 'Arctic_fox', 0.005537635), ('n02133161', 'American_black_bear', 0.0039643333)]]
[[('n02510455', 'giant_panda', 0.48032713), ('n02114548', 'white_wolf', 0.024954954), ('n02120079', 'Arctic_fox', 0.016971268), ('n02395406', 'hog', 0.015805786), ('n02132136', 'brown_bear', 0.00848116)]]

EfficientNet-B7, MIOPEN_DEBUG_GCN_ASM_KERNELS=1:

[[('n02093256', 'Staffordshire_bullterrier', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02319095', 'sea_urchin', 0.0), ('n02395406', 'hog', 0.0), ('n02391049', 'zebra', 0.0)]]
[[('n15075141', 'toilet_tissue', nan), ('n02319095', 'sea_urchin', nan), ('n02395406', 'hog', nan), ('n02391049', 'zebra', nan), ('n02389026', 'sorrel', nan)]]
[[('n03482405', 'hamper', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02319095', 'sea_urchin', 0.0), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0)]]
[[('n13044778', 'earthstar', 1.0), ('n02317335', 'starfish', 6.7773486e-22), ('n04033901', 'quill', 3.4295856e-33), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0)]]
[[('n03379051', 'football_helmet', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02281787', 'lycaenid', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]
[[('n07892512', 'red_wine', 1.0), ('n02317335', 'starfish', 0.0), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]
[[('n04447861', 'toilet_seat', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02317335', 'starfish', 0.0), ('n02391049', 'zebra', 0.0), ('n02389026', 'sorrel', 0.0)]]
[[('n03314780', 'face_powder', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02281787', 'lycaenid', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]
[[('n02804610', 'bassoon', 1.0), ('n02841315', 'binoculars', 2.5764785e-13), ('n04099969', 'rocking_chair', 5.0266332e-29), ('n02328150', 'Angora', 0.0), ('n02317335', 'starfish', 0.0)]]
[[('n03887697', 'paper_towel', 1.0), ('n15075141', 'toilet_tissue', 0.0), ('n02281787', 'lycaenid', 0.0), ('n02389026', 'sorrel', 0.0), ('n02364673', 'guinea_pig', 0.0)]]

EfficientNet-B7, MIOPEN_DEBUG_GCN_ASM_KERNELS=0:

[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.003146674), ('n02133161', 'American_black_bear', 0.002262074), ('n02134084', 'ice_bear', 0.0014058463), ('n02132136', 'brown_bear', 0.0013730429)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.003146674), ('n02133161', 'American_black_bear', 0.002262073), ('n02134084', 'ice_bear', 0.0014058443), ('n02132136', 'brown_bear', 0.0013730436)]]
[[('n02510455', 'giant_panda', 0.8399879), ('n02509815', 'lesser_panda', 0.0031466729), ('n02133161', 'American_black_bear', 0.0022620677), ('n02134084', 'ice_bear', 0.0014058452), ('n02132136', 'brown_bear', 0.0013730424)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466756), ('n02133161', 'American_black_bear', 0.0022620752), ('n02134084', 'ice_bear', 0.001405845), ('n02132136', 'brown_bear', 0.0013730436)]]
[[('n02510455', 'giant_panda', 0.8399879), ('n02509815', 'lesser_panda', 0.0031466743), ('n02133161', 'American_black_bear', 0.00226207), ('n02134084', 'ice_bear', 0.0014058452), ('n02132136', 'brown_bear', 0.0013730417)]]
[[('n02510455', 'giant_panda', 0.839891), ('n02509815', 'lesser_panda', 0.003151415), ('n02133161', 'American_black_bear', 0.0022747808), ('n02134084', 'ice_bear', 0.0014124429), ('n02132136', 'brown_bear', 0.0013766055)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466784), ('n02133161', 'American_black_bear', 0.002262073), ('n02134084', 'ice_bear', 0.0014058456), ('n02132136', 'brown_bear', 0.0013730442)]]
[[('n02510455', 'giant_panda', 0.83998775), ('n02509815', 'lesser_panda', 0.0031466782), ('n02133161', 'American_black_bear', 0.002262075), ('n02134084', 'ice_bear', 0.0014058475), ('n02132136', 'brown_bear', 0.0013730454)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466756), ('n02133161', 'American_black_bear', 0.002262073), ('n02134084', 'ice_bear', 0.001405845), ('n02132136', 'brown_bear', 0.0013730442)]]
[[('n02510455', 'giant_panda', 0.8399878), ('n02509815', 'lesser_panda', 0.0031466756), ('n02133161', 'American_black_bear', 0.002262072), ('n02134084', 'ice_bear', 0.0014058456), ('n02132136', 'brown_bear', 0.0013730429)]]
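
The remaining jitter with MIOPEN_DEBUG_GCN_ASM_KERNELS=0 can be put into numbers by looking at how far the top-1 score moves between runs. The sketch below is only an illustration: the score lists are copied from the 'giant_panda' entries of the two MIOPEN_DEBUG_GCN_ASM_KERNELS=0 listings above, and the helper name `spread` is hypothetical.

```python
# Top-1 'giant_panda' scores from the ten runs of each model
# with MIOPEN_DEBUG_GCN_ASM_KERNELS=0, copied from the listings above.
b0_scores = [0.80664486, 0.75878745, 0.7587877, 0.76415765, 0.75936085,
             0.75878924, 0.7603499, 0.7493819, 0.73992616, 0.48032713]
b7_scores = [0.8399878, 0.8399878, 0.8399879, 0.8399878, 0.8399879,
             0.839891, 0.8399878, 0.83998775, 0.8399878, 0.8399878]


def spread(scores):
    """Difference between the largest and smallest score in a series."""
    return max(scores) - min(scores)


print(f"EfficientNet-B0 spread: {spread(b0_scores):.6f}")
print(f"EfficientNet-B7 spread: {spread(b7_scores):.6f}")
```

For these runs the B0 spread is roughly 0.33, dominated by the single 0.48 outlier, while the B7 spread is on the order of 0.0001.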

ekuznetsov139 commented 5 years ago

FYI, not that it helps you, but it works correctly on gfx900 (Vega 10) with rocm2.6-tf1.14-python3. [[('n02510455', 'giant_panda', 0.83479327), ('n02134084', 'ice_bear', 0.015601887), ('n02509815', 'lesser_panda', 0.0045534954), ('n02133161', 'American_black_bear', 0.0024719073), ('n02132136', 'brown_bear', 0.002070747)]]

Bengt commented 5 years ago

Hi @ekuznetsov139,

Thanks for the data point. While you are at it, could you rerun the test with rocm/tensorflow:rocm2.7-tf1.14-dev? That seems to be the current focus of development.

Regards, Bengt

ekuznetsov139 commented 5 years ago

It works correctly with that tag as well.

Though in both cases there is something odd: processing takes a very long time (around 1 minute) and GPU usage is near zero all that time. (It definitely uses the GPU, I've confirmed with HIP_TRACE_API.) Not sure if it's an anomaly or it's just that EfficientNet is not being very efficient.

Bengt commented 5 years ago

Hi, to add another data point, I can confirm that this works on gfx900 (Vega 64, Vega 10). So the issue seems to affect only gfx803. Keeping an eye on the card's GPUTach, I also noticed long idle times during the test run.

huanzhang12 commented 4 years ago

I found that the issue is caused by the ASM 1x1 kernel on gfx803: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/src/kernels/conv1x1u.s. On gfx803, I obtain the same result as on gfx906 when I disable this ASM 1x1 kernel:

MIOPEN_DEBUG_CONV_DIRECT_ASM_1X1U=0 python3 -m pytest -s test_efficientnet_gfx803.py
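
If the test is launched from a Python script rather than a shell, the same variable can be passed through the child process environment. This is a minimal sketch, assuming the variable only needs to be present in the environment of the process that loads MIOpen; the helper name `miopen_env` is hypothetical, and the script only builds the command and environment without starting anything.

```python
import os
import sys


def miopen_env(base=None, disable_asm_1x1=True):
    """Return a copy of the environment with the ASM 1x1 kernel switch set.

    `base` defaults to the current process environment. The variable name
    is the one used in the shell command above.
    """
    env = dict(os.environ if base is None else base)
    if disable_asm_1x1:
        env["MIOPEN_DEBUG_CONV_DIRECT_ASM_1X1U"] = "0"
    return env


# Same invocation as the shell command above, expressed as an argument list.
cmd = [sys.executable, "-m", "pytest", "-s", "test_efficientnet_gfx803.py"]
env = miopen_env()

# To actually start the run, one could call:
#   subprocess.run(cmd, env=env, check=False)
print(cmd[1:], env["MIOPEN_DEBUG_CONV_DIRECT_ASM_1X1U"])
```

Building a copy of the environment instead of mutating `os.environ` keeps the setting scoped to the one child process, which makes it easy to compare runs with and without the kernel.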

Recently, to avoid issues like this one, all ASM convolution kernels have been disabled on gfx803 (see https://github.com/ROCmSoftwarePlatform/MIOpen/commit/ce51a4c474541793a8dc3d08a2416e81e3d4d9dd). But this also significantly reduces gfx803 performance (for ResNet-50 it is almost twice as slow, see https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-545638754). I have a workload that becomes 10x slower on gfx803 after disabling the ASM kernels. I hope AMD can fix the bugs in the ASM kernels and re-enable them on gfx803.

ROCmSupport commented 3 years ago

Thanks for reaching out. gfx8 is no longer a supported configuration. We do not officially support gfx8 devices with ROCm and ask you to refer to the supported hardware section of the ROCm docs: https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support