ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org
Other
219 stars 51 forks

Unable to set up docker image (segmentation fault when importing torch) #668

Open skerit opened 4 years ago

skerit commented 4 years ago

🐛 Bug

Following the guide to the letter, I'm unable to get a working PyTorch build.

To Reproduce

Steps to reproduce the behavior:

  1. Follow the guide at https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm
  2. Fail

There were a few issues along the way: I had to install the libidn11 package, and the pip version supplied with the Docker image was too old to support the --progress-bar option.
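The pip problem can be caught before starting the long build. The CI script calls `pip install --progress-bar off ...` (visible in the log further down), so the pip in the image needs to list `--progress-bar` in its `pip install --help` output. This is a minimal sketch that assumes pip can be run as `python -m pip` for the interpreter in question. The function names are hypothetical.

```python
import subprocess
import sys


def help_mentions_progress_bar(help_text: str) -> bool:
    """Return True if `pip install --help` output lists the --progress-bar option."""
    return "--progress-bar" in help_text


def pip_supports_progress_bar(python: str = sys.executable) -> bool:
    """Ask the given interpreter's pip for its install help and look for --progress-bar."""
    # universal_newlines=True (rather than text=True) keeps this working on Python 3.6
    proc = subprocess.run(
        [python, "-m", "pip", "install", "--help"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # A failed pip invocation (e.g. pip missing) is treated as "not supported"
    return proc.returncode == 0 and help_mentions_progress_bar(proc.stdout)
```

If `pip_supports_progress_bar()` returns `False`, the usual fix is to upgrade pip for that same interpreter with `python -m pip install --upgrade pip`.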

But after a long time compiling, I was finally ready to run the CI tests and install the actual damn thing, when I got this error:

 (7.2.0)
+ pip_install --user mypy
+ pip install --progress-bar off --user mypy
Requirement already satisfied: mypy in /root/.local/lib/python3.6/site-packages (0.780)
Requirement already satisfied: typed-ast<1.5.0,>=1.4.0 in /root/.local/lib/python3.6/site-packages (from mypy) (1.4.1)
Requirement already satisfied: typing-extensions>=3.7.4 in /usr/local/lib/python3.6/dist-packages (from mypy) (3.7.4)
Requirement already satisfied: mypy-extensions<0.5.0,>=0.4.3 in /root/.local/lib/python3.6/site-packages (from mypy) (0.4.3)
++ /usr/bin/python3.6 -c 'import sys; print(int(sys.version_info >= (3, 3)))'
+ [[ ! 1 == \1 ]]
+ [[ py3.6-clang7-rocmdeb-ubuntu18.04 == *asan* ]]
+ [[ py3.6-clang7-rocmdeb-ubuntu18.04 == *-NO_AVX-* ]]
+ [[ py3.6-clang7-rocmdeb-ubuntu18.04 == *-NO_AVX2-* ]]
+ '[' -n '' ']'
+ [[ py3.6-clang7-rocmdeb-ubuntu18.04 == *libtorch* ]]
+ cd test
+ /usr/bin/python3.6 -c 'import torch; print(torch.__config__.show())'
.jenkins/pytorch/test.sh: line 257: 107176 Segmentation fault      (core dumped) /usr/bin/python3.6 -c "import torch; print(torch.__config__.show())"
+ cleanup
+ retcode=139
+ set +x

So, a segmentation fault as soon as torch is imported?
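The `retcode=139` in the log is how the shell reports a process that was killed by signal 11 (SIGSEGV), since 128 + 11 = 139. To find out where in Python the crash happens, the import can be rerun in a child interpreter with `-X faulthandler`. That flag makes Python dump its Python-level traceback when the process receives a fatal signal. This is a minimal sketch, and `probe` is a hypothetical helper name.

```python
import signal
import subprocess
import sys


def probe(code: str) -> int:
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Returns the exit status the way a POSIX shell would report it:
    a child killed by signal N shows up as 128 + N (139 for SIGSEGV).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6-compatible spelling of text=True
    )
    if proc.returncode < 0:
        # subprocess encodes "killed by signal N" as a negative return code -N
        signum = -proc.returncode
        print("child killed by", signal.Signals(signum).name)
        # faulthandler wrote the Python-level traceback of the crash to stderr
        print(proc.stderr)
        return 128 + signum
    return proc.returncode
```

On a machine affected by this issue, you would expect `probe("import torch; print(torch.__config__.show())")` to return 139, matching the log. It should also print faulthandler's traceback, which shows the Python line that was executing when the crash happened.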

Expected behavior

I expect the steps to just work; I don't see why a guide for a Docker image can't be reproduced.

Environment

Additional context

Laser-Cat commented 4 years ago

My guess is that this problem has something to do with Navi (including the 5700 XT). Navi is not supported by ROCm (yet?).

https://github.com/RadeonOpenCompute/ROCm/issues/887
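To see whether a card falls into the Navi family mentioned above, you can read the GPU's `gfx` target name from `rocminfo`, which lists a name for each agent. For example, Navi 10 cards such as the RX 5700 XT report `gfx1010`, and Polaris cards such as the RX 580 report `gfx803`. This is a minimal sketch that assumes `rocminfo` from the ROCm install is on `PATH`. Treating every `gfx10xx` target as Navi matches the cards of this issue's era and is a simplification. The helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import List

# gfx target names look like gfx803, gfx906, gfx1010 (hex digits after "gfx")
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]+\b")


def gfx_targets(rocminfo_text: str) -> List[str]:
    """Return the distinct gfx target names found in rocminfo output, in order."""
    seen: List[str] = []
    for name in GFX_PATTERN.findall(rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def is_navi(target: str) -> bool:
    """Simplification: treat every gfx10xx target as Navi."""
    return target.startswith("gfx10")


def local_targets() -> List[str]:
    """Run rocminfo if it is installed and return its gfx targets; [] otherwise."""
    if shutil.which("rocminfo") is None:
        return []
    proc = subprocess.run(["rocminfo"], stdout=subprocess.PIPE, universal_newlines=True)
    return gfx_targets(proc.stdout)
```

Comparing `[t for t in local_targets() if is_navi(t)]` across machines could help separate Navi-specific failures from ones like the RX 580 report below.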

skerit commented 4 years ago

Oh really? That's a shame :(

devksingh4 commented 4 years ago

I have also previously encountered this error on an RX580.