ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

"Undefined __global__ function" on method calls from `torch.nn.functional` #572

Open aclex opened 4 years ago

aclex commented 4 years ago

Continued from https://github.com/ROCm-Developer-Tools/HIP/issues/1827, where I was advised to file the question here.

I'm experimenting with creating a proper Gentoo ebuild for a system-wide PyTorch installation, including ROCm support, here: https://github.com/aclex/pytorch-ebuild. The build itself finishes successfully, but I found that calls to pretty much any function in torch.nn.functional fail with RuntimeError: Undefined __global__ function. A minimal reproducing example:

import torch
import torch.nn.functional as F

x = torch.tensor([1., 2.], device="cuda")
v = F.relu(x)
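
(A couple of quick checks that can help confirm the ROCm build is actually the one being used; this is only a sketch, and torch.version.hip is set only on ROCm builds, so it may be absent on older versions:)

# print the PyTorch version and the HIP version it was built against (None/absent on non-ROCm builds)
python -c "import torch; print(torch.__version__, torch.version.hip)"
# check that the device is visible to PyTorch at all
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.get_device_name(0))"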

All the CUDA tests are failing as well. I've tried PyTorch versions 1.3.1 and 1.4.0, built against ROCm 3.0 installed using the ebuilds from https://github.com/justxi/rocm.
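
(By "the CUDA tests" I mean PyTorch's own test suite; a rough sketch of how it is usually run from a source checkout, assuming this version honours the PYTORCH_TEST_WITH_ROCM switch:)

# run PyTorch's test suite with the ROCm-specific paths/skips enabled
PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py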

A version built in a Docker container made from the official image (Ubuntu 16.04, ROCm 3.0) works fine; all the tests pass.

I fully realize that this is a completely unsupported path I'm trying to take, but could you please suggest any approaches to debug the problem? Thank you very much in advance!

s-bernard commented 4 years ago

I get the same error when PyTorch is compiled on Arch Linux and installed system-wide.

But the same build works if it is installed with --user.
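
(A quick way to tell which installation Python actually picks up, the system-wide one or the --user one; just a sketch:)

# install into the user site-packages instead of system-wide
python setup.py install --user
# then check which copy of torch actually gets imported
python -c "import torch; print(torch.__file__)"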

s-bernard commented 4 years ago

Well, it works; I just ran into #639.

s-bernard commented 4 years ago

@aclex Did you check whether the HIP examples work, or do they fail as described in https://github.com/rocm-arch/rocm-arch/issues/138? I think we have the same problems correctly packaging the ROCm stack on Arch Linux as you do on Gentoo.

aclex commented 4 years ago

@s-bernard I've just run the tests on my machine, and the reduction tests fail the same way you've described in https://github.com/rocm-arch/rocm-arch/issues/138:

==== reduction ====
rm -f reduction *.o
/usr/lib/hip/bin/hipcc -std=c++11 -O3 -o reduction reduction.cpp 
./reduction 1024*1024*4
ARRAYSIZE: 1024
Array size: 0.00390625 MB
The average performance of reduction is 0.000275685 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 8388608
ARRAYSIZE: 8388608
Array size: 32 MB
The average performance of reduction is 2.1926 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 16777216
ARRAYSIZE: 16777216
Array size: 64 MB
The average performance of reduction is 4.40567 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 33554432
ARRAYSIZE: 33554432
Array size: 128 MB
The average performance of reduction is 8.47254 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 67108864
ARRAYSIZE: 67108864
Array size: 256 MB
The average performance of reduction is 16.9477 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 134217728
ARRAYSIZE: 134217728
Array size: 512 MB
The average performance of reduction is 33.5576 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 268435456
ARRAYSIZE: 268435456
Array size: 1024 MB
The average performance of reduction is 64.4322 GBytes/sec
VERIFICATION: result is INCORRECT!!

./reduction 536870912
ARRAYSIZE: 536870912
Array size: 2048 MB
The average performance of reduction is 120.708 GBytes/sec
VERIFICATION: result is INCORRECT!!

Apart from that, the b+tree test from the Rodinia suite fails, too:

--TESTING: b+tree
executing: ../../test/b+tree/run0.cmd...      FAILED!

Other tests seem to pass.

I've also prepared a Docker image on top of a clean Gentoo stage3 with the ROCm packages installed, to experiment with this packaging problem, but haven't had a chance to work on it yet.

How did you manage to fix this "Undefined __global__ function" problem on Arch, by the way?

s-bernard commented 4 years ago

We did not. When installing PyTorch as a user, I don't get the __global__ function error, but a different one. The failing HIP tests tell me there is something wrong in HIP itself.

By the way, the __global__ function error is not specific to PyTorch; you can find it here as well: https://github.com/ROCmSoftwarePlatform/rocFFT/issues/267. I get the same error with rocFFT, and another thread pointed me to the HIP tests, which are failing, so I'll start by investigating HIP.

aclex commented 4 years ago

@s-bernard I think there are some special settings in PyTorch when it is installed via setup.py as a user; I meant to experiment with them, but haven't got around to it yet. So you probably work around this particular bug with the user installation, and the new error is something separate. But yes, according to the tests, HIP is obviously partially broken. By the way, I've seen the official Docker image files somewhere; I think you could try to reproduce the HIP package installation from them to find the difference.

s-bernard commented 4 years ago

Yes, I have used the Docker image to compare with the system install. It actually works in Docker, with some glitches (it crashed my graphics card during a ResNet-50 training run).

s-bernard commented 4 years ago

Hi, I fixed the reduction test by simply not stripping the final library. However, I still can't make all the included tests pass.

aclex commented 4 years ago

@s-bernard thanks for the details! I'll file an issue for the Gentoo ROCm stack to try fixing this reduction test failure, too. Which tests are still failing on your setup? Or do you mean the PyTorch tests?

aclex commented 4 years ago

@s-bernard could you please say which binary, when left unstripped, fixes the reduction test failure for you? I've checked my setup: the reduction binary from the tests is itself not stripped, yet the tests still fail. Do you mean some HIP-related binary?

s-bernard commented 4 years ago

@aclex I'm trying to pass the HIP tests (the tests directory in the HIP repository). Most of them work, but some are still failing. https://github.com/ROCm-Developer-Tools/HIP/issues/2078

The binary is the main HIP shared library, libhip_hcc.so.3.3.X. Actually, I think stripping causes a problem because the library contains GPU code that looks unused to the stripping tool.
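
(For anyone checking their own setup, "file" reports whether an ELF object has been stripped; the path below is only an example and depends on the distribution:)

# look for "not stripped" in the output
file /opt/rocm/hip/lib/libhip_hcc.so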

s-bernard commented 4 years ago

To be safe, I just don't strip any ROCm binaries at all.

aclex commented 4 years ago

@s-bernard thank you very much! Yes, indeed, given these subtle issues, it's probably safer to disable stripping altogether. I'll try disabling it and re-running the tests on my setup. Thanks again!
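
(For the record, a rough sketch of how stripping is usually disabled in the packaging itself; RESTRICT is the standard ebuild mechanism on Gentoo, and options the PKGBUILD equivalent on Arch:)

# Gentoo ebuild: tell Portage not to strip the installed files
RESTRICT="strip"

# Arch Linux PKGBUILD equivalent
options=('!strip')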

And which tests can you still not get to pass? Are they PyTorch's internal tests? Have you tried version 1.5, by the way? I can't manage to build it properly yet (the Python part somehow doesn't get installed), but I noticed a lot of ROCm-related changes there, so maybe there are some improvements as well.
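
(For reference, the usual ROCm build steps from a PyTorch source checkout look roughly like this; the hipify step via tools/amd_build/build_amd.py has to run before setup.py, and depending on the version USE_ROCM=1 may also need to be set. This is only a sketch, not verified on 1.5:)

# convert the CUDA sources to HIP in-tree (hipify step)
python tools/amd_build/build_amd.py
# then build and install the Python package
python setup.py install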

s-bernard commented 4 years ago

I won't try to make PyTorch work until I can pass all the HIP tests.

aclex commented 4 years ago

@s-bernard I see, thanks for the clarification! Judging from the reports you've referenced, this looks like a significant blocker. I'll try to reproduce your results on the Gentoo ROCm stack as well; who knows, maybe some new information will turn up, but I suspect it will fail in just the same way.

aclex commented 4 years ago

@s-bernard I've just checked on my system and can confirm your result: the reduction tests pass with unstripped HIP binaries. Here are the two logs, with stripped and unstripped HIP binaries installed: https://gist.github.com/aclex/5349be5f13ea1892b409e43c645c57ca The b+tree test still fails in both cases on my setup.