Open Citronnade opened 5 years ago
cc @iotamudelta
This is just a doc problem; not all tests pass, and you must run with PYTORCH_TEST_WITH_ROCM=1
to skip failing tests that are known to be broken.
@Citronnade could you pull the latest docker image rocm/pytorch:rocm2.2_ubuntu16.04_pytorch
and try with that? I can confirm that it passes the tests w/ PYTORCH_TEST_WITH_ROCM=1
on my Vega64 card. If you still see failures, I'd like to look at what gfx900 card you are running on. Thanks!
I pulled the new image and ran it using the same command as in the documentation, but just referring to the new image instead. Host system ROCm version is still 2.1.96, docker image is on 2.2.22. I no longer segfault on that test, but instead I fail another test, in test_cuda:
PYTORCH_TEST_WITH_ROCM=1 python run_test.py --verbose
Test executor: ['/usr/bin/python']
Excluding c10d on ROCm
Excluding cpp_extensions on ROCm
Excluding distributed on ROCm
Excluding multiprocessing on ROCm
Excluding multiprocessing_spawn on ROCm
Excluding nccl on ROCm
Excluding thd_distributed on ROCm
Selected tests: autograd, cuda, cuda_primary_ctx, dataloader, distributions, docs_coverage, expecttest, indexing, indexing_cuda, jit, nn, numba_integration, optim, sparse, torch, type_info, type_hints, utils, namedtuple_return_api
Running test_autograd ... [2019-03-01 08:22:52.436743]
test___getitem__ (test_autograd.TestAutograd) ... ok
test___getitem___adv_index (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_beg (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_comb (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_dup (test_autograd.TestAutograd) ... ok
... (not included in this paste)
test_var (test_cuda.TestCuda) ... ok
test_var_large_input (test_cuda.TestCuda) ... ok
test_var_stability (test_cuda.TestCuda) ... ok
test_var_unbiased (test_cuda.TestCuda) ... ok
test_view (test_cuda.TestCuda) ... ok
======================================================================
ERROR: test_flip (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/pytorch/test/common_utils.py", line 296, in wrapper
method(*args, **kwargs)
File "/data/pytorch/test/test_cuda.py", line 2117, in test_flip
_TestTorchMixin._test_flip(self, use_cuda=True)
File "/data/pytorch/test/test_torch.py", line 7990, in _test_flip
self.assertRaises(RuntimeError, lambda: data.flip(0, 1, 2, 3))
File "/usr/lib/python2.7/unittest/case.py", line 473, in assertRaises
callableObj(*args, **kwargs)
File "/data/pytorch/test/test_torch.py", line 7990, in <lambda>
self.assertRaises(RuntimeError, lambda: data.flip(0, 1, 2, 3))
IndexError: flip dims size out of range, got flip dims size=4
----------------------------------------------------------------------
Ran 154 tests in 22.765s
FAILED (errors=1, skipped=77)
Traceback (most recent call last):
File "run_test.py", line 458, in <module>
main()
File "run_test.py", line 450, in main
raise RuntimeError(message)
RuntimeError: test_cuda failed!
Updated output of collect_env.py:
Collecting environment information...
PyTorch version: 1.1.0a0+6706e9a
Is debug build: No
CUDA used to build PyTorch: Could not collect
OS: Ubuntu 16.04.5 LTS
GCC version: Could not collect
CMake version: version 3.6.3
Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.16.2
[pip] torch==1.1.0a0+6706e9a
[pip] torchvision==0.2.3
[conda] Could not collect
@Citronnade could you confirm this is still an issue w/ ROCm 2.3? I cannot reproduce it.
🐛 Bug
The docker image provided in https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Running-PyTorch-on-ROCm experiences a segmentation fault on test_attribute_deletion (test_autograd.TestAutograd).
To Reproduce
Steps to reproduce the behavior:
PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose
inside the cloned repository (there is no ~/pytorch in the container I downloaded). Test output:
Expected behavior
Test suite passes with no failures.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
Script output:
PyTorch version: 1.1.0a0+016f212
Is debug build: No
CUDA used to build PyTorch: Could not collect
OS: Ubuntu 16.04.5 LTS
GCC version: Could not collect
CMake version: version 3.6.3
Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.1.0a0+016f212
[pip] torchvision==0.2.1
[conda] Could not collect
Additional context
GPU is a Vega FE (gfx900). ROCm version is 2.1.96. The MNIST example seems to run fine, and I can also run the microbenchmarking scripts without errors.