ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Docker image with precompiled Pytorch fails on unit tests, gfx900 #356

Open Citronnade opened 5 years ago

Citronnade commented 5 years ago

🐛 Bug

The Docker image provided at https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Running-PyTorch-on-ROCm hits a segmentation fault in test_attribute_deletion (test_autograd.TestAutograd).

To Reproduce

Steps to reproduce the behavior:

  1. Follow instructions at https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Running-PyTorch-on-ROCm to download and start a container with pytorch.
  2. Clone the ROCm pytorch repository from https://github.com/ROCmSoftwarePlatform/pytorch.git
  3. Run PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose inside the cloned repository (there is no ~/pytorch on the container I downloaded).
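
To narrow a crash down, it can help to run only the suspect test instead of the whole suite. The following is a minimal sketch, not part of the documented workflow: it only builds the command line and environment, and it assumes that test/test_autograd.py accepts the standard unittest arguments (a test id and -v). The helper name build_single_test_run is hypothetical.

```python
import os
import sys


def build_single_test_run(test_file, test_id, python=sys.executable):
    """Return (argv, env) for running one unittest case with the ROCm flag set."""
    env = dict(os.environ)
    env["PYTORCH_TEST_WITH_ROCM"] = "1"  # same variable as in step 3
    # Assumption: the test file forwards a test id and -v to unittest.
    argv = [python, test_file, test_id, "-v"]
    return argv, env


argv, env = build_single_test_run(
    "test/test_autograd.py", "TestAutograd.test_attribute_deletion"
)
print(" ".join(argv[1:]))

# To actually run it from the repository root (not executed here):
#   import subprocess
#   returncode = subprocess.call(argv, env=env)
# On POSIX, a negative return code -N means the child was killed by signal N.
```

Running a single test this way keeps the output short and makes it easier to tell whether the crash depends on tests that ran earlier in the same process.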

Test output:

root@5965cb29c0ff:/data/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose
Test executor: ['/usr/bin/python']
Excluding c10d on ROCm
Excluding cpp_extensions on ROCm
Excluding distributed on ROCm
Excluding multiprocessing on ROCm
Excluding multiprocessing_spawn on ROCm
Excluding nccl on ROCm
Excluding thd_distributed on ROCm
Selected tests: autograd, cuda, cuda_primary_ctx, dataloader, distributions, docs_coverage, expecttest, indexing, indexing_cuda, jit, nn, numba_integration, optim, sparse, torch, type_info, type_hints, utils, namedtuple_return_api
Running test_autograd ... [2019-02-26 18:29:39.514638]
test___getitem__ (test_autograd.TestAutograd) ... ok
test___getitem___adv_index (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_beg (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_comb (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_dup (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_end (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_mid (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_sub (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_sub_2 (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_sub_3 (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_var (test_autograd.TestAutograd) ... ok
test___getitem___slice (test_autograd.TestAutograd) ... ok
test___getitem___slice_index (test_autograd.TestAutograd) ... ok
test___radd___constant (test_autograd.TestAutograd) ... ok
test___radd___scalar_constant (test_autograd.TestAutograd) ... ok
test___rdiv___constant (test_autograd.TestAutograd) ... ok
test___rdiv___scalar_constant (test_autograd.TestAutograd) ... ok
test___rmul___constant (test_autograd.TestAutograd) ... ok
test___rmul___scalar_constant (test_autograd.TestAutograd) ... ok
test___rpow___constant (test_autograd.TestAutograd) ... ok
test___rpow___scalar_constant (test_autograd.TestAutograd) ... ok
test___rsub___constant (test_autograd.TestAutograd) ... ok
test___rsub___scalar_constant (test_autograd.TestAutograd) ... ok
test_abs (test_autograd.TestAutograd) ... ok
test_abs_scalar (test_autograd.TestAutograd) ... ok
test_accumulate_grad (test_autograd.TestAutograd) ... ok
test_acos (test_autograd.TestAutograd) ... ok
test_add (test_autograd.TestAutograd) ... ok
test_add_broadcast_all (test_autograd.TestAutograd) ... ok
test_add_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_add_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_add_constant (test_autograd.TestAutograd) ... ok
test_add_scalar (test_autograd.TestAutograd) ... ok
test_add_scalar_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_add_scalar_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_add_scalar_constant (test_autograd.TestAutograd) ... ok
test_addbmm (test_autograd.TestAutograd) ... ok
test_addbmm_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addbmm_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addbmm_coef (test_autograd.TestAutograd) ... ok
test_addbmm_scalar_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addbmm_scalar_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addcdiv (test_autograd.TestAutograd) ... ok
test_addcdiv_broadcast_all (test_autograd.TestAutograd) ... ok
test_addcdiv_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcdiv_scalar (test_autograd.TestAutograd) ... ok
test_addcdiv_scalar_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addcdiv_scalar_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcdiv_scalar_scale (test_autograd.TestAutograd) ... ok
test_addcdiv_scalar_scale_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addcdiv_scalar_scale_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcdiv_scale (test_autograd.TestAutograd) ... ok
test_addcdiv_scale_broadcast_all (test_autograd.TestAutograd) ... ok
test_addcdiv_scale_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcmul (test_autograd.TestAutograd) ... ok
test_addcmul_broadcast_all (test_autograd.TestAutograd) ... ok
test_addcmul_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcmul_scalar (test_autograd.TestAutograd) ... ok
test_addcmul_scalar_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addcmul_scalar_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcmul_scalar_scale (test_autograd.TestAutograd) ... ok
test_addcmul_scalar_scale_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addcmul_scalar_scale_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addcmul_scale (test_autograd.TestAutograd) ... ok
test_addcmul_scale_broadcast_all (test_autograd.TestAutograd) ... ok
test_addcmul_scale_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_addmm (test_autograd.TestAutograd) ... ok
test_addmm_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addmm_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addmm_coef (test_autograd.TestAutograd) ... ok
test_addmm_scalar_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addmm_scalar_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addmv (test_autograd.TestAutograd) ... ok
test_addmv_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addmv_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addmv_coef (test_autograd.TestAutograd) ... ok
test_addmv_scalar_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addmv_scalar_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addr (test_autograd.TestAutograd) ... ok
test_addr_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_addr_broadcast_lhs_coef (test_autograd.TestAutograd) ... ok
test_addr_coef (test_autograd.TestAutograd) ... ok
test_anomaly_detect_nan (test_autograd.TestAutograd) ... ok
test_as_strided (test_autograd.TestAutograd) ... ok
test_asin (test_autograd.TestAutograd) ... ok
test_atan (test_autograd.TestAutograd) ... ok
test_atan2 (test_autograd.TestAutograd) ... ok
test_atan2_broadcast_all (test_autograd.TestAutograd) ... ok
test_atan2_broadcast_lhs (test_autograd.TestAutograd) ... ok
test_atan2_broadcast_rhs (test_autograd.TestAutograd) ... ok
test_atan2_scalar (test_autograd.TestAutograd) ... ok
test_atan_scalar (test_autograd.TestAutograd) ... ok
test_attribute_deletion (test_autograd.TestAutograd) ... Segmentation fault (core dumped)
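
With a long verbose log like the one above, the test that crashed is the last one whose status is not a normal unittest result. The following is a small sketch for pulling it out of a saved log; the function name last_unfinished_test and the list of "normal" statuses are assumptions made for illustration.

```python
import re

# Matches verbose unittest lines such as
# "test_abs (test_autograd.TestAutograd) ... ok"
LINE_RE = re.compile(r"^(?P<name>\w+) \((?P<cls>[\w.]+)\) \.\.\.\s*(?P<status>.*)$")

# Statuses that unittest prints for tests that ran to completion.
NORMAL = ("ok", "FAIL", "ERROR", "skipped", "expected failure", "unexpected success")


def last_unfinished_test(log_text):
    """Return (test_id, status) for the last test without a normal result, or None."""
    found = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and not m.group("status").startswith(NORMAL):
            found = ("%s.%s" % (m.group("cls"), m.group("name")), m.group("status"))
    return found


sample = """\
test_atan2_scalar (test_autograd.TestAutograd) ... ok
test_atan_scalar (test_autograd.TestAutograd) ... ok
test_attribute_deletion (test_autograd.TestAutograd) ... Segmentation fault (core dumped)
"""
print(last_unfinished_test(sample))
```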

Expected behavior

Test suite passes with no failures.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Script output:
PyTorch version: 1.1.0a0+016f212
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 16.04.5 LTS
GCC version: Could not collect
CMake version: version 3.6.3

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.1.0a0+016f212
[pip] torchvision==0.2.1
[conda] Could not collect

Additional context

GPU is Vega FE (gfx900). ROCm version is 2.1.96. The MNIST example seems to run fine, and I can also run the microbenchmarking scripts without errors.

ezyang commented 5 years ago

cc @iotamudelta

This is just a doc problem: not all tests run correctly, and you must run with PYTORCH_TEST_WITH_ROCM=1 to skip the failing tests that are known to be broken.
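
For readers unfamiliar with this kind of switch, here is a minimal sketch of how an environment variable can gate known-broken tests. It is an illustrative stand-in built on the standard unittest module, not the decorator PyTorch itself uses; skip_if_rocm, Example and run_example are hypothetical names.

```python
import functools
import os
import unittest


def skip_if_rocm(fn):
    """Skip the wrapped test when PYTORCH_TEST_WITH_ROCM=1 (illustrative stand-in)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Read the variable at call time so the behaviour follows the environment.
        if os.environ.get("PYTORCH_TEST_WITH_ROCM", "0") == "1":
            raise unittest.SkipTest("test is known to be broken on ROCm")
        return fn(*args, **kwargs)
    return wrapper


class Example(unittest.TestCase):
    @skip_if_rocm
    def test_known_broken(self):
        self.fail("stands in for a test that fails on ROCm")


def run_example():
    """Run the example case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Example)
    result = unittest.TestResult()
    suite.run(result)
    return result


os.environ["PYTORCH_TEST_WITH_ROCM"] = "1"
result = run_example()
print(len(result.skipped), len(result.failures))
```

With the variable set, the test is reported as skipped rather than failed, which matches the large skipped counts that appear in ROCm test summaries.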

iotamudelta commented 5 years ago

@Citronnade could you pull the latest docker image rocm/pytorch:rocm2.2_ubuntu16.04_pytorch and try with that? I can confirm that it passes the tests w/ PYTORCH_TEST_WITH_ROCM=1 on my Vega64 card. If you still see failures, I'd like to look at what gfx900 card you are running on. Thanks!

Citronnade commented 5 years ago

I pulled the new image and ran it with the same command as in the documentation, only substituting the new image name. The host system's ROCm version is still 2.1.96; the docker image is on 2.2.22. That test no longer segfaults, but a different test now fails, in test_cuda:

PYTORCH_TEST_WITH_ROCM=1 python run_test.py --verbose
Test executor: ['/usr/bin/python']
Excluding c10d on ROCm
Excluding cpp_extensions on ROCm
Excluding distributed on ROCm
Excluding multiprocessing on ROCm
Excluding multiprocessing_spawn on ROCm
Excluding nccl on ROCm
Excluding thd_distributed on ROCm
Selected tests: autograd, cuda, cuda_primary_ctx, dataloader, distributions, docs_coverage, expecttest, indexing, indexing_cuda, jit, nn, numba_integration, optim, sparse, torch, type_info, type_hints, utils, namedtuple_return_api
Running test_autograd ... [2019-03-01 08:22:52.436743]
test___getitem__ (test_autograd.TestAutograd) ... ok
test___getitem___adv_index (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_beg (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_comb (test_autograd.TestAutograd) ... ok
test___getitem___adv_index_dup (test_autograd.TestAutograd) ... ok

... (not included in this paste)

test_var (test_cuda.TestCuda) ... ok
test_var_large_input (test_cuda.TestCuda) ... ok
test_var_stability (test_cuda.TestCuda) ... ok
test_var_unbiased (test_cuda.TestCuda) ... ok
test_view (test_cuda.TestCuda) ... ok

======================================================================
ERROR: test_flip (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/pytorch/test/common_utils.py", line 296, in wrapper
    method(*args, **kwargs)
  File "/data/pytorch/test/test_cuda.py", line 2117, in test_flip
    _TestTorchMixin._test_flip(self, use_cuda=True)
  File "/data/pytorch/test/test_torch.py", line 7990, in _test_flip
    self.assertRaises(RuntimeError, lambda: data.flip(0, 1, 2, 3))
  File "/usr/lib/python2.7/unittest/case.py", line 473, in assertRaises
    callableObj(*args, **kwargs)
  File "/data/pytorch/test/test_torch.py", line 7990, in <lambda>
    self.assertRaises(RuntimeError, lambda: data.flip(0, 1, 2, 3))
IndexError: flip dims size out of range, got flip dims size=4

----------------------------------------------------------------------
Ran 154 tests in 22.765s

FAILED (errors=1, skipped=77)
Traceback (most recent call last):
  File "run_test.py", line 458, in <module>
    main()
  File "run_test.py", line 450, in main
    raise RuntimeError(message)
RuntimeError: test_cuda failed!

Updated output of collect_env.py:

Collecting environment information...
PyTorch version: 1.1.0a0+6706e9a
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 16.04.5 LTS
GCC version: Could not collect
CMake version: version 3.6.3

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.16.2
[pip] torch==1.1.0a0+6706e9a
[pip] torchvision==0.2.3
[conda] Could not collect

iotamudelta commented 5 years ago

@Citronnade could you confirm this is still an issue w/ ROCm 2.3? I cannot reproduce it.