ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Building PyTorch w/o Docker ? #337

Open gateway opened 5 years ago

gateway commented 5 years ago

Hi, I'm trying to get my AMD system set up to run some torch software. I'd prefer not to have to mess with Docker — is there a reason it is required?

Is there a way to build this w/o docker?

iotamudelta commented 5 years ago

Sure, make sure that you install the dependencies as listed inside the docker files and follow the subsequent steps afterwards.

iamkucuk commented 5 years ago

Sure, make sure that you install the dependencies as listed inside the docker files and follow the subsequent steps afterwards.

An installation script may be very good and helpful. I would be grateful if you could provide one for the community!

iamkucuk commented 5 years ago

Any progress on that?

Delaunay commented 5 years ago

@iotamudelta can you point me to the Dockerfile you are referring to? Is it that one?

This is what I did to compile pytorch:

  1. Install PyTorch dependencies, rocm-dev, and a bunch of ROCm libraries. CMake will gracefully tell you which are missing.

  2. Execute ./.jenkins/caffe2/build.sh; it "hipifies" the caffe2 source code, generating the missing files required for compilation. You might be able to just run python tools/amd_build/build_amd.py, but I have not tried it alone.

  3. Compile pytorch as usual with python setup.py develop.

The compilation is still going, so I am not sure this is all I needed to do, but it looks good so far. hipcc uses a lot of memory; I had a few OOM errors, which made me restart with make -j 1.

iotamudelta commented 5 years ago

@Delaunay yes, that Dockerfile is part of it - I'd recommend using https://github.com/ROCmSoftwarePlatform/pytorch/blob/master/docker/caffe2/jenkins/build.sh with "py2-clang7-rocmdeb-ubuntu16.04" as the argument if you build your own Docker image. A standalone Dockerfile is here: https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/Dockerfile

Yes, just running python tools/amd_build/build_amd.py is sufficient to hipify the full source.

How much RAM do you have? A good rule of thumb seems to be MAX_JOBS=(RAM in GB)/4.
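
The rule of thumb above can be scripted. A minimal sketch for Linux (it reads total RAM from /proc/meminfo, so the path is Linux-specific):

```shell
# Derive MAX_JOBS from total RAM using the (RAM in GB)/4 rule of thumb.
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
ram_gb=$(( ram_kb / 1024 / 1024 ))
max_jobs=$(( ram_gb / 4 ))
# Never go below one parallel job.
if [ "$max_jobs" -lt 1 ]; then max_jobs=1; fi
export MAX_JOBS=$max_jobs
echo "MAX_JOBS=$MAX_JOBS"
```

On an 8 GB machine this lands at 1 or 2 jobs (integer division, and MemTotal reports slightly under the nominal size), consistent with the make -j 1 workaround mentioned above.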

Delaunay commented 5 years ago

I only have 8 GB on that machine. I was able to compile pytorch with ninja (without it, the installation fails), but the version I compiled is not functional.

Would you know if it is an issue with the build configuration, or if the kernel is really missing? Thanks

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'Ellesmere [Radeon RX 470/480/570/570X/580/580X]'
>>> torch.cuda.max_memory_allocated(0)
1024
>>> t = torch.zeros((10, 10, 10), dtype=torch.float32)
>>> t.cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/setepenre/rocm_pytorch/torch/tensor.py", line 70, in __repr__
    return torch._tensor_str._str(self)
  File "/home/setepenre/rocm_pytorch/torch/_tensor_str.py", line 285, in _str
    tensor_str = _tensor_str(self, indent)
  File "/home/setepenre/rocm_pytorch/torch/_tensor_str.py", line 203, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/setepenre/rocm_pytorch/torch/_tensor_str.py", line 89, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
  File "/home/setepenre/rocm_pytorch/torch/functional.py", line 222, in isfinite
    return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: No device code available for function: _Z21kernelPointwiseApply3I10TensorEQOpIfhEhffjLi1ELi1ELi1EEv10OffsetInfoIT0_T3_XT4_EES2_IT1_S4_XT5_EES2_IT2_S4_XT6_EES4_T_

iotamudelta commented 5 years ago

@Delaunay what GPU do you have? We currently need to compile specifically for a microarchitecture (changes to that are incoming). Export HCC_AMDGPU_TARGET to your uarch prior to building: either gfx803 (which we do not support well in PT; if you find issues, please report them), gfx900 (Vega64/Vega56 generation; these work well), or gfx906 (Radeon VII; this should also work well).
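
For example, for the Ellesmere/RX 470 card from the traceback above, the rebuild would be preceded by something like this (gfx803 shown; substitute gfx900 or gfx906 as appropriate for your card):

```shell
# Select the GPU microarchitecture before (re)building PyTorch.
# gfx803: Polaris (RX 470/480/570/580), gfx900: Vega, gfx906: Radeon VII.
export HCC_AMDGPU_TARGET=gfx803
echo "will build device code for $HCC_AMDGPU_TARGET"
# ...then rebuild, e.g.: python setup.py develop
```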

Delaunay commented 5 years ago

Thanks, I recompiled it overnight for gfx803. It is working now. I only have one test failing on my side. Is it supposed to? If not, I can open another ticket and gather info on it.

======================================================================
FAIL: test_multinomial_invalid_probs_cuda (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/setepenre/rocm_pytorch/test/common_utils.py", line 296, in wrapper
    method(*args, **kwargs)
  File "/home/setepenre/rocm_pytorch/test/test_cuda.py", line 2223, in test_multinomial_invalid_probs_cuda
    self._spawn_method(test_method, torch.Tensor([1, -1, 1]))
  File "/home/setepenre/rocm_pytorch/test/test_cuda.py", line 2203, in _spawn_method
    self.fail(e)
AssertionError: False

iotamudelta commented 5 years ago

Yeah, that test works for me on gfx906, so please do open a ticket. I don't have a gfx803 setup currently, but I'll try to have a look when I do and have time. In the meantime, we can discuss how to root-cause it in that ticket.

Is that the only failing test? That'd be better than I thought, to be honest.

Delaunay commented 5 years ago

This is what I got on my side overall with PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py:

  • test_autograd 919 tests in 161s (6 skipped)
  • test_cuda 154 tests in 19sec (77 skipped, 1 failed)

I also ran resnet18 & resnet50. I will do more testing later, but for now the timings look great.


For anyone stumbling upon this thread, below are the rough steps for compiling without Docker:

  1. Install ROCm here
  2. Install PyTorch dependencies (I recommend using Ninja)
  3. Install ROCm PyTorch dependencies (some might already be installed)
    • rocrand, hiprand, rocblas, miopen, miopengemm, rocfft, rocsparse, rocm-cmake, rocm-dev, rocm-device-libs, rocm-libs, hcc, hip_base, hip_hcc, hip-thrust
  4. Clone PyTorch repository
  5. 'Hipify' PyTorch source by executing python tools/amd_build/build_amd.py
  6. You can set export USE_NINJA=1 and export MAX_JOBS=N (N=(RAM in GB)/4)
  7. python setup.py [develop|install]
  8. Make sure everything is working with PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py

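The steps above can be collected into a script. A hedged sketch follows: it only writes the script out rather than running it (a full build takes hours), the repository URL is the ROCm fork referenced earlier in this thread, and steps 1-3 (ROCm itself plus the listed libraries) are assumed already done. MAX_JOBS=2 is an example value for an 8 GB machine.

```shell
# Write out a build script covering steps 4-8 above (not executed here).
cat > build_rocm_pytorch.sh <<'EOF'
#!/bin/sh
set -e
git clone --recursive https://github.com/ROCmSoftwarePlatform/pytorch  # step 4
cd pytorch
python tools/amd_build/build_amd.py                   # step 5: hipify the source
export USE_NINJA=1 MAX_JOBS=2                         # step 6: (RAM in GB)/4
python setup.py develop                               # step 7
PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py      # step 8: sanity check
EOF
chmod +x build_rocm_pytorch.sh
echo "wrote build_rocm_pytorch.sh"
```
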
iamkucuk commented 5 years ago

This is what I got on my side overall with PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py

  • test_autograd 919 tests in 161s (6 skipped)
  • test_cuda 154 tests in 19sec (77 skipped, 1 failed)

I also ran resnet18 & resnet50. I will do more testing later but for now the timings look great.

For the person stumbling upon this thread. You can find below the rough steps describing how to compile without docker:

  1. Install ROCm here
  2. Install PyTorch dependencies (I recommend using Ninja)
  3. Install ROCm PyTorch dependencies (some might already be installed)

    • rocrand, hiprand, rocblas, miopen, miopengemm, rocfft, rocsparse, rocm-cmake, rocm-dev, rocm-device-libs, rocm-libs, hcc, hip_base, hip_hcc
  4. Clone PyTorch repository
  5. 'Hipify' PyTorch source by executing python tools/amd_build/build_amd.py
  6. Pick the architecture you want to compile for by setting HCC_AMDGPU_TARGET=gfx900 (multi arch support incoming)

    • gfx906 for Radeon VII
    • gfx900 for Vega
    • gfx803 for Radeon RX 470/480/570/570X/580/580X
  7. You can set export USE_NINJA=1 and export MAX_JOBS=N (N=(RAM in GB)/4)
  8. python setup.py [develop|install]
  9. Make sure everything is working with PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py

Finally a proper answer! I can't thank you enough for this! Will try it ASAP!

iamkucuk commented 5 years ago

This is what I got on my side overall with PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py

  • test_autograd 919 tests in 161s (6 skipped)
  • test_cuda 154 tests in 19sec (77 skipped, 1 failed)

I also ran resnet18 & resnet50. I will do more testing later but for now the timings look great.

For the person stumbling upon this thread. You can find below the rough steps describing how to compile without docker:

  1. Install ROCm here
  2. Install PyTorch dependencies (I recommend using Ninja)
  3. Install ROCm PyTorch dependencies (some might already be installed)

    • rocrand, hiprand, rocblas, miopen, miopengemm, rocfft, rocsparse, rocm-cmake, rocm-dev, rocm-device-libs, rocm-libs, hcc, hip_base, hip_hcc
  4. Clone PyTorch repository
  5. 'Hipify' PyTorch source by executing python tools/amd_build/build_amd.py
  6. Pick the architecture you want to compile for by setting HCC_AMDGPU_TARGET=gfx900 (multi arch support incoming)

    • gfx906 for Radeon VII
    • gfx900 for Vega
    • gfx803 for Radeon RX 470/480/570/570X/580/580X
  7. You can set export USE_NINJA=1 and export MAX_JOBS=N (N=(RAM in GB)/4)
  8. python setup.py [develop|install]
  9. Make sure everything is working with PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py

Quick questions: I don't have any info about Ninja. Is this the package manager you are talking about? Is there documentation on how to use it, and does using pip instead of Ninja cause any trouble? Also, where can I find a ROCm alternative for the magma-cuda dependency? Or should I just ignore it?

Delaunay commented 5 years ago

Ninja is just a build system that pytorch can use to compile itself; you do not have to use it. It is explained here.

ROCm has rocblas and miopen for linear algebra and machine learning primitives, respectively. I did not see anything about Magma when I installed pytorch.

masahi commented 5 years ago

@Delaunay thanks for the info, I managed to build pytorch from source on my box! I should mention that I had to install the Thrust HIP port to build caffe2.

Delaunay commented 5 years ago

Thanks, I updated the list of dependencies.

hameerabbasi commented 5 years ago

https://github.com/ROCmSoftwarePlatform/pytorch/issues/337#issuecomment-467220107 doesn't seem to work for me. I get this error no matter what I try:

 By not providing "Findhip.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "hip", but
  CMake did not find one.

I'm willing to help debug the issue, I have all dependencies already installed.
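
One common cause of that CMake message (not confirmed in this thread, so treat this as a guess) is that CMake is not looking under the ROCm prefix for HIP's package configuration file. Pointing CMAKE_PREFIX_PATH at a conventional /opt/rocm install before re-running the build sometimes resolves it:

```shell
# Guess: help CMake locate HIP's package config under a default ROCm install.
# /opt/rocm is the conventional install prefix; adjust if yours differs.
export CMAKE_PREFIX_PATH=/opt/rocm:/opt/rocm/hip${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}
echo "CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH"
# ...then re-run the build, e.g.: python setup.py develop
```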

iotamudelta commented 5 years ago

@Delaunay could you remove step #6 pertaining to HCC_AMDGPU_TARGET? The default is multi-arch now, and it's a debug flag of the compiler that I'd rather we not continue to exploit. :-)

Delaunay commented 5 years ago

nice, I updated it

iamkucuk commented 5 years ago

@Delaunay Hi mate! I'm trying to build pytorch your way; however, I'm experiencing some issues. Here is my script, can you check it out? https://gist.github.com/iamkucuk/c8f74ec6d4f91804d6ff3d1006f26040

iotamudelta commented 5 years ago

We added documentation for host installs here: https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm#option-4-install-directly-on-host

Please note that this requires good knowledge of your operating system and its package manager, and unfortunately step 4) makes alterations to the ROCm install itself; we are hoping to fix the latter in the future.

iamkucuk commented 5 years ago

We added documentation for host installs here: https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm#option-4-install-directly-on-host

Please note that this requires good knowledge of your operating system, its package manager, and unfortunately in step 4) makes alterations to the ROCm install itself - we are hoping to fix the last in the future.

Why don't you provide a script for the full installation process? PyTorch is becoming more popular, especially in the academic world.

dagamayank commented 5 years ago

Why don't you provide a script for full installation process? PyTorch is becoming more popular, especially in academic world.

@msabony1966