N8-CIR-Bede / documentation

Documentation for the N8CIR Bede Tier 2 HPC facility
https://bede-documentation.readthedocs.io/en/latest/

Grace-Hopper Pytorch wheels with CUDA support #199

Closed ptheywood closed 1 month ago

ptheywood commented 5 months ago

It looks like aarch64 torch wheels with cuda support may be coming to pytorch 2.4. This is planned to be released in July 2024.

Once this is released, we should check if aarch64 wheels include cuda support, and if so update the aarch64 documentation accordingly.

https://github.com/pytorch/builder/pull/1775

willfurnass commented 5 months ago

Good news. Owain Kenway's Twitter feed has been full of pained remarks about how tricky it is to build pytorch wheels on GH for several weeks now.

owainkenwayucl commented 5 months ago

It's easy to build but hard to replicate the performance of the wheel from the NGC container.

Currently my builds are about 10% slower running Stable Diffusion XL inference and I can't work out why.

PerilousApricot commented 5 months ago

Hi - @owainkenwayucl do you have some documents somewhere on how you accomplished that? I'm happy to eat the 10% perf hit if it keeps me out of trying to decipher incredibly confusing compiler errors

ptheywood commented 3 months ago

Nightly Pytorch 2.4 and 2.5 builds using CUDA 12.4 via pip include linux-aarch64 builds with CUDA support (nightly channel pytorch list). The wheels are very large (2354.8 MB for python 3.9), as they bundle all the cuda deps rather than depending on an external package.

CUDA 11.8 and 12.2 nightly builds still do not include CUDA support.

Conda nightly packages do not include linux-aarch64 builds at all, just osx-arm64, linux-64 and win-64, but installing via pip into a conda env seems to behave.

python3 -m venv venv-pytorch-nightly
source venv-pytorch-nightly/bin/activate
python3 -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
python3 -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_arch_list())"
/path/to/venv-pytorch-nightly/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
2.5.0.dev20240621+cu124
True
['sm_50', 'sm_80', 'sm_86', 'sm_89', 'sm_90', 'sm_90a']

The installation instructions can be found in the pytorch getting started tool, but it doesn't say which platforms are available etc.

https://pytorch.org/get-started/locally/
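The version string printed above carries a local tag (`+cu124`) naming the CUDA variant of the build. As a rough sketch, it can be split off with the standard library alone. `TorchVersion` and `split_torch_version` are hypothetical helper names, not part of torch. A missing tag is not treated as proof of a CPU-only build, so `torch.cuda.is_available()` remains the real check.

```python
from typing import NamedTuple, Optional


class TorchVersion(NamedTuple):
    """Hypothetical container for the two halves of a torch version string."""
    release: str               # e.g. "2.5.0.dev20240621"
    local_tag: Optional[str]   # e.g. "cu124", or None when there is no "+..." suffix


def split_torch_version(version: str) -> TorchVersion:
    """Split 'X.Y.Z+tag' into release and local tag (PEP 440 local version)."""
    release, sep, local = version.partition("+")
    return TorchVersion(release, local if sep else None)


if __name__ == "__main__":
    # Version string taken from the nightly install output in this comment.
    parsed = split_torch_version("2.5.0.dev20240621+cu124")
    print(parsed.release, parsed.local_tag)
```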

ptheywood commented 2 months ago

Pytorch 2.4.0 was released on 2024-07-24.

This should include ARM + CUDA builds (at least for CUDA 12.4?)

Todo

owainkenwayucl commented 1 month ago

Just a note - on RHEL 9 presently with Python 3.9, you don't get the CUDA-enabled Pytorch 2.4.0 from https://download.pytorch.org/whl/cu124, you get CPU-only Pytorch 2.0.1, unless you pin torch==2.4.0.

Without:

Installing collected packages: mpmath, urllib3, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, idna, filelock, charset-normalizer, certifi, requests, jinja2, torch, torchvision, torchaudio
Successfully installed MarkupSafe-2.1.5 certifi-2022.12.7 charset-normalizer-2.1.1 filelock-3.13.1 idna-3.4 jinja2-3.1.3 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.3 pillow-10.2.0 requests-2.28.1 sympy-1.12 torch-2.0.1 torchaudio-2.0.2 torchvision-0.15.2 typing-extensions-4.9.0 urllib3-1.26.13
(py24test3) [uccaoke@locust uccaoke]$ python3
Python 3.9.18 (main, Jan  4 2024, 00:00:00) 
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> 
(py24test3) [uccaoke@locust uccaoke]$ pip list
Package            Version
------------------ ---------
certifi            2022.12.7
charset-normalizer 2.1.1
filelock           3.13.1
idna               3.4
Jinja2             3.1.3
MarkupSafe         2.1.5
mpmath             1.3.0
networkx           3.2.1
numpy              1.26.3
pillow             10.2.0
pip                24.2
requests           2.28.1
setuptools         72.2.0
sympy              1.12
torch              2.0.1
torchaudio         2.0.2
torchvision        0.15.2
typing_extensions  4.9.0
urllib3            1.26.13
wheel              0.36.2
(py24test3) [uccaoke@locust uccaoke]$ 

However,

pip3 install torch==2.4.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Works.

It's not clear to me why this is.
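Since pip can quietly resolve to an older CPU-only build like this, a small post-install check may be worth adding to setup scripts. This is only a sketch: `check_torch_install` is a hypothetical helper, and the `(2, 4)` minimum is an assumption based on the versions discussed in this thread.

```python
from typing import List, Tuple


def check_torch_install(version: str, cuda_available: bool,
                        minimum: Tuple[int, int] = (2, 4)) -> List[str]:
    """Return a list of problems found with an installed torch (empty if none)."""
    problems = []
    # Drop any local tag such as "+cu124", then compare major.minor only.
    major, minor = (int(part) for part in version.split("+")[0].split(".")[:2])
    if (major, minor) < minimum:
        problems.append(
            f"torch {version} is older than {minimum[0]}.{minimum[1]}; "
            "pip may have resolved an unexpected wheel"
        )
    if not cuda_available:
        problems.append("torch.cuda.is_available() is False")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        for problem in check_torch_install(torch.__version__,
                                           torch.cuda.is_available()):
            print("WARNING:", problem)
```

With the values from the session above (`2.0.1`, CUDA unavailable) it reports both problems; with `2.4.0` and CUDA available it reports none.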

ptheywood commented 1 month ago

Bede's Grace-Hopper nodes are currently Rocky 9.4, where pip fetches 2.4.0 using the system python 3.9.18 (using --no-cache-dir to make sure it wasn't re-using one I'd installed explicitly beforehand).

(.venv) [pheywood@gh001.bede gh-pytorch]$ python3 
Python 3.9.18 (main, May 16 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
/nobackup/projects/bdshe01/pheywood/aarch64/gh-pytorch/.venv/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.cuda.is_available()
True
>>>
(.venv) [pheywood@gh001.bede gh-pytorch]$ pip list 
Package           Version
----------------- --------
filelock          3.13.1
fsspec            2024.2.0
Jinja2            3.1.3
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.2.1
pip               21.2.3
setuptools        53.0.0
sympy             1.12
torch             2.4.0
typing_extensions 4.9.0

pip/setuptools being super old here doesn't seem to be the cause of the difference, as it still fetches 2.4.0 after upgrading both in a fresh venv. Unsure what else could be causing RHEL's pip to resolve the wrong version.

The same applies when requesting torchaudio and torchvision, not just torch above.
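One way to narrow this down might be to dump the environment details that can plausibly influence which wheel pip selects, run it on both machines and diff the output. This is a sketch using only the standard library. `resolver_facts` is a hypothetical name, and the choice of fields is a guess at what matters rather than a confirmed cause.

```python
import os
import platform
import sys
import sysconfig
from importlib import metadata


def resolver_facts() -> dict:
    """Collect interpreter and platform details to compare between two hosts."""
    facts = {
        "python": platform.python_version(),
        "implementation": sys.implementation.name,
        "machine": platform.machine(),
        "platform": sysconfig.get_platform(),
        "libc": " ".join(platform.libc_ver()),
        # Names of any PIP_* variables, which can change index or resolver behaviour.
        "pip_env_vars": ",".join(sorted(k for k in os.environ if k.startswith("PIP_"))),
    }
    for tool in ("pip", "setuptools"):
        try:
            facts[tool] = metadata.version(tool)
        except metadata.PackageNotFoundError:
            facts[tool] = "not installed"
    return facts


if __name__ == "__main__":
    for key, value in sorted(resolver_facts().items()):
        print(f"{key}: {value}")
```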

owainkenwayucl commented 1 month ago

Isn't Python packaging wonderful? :D