Azure / MS-AMP

Microsoft Automatic Mixed Precision Library
https://azure.github.io/MS-AMP/
MIT License

Optimizer compilation fails with PyTorch 2.2 #158

Open rosario-purple opened 7 months ago

rosario-purple commented 7 months ago

What's the issue, what's expected?:

I tried to compile the MS-AMP optimizer with the new Torch 2.2:

cd msamp/optim
pip install -v .

but got this error:

    File "/scratch/brr/MS-AMP/msamp/optim/setup.py", line 7, in <module>
      from torch.utils import cpp_extension
    File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
      from torch._C import *  # noqa: F403
  ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

How to reproduce it?:

Running this code in Python reproduces the error:

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

Log message or snapshot?:

See above

Additional information:

My best guess is that this is caused by MS-AMP being pinned to an older external version of libnccl (2.17.1), while PyTorch 2.2 seems to depend on a newer one (2.19.3).
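One way to test this guess without importing torch (which fails here) is to load the NCCL shared library directly and ask it for its version. The following is a minimal sketch, not part of MS-AMP: the helper names `decode_nccl_version` and `probe_nccl` are hypothetical, and the default library name `libnccl.so.2` is an assumption about what the dynamic loader resolves on this machine. It relies on NCCL's `ncclGetVersion` C function and on `ncclCommRegister` being absent from older NCCL builds such as 2.17.1.

```python
import ctypes
from typing import Optional, Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Split NCCL's integer version code into (major, minor, patch)."""
    # NCCL up to 2.8 encodes the version as major*1000 + minor*100 + patch;
    # later releases use major*10000 + minor*100 + patch.
    if code < 10000:
        return code // 1000, (code % 1000) // 100, code % 100
    return code // 10000, (code % 10000) // 100, code % 100


def probe_nccl(lib_name: str = "libnccl.so.2") -> Optional[dict]:
    """Load an NCCL library by name and report its version and whether it
    exports ncclCommRegister. Returns None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None

    version = None
    if hasattr(lib, "ncclGetVersion"):
        # C signature: ncclResult_t ncclGetVersion(int* version)
        lib.ncclGetVersion.restype = ctypes.c_int
        lib.ncclGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
        raw = ctypes.c_int(0)
        if lib.ncclGetVersion(ctypes.byref(raw)) == 0:  # 0 means ncclSuccess
            version = decode_nccl_version(raw.value)

    return {
        "library": lib_name,
        "version": version,
        # Symbol lookup on a CDLL raises AttributeError when the export is
        # missing, so hasattr() works as an existence check.
        "has_ncclCommRegister": hasattr(lib, "ncclCommRegister"),
    }


if __name__ == "__main__":
    # Prints None if no NCCL library is found under the assumed name.
    print(probe_nccl())
```

If the reported version is 2.17.x and `has_ncclCommRegister` is false, that would be consistent with libtorch_cuda.so picking up the older library instead of the one PyTorch 2.2 was built against. Passing a full path to `probe_nccl` allows comparing several copies of the library on the same machine.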

tocean commented 7 months ago

We haven't tested MS-AMP with PyTorch 2.2. Currently we only support PyTorch 1.14 and 2.1. We recommend using our Docker image or nvcr.io/nvidia/pytorch:23.10-py3. We also plan to upgrade MSCCL to the latest version.

tocean commented 4 weeks ago

Can you share the complete steps for reproducing this issue?