leedrake5 opened 6 hours ago
In case it helps, here's what happens when I try to use the recommended docker image:
```
$ sudo docker run -it -d --name=msamp --privileged --net=host --ipc=host --gpus=all nvcr.io/nvidia/pytorch:23.10-py3 bash
Unable to find image 'nvcr.io/nvidia/pytorch:23.10-py3' locally
23.10-py3: Pulling from nvidia/pytorch
37aaf24cf781: Extracting [==================================================>] 29.54MB/29.54MB
c15d1d6b2c11: Download complete
7e97a8ec5681: Download complete
894330fe1bf5: Download complete
97707dfd1d40: Download complete
d69ae92c3e1e: Download complete
a013d53fd443: Download complete
18989d23e6f7: Download complete
53638f96ad3c: Download complete
edbefd2705db: Download complete
4a10dab4bd4c: Download complete
7ee32cc2089f: Download complete
91eeea9164ed: Download complete
7aa5209b2eba: Download complete
6729082aba49: Download complete
c926d5f5cde0: Download complete
4f4fb700ef54: Download complete
c8a736dc04ec: Download complete
07cf6ce1eca7: Download complete
9c90b8728b50: Download complete
ade437946b14: Download complete
5e8709f8c02d: Download complete
866ac4b0341d: Download complete
9d3d147186f3: Download complete
5d57a558faf6: Download complete
1373dde86157: Download complete
ad53f9124ce2: Download complete
9f11293d3693: Download complete
7d470bc79d5a: Download complete
cfb097252e12: Download complete
9d44a7cad2d3: Download complete
8771942f5e66: Download complete
3660958c4b05: Download complete
ee70577cbd50: Download complete
264099d06354: Download complete
1cd8bbadca17: Download complete
c39a5d65a5d5: Download complete
7e7320428757: Download complete
3748e1ef72fc: Download complete
0e694a487f92: Download complete
e37baf71233c: Download complete
0408d58f5552: Download complete
d53c17415131: Download complete
b55316528757: Download complete
2285bb2a2191: Download complete
1ad3a5cc4688: Download complete
776e338b632d: Download complete
cca7b16b7b04: Download complete
9c55ea83da60: Download complete
48bdaee6e86f: Download complete
77ee12f01893: Download complete
docker: failed to register layer: exit status 22: unpigz: abort: zlib version less than 1.2.3
```
As far as I can tell, my installs of pigz and zlib are up to date:
```
zlib1g-dev is already the newest version (1:1.3.dfsg-3.1ubuntu2.1).
pigz is already the newest version (2.4-1).
```
So I'm not sure what can be done to remedy the problem.
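For what it's worth, the `unpigz: abort: zlib version less than 1.2.3` message refers to the zlib that pigz links at runtime, which may differ from the packaged `zlib1g-dev`. A quick sanity-check sketch (it only reports the zlib the Python process itself loads, so treat it as a hint, not proof):

```python
import zlib

# zlib version the interpreter was compiled against...
print("compile-time zlib:", zlib.ZLIB_VERSION)
# ...and the version of the shared library actually loaded at runtime.
print("runtime zlib:", zlib.ZLIB_RUNTIME_VERSION)
```

One commonly reported workaround for this docker failure is to remove pigz temporarily (docker falls back to its built-in gzip when `unpigz` is not on PATH), though I have not verified that on this exact setup.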
Hi @leedrake5, thanks for your interest in our work!
I could not reproduce the issue.
It seems that the packages msamp_arithmetic and msamp_adamw were not copied into Python's site-packages folder. You can try copying the following *.so files into the site-packages folder manually:

```
msamp/operators/arithmetic/build/lib.linux-*/msamp_arithmetic.cpython-*.so
msamp/optim/build/lib.linux-*/msamp_adamw.cpython-*.so
```
The environment variable should also be set:

```shell
LD_PRELOAD="${THE_PATH_OF_MSAMP}/msamp/operators/dist_op/build/libmsamp_dist.so:/usr/local/lib/libnccl.so:${LD_PRELOAD}"
```
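The manual copy step above can also be sketched in Python rather than shell. This is a hypothetical helper, assuming it is run from the root of the MS-AMP checkout and that `site.getsitepackages()[0]` points at the environment you actually use:

```python
import glob
import shutil
import site

# First site-packages (dist-packages on Debian) of the active interpreter.
dest = site.getsitepackages()[0]

# Build artifacts from the MS-AMP source install; the globs match the
# platform and Python-version suffixes in the build directory names.
patterns = [
    "msamp/operators/arithmetic/build/lib.linux-*/msamp_arithmetic.cpython-*.so",
    "msamp/optim/build/lib.linux-*/msamp_adamw.cpython-*.so",
]

for pattern in patterns:
    matches = glob.glob(pattern)
    if not matches:
        print("no match for", pattern)
    for src in matches:
        shutil.copy(src, dest)
        print("copied", src, "->", dest)
```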
Thought I had it, but I didn't. When I run the following commands:
```shell
NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libnccl.so  # Change as needed
export LD_PRELOAD="/usr/local/lib/libmsamp_dist.so:${NCCL_LIBRARY}:${LD_PRELOAD}"
```
...it unfortunately breaks torch:

```python
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bly/.local/lib/python3.11/site-packages/torch/__init__.py", line 368, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/bly/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
```
I'm guessing it didn't install correctly. I don't want this to devolve into an individual operator problem, but what's more unusual is that if I don't preload the library, training works (with `accelerate config` set to msamp O2). It's slower than bf16, but loss is reasonable and RAM usage is down. So it seems to have partly installed.
I am installing for the RTX 6000 Ada; I wanted to optimize for that system to run FP8. I followed the commands to install:
I really, really wish I didn't have to use sudo to do this, but I didn't have much of an option, given that docker wasn't compatible for some reason ('zlib version less than 1.2.3', even though the latest is installed on my system) and the install rejected both virtualenv and conda environments because it doesn't like symlinks. I get no errors and everything installs (CUDA 12.4), but I always get the same error:
Which is odd, because the initial installation output tells me it is in fact installed correctly:

```
Successfully installed msamp_arithmetic-0.0.1
Successfully installed msamp_adamw-0.0.1
```
Looking into the packages, I can see that there is no 'msamp_adamw' package (or 'msamp_arithmetic' for that matter), just 'msamp'. Note that I am using ssh into the Linux system from a Mac, hence the interface.
So I am very confused - are these libraries not installed?
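One quick way to answer that (a diagnostic sketch, not an MS-AMP tool): the compiled extensions are top-level modules separate from the pure-Python `msamp` package, so you can ask importlib where, if anywhere, it would load them from:

```python
import importlib.util

# The compiled extensions need their own .so files on sys.path /
# in site-packages; they are not submodules of 'msamp'.
results = {}
for name in ("msamp_arithmetic", "msamp_adamw"):
    spec = importlib.util.find_spec(name)
    results[name] = spec
    if spec is None:
        print(name, ": NOT importable")
    else:
        print(name, ": found at", spec.origin)
```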
I can get training in FP8 to work in a limited way with Transformer Engine, but I'd really like to use MS-AMP. I just don't see a feasible way to do so.