leedrake5 opened 6 hours ago
In case it helps, here's what happens when I try to use the recommended docker image:
```
$ sudo docker run -it -d --name=msamp --privileged --net=host --ipc=host --gpus=all nvcr.io/nvidia/pytorch:23.10-py3 bash
Unable to find image 'nvcr.io/nvidia/pytorch:23.10-py3' locally
23.10-py3: Pulling from nvidia/pytorch
37aaf24cf781: Extracting [==================================================>] 29.54MB/29.54MB
c15d1d6b2c11: Download complete
7e97a8ec5681: Download complete
894330fe1bf5: Download complete
97707dfd1d40: Download complete
d69ae92c3e1e: Download complete
a013d53fd443: Download complete
18989d23e6f7: Download complete
53638f96ad3c: Download complete
edbefd2705db: Download complete
4a10dab4bd4c: Download complete
7ee32cc2089f: Download complete
91eeea9164ed: Download complete
7aa5209b2eba: Download complete
6729082aba49: Download complete
c926d5f5cde0: Download complete
4f4fb700ef54: Download complete
c8a736dc04ec: Download complete
07cf6ce1eca7: Download complete
9c90b8728b50: Download complete
ade437946b14: Download complete
5e8709f8c02d: Download complete
866ac4b0341d: Download complete
9d3d147186f3: Download complete
5d57a558faf6: Download complete
1373dde86157: Download complete
ad53f9124ce2: Download complete
9f11293d3693: Download complete
7d470bc79d5a: Download complete
cfb097252e12: Download complete
9d44a7cad2d3: Download complete
8771942f5e66: Download complete
3660958c4b05: Download complete
ee70577cbd50: Download complete
264099d06354: Download complete
1cd8bbadca17: Download complete
c39a5d65a5d5: Download complete
7e7320428757: Download complete
3748e1ef72fc: Download complete
0e694a487f92: Download complete
e37baf71233c: Download complete
0408d58f5552: Download complete
d53c17415131: Download complete
b55316528757: Download complete
2285bb2a2191: Download complete
1ad3a5cc4688: Download complete
776e338b632d: Download complete
cca7b16b7b04: Download complete
9c55ea83da60: Download complete
48bdaee6e86f: Download complete
77ee12f01893: Download complete
docker: failed to register layer: exit status 22: unpigz: abort: zlib version less than 1.2.3
```
As far as I can tell, my installs of pigz and zlib are up to date:
```
zlib1g-dev is already the newest version (1:1.3.dfsg-3.1ubuntu2.1).
pigz is already the newest version (2.4-1).
```
So I'm not sure what can be done to remedy the problem.
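For what it's worth, the `unpigz: abort: zlib version less than 1.2.3` message refers to the zlib that pigz links at runtime, which may differ from the packaged `zlib1g-dev`. A quick sanity-check sketch (it only reports the zlib the Python process itself loads, so treat it as a hint, not proof):

```python
import zlib

# zlib version the interpreter was compiled against...
print("compile-time zlib:", zlib.ZLIB_VERSION)
# ...and the version of the shared library actually loaded at runtime.
print("runtime zlib:", zlib.ZLIB_RUNTIME_VERSION)
```

One commonly reported workaround for this docker failure is to remove pigz temporarily (docker falls back to its built-in gzip when `unpigz` is not on PATH), though I have not verified that on this exact setup.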
Hi @leedrake5, thanks for your interest in our work!
I could not reproduce the issue.
It seems that the packages msamp_arithmetic and msamp_adamw were not copied into Python's site-packages folder. You can try copying the following *.so files into the site-packages folder manually:

```
msamp/operators/arithmetic/build/lib.linux-*/msamp_arithmetic.cpython-*.so
msamp/optim/build/lib.linux-*/msamp_adamw.cpython-*.so
```
The environment variable should also be set:

```shell
LD_PRELOAD="${THE_PATH_OF_MSAMP}/msamp/operators/dist_op/build/libmsamp_dist.so:/usr/local/lib/libnccl.so:${LD_PRELOAD}"
```
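The manual copy step above can also be sketched in Python rather than shell. This is a hypothetical helper, assuming it is run from the root of the MS-AMP checkout and that `site.getsitepackages()[0]` points at the environment you actually use:

```python
import glob
import shutil
import site

# First site-packages (dist-packages on Debian) of the active interpreter.
dest = site.getsitepackages()[0]

# Build artifacts from the MS-AMP source install; the globs match the
# platform and Python-version suffixes in the build directory names.
patterns = [
    "msamp/operators/arithmetic/build/lib.linux-*/msamp_arithmetic.cpython-*.so",
    "msamp/optim/build/lib.linux-*/msamp_adamw.cpython-*.so",
]

for pattern in patterns:
    matches = glob.glob(pattern)
    if not matches:
        print("no match for", pattern)
    for src in matches:
        shutil.copy(src, dest)
        print("copied", src, "->", dest)
```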
Thought I had it, but I didn't. When I run the following commands:
```shell
NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libnccl.so  # Change as needed
export LD_PRELOAD="/usr/local/lib/libmsamp_dist.so:${NCCL_LIBRARY}:${LD_PRELOAD}"
```
...it unfortunately breaks torch:

```python
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bly/.local/lib/python3.11/site-packages/torch/__init__.py", line 368, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/bly/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
```
I'm guessing it didn't install correctly. I don't want this to devolve into an individual operator problem, but what's more unusual is that if I don't preload the library, training works (with `accelerate config` set to msamp O2). It's slower than bf16, but loss is reasonable and RAM usage is down. So it seems to have partly installed.
I am installing for the RTX 6000 Ada; I wanted to optimize for that system to run FP8. I followed the commands to install:
I really, really wish I didn't have to use sudo to do this, but I didn't have much of an option, given that docker wasn't compatible for some reason ('zlib version less than 1.2.3', even though the latest is installed on my system) and the install rejected both virtualenv and conda environments because it doesn't like symlinks. I get no errors and everything installs (CUDA 12.4), but I always get the same error:
Which is odd, because the initial installation output tells me it is in fact installed correctly:

```
Successfully installed msamp_arithmetic-0.0.1
Successfully installed msamp_adamw-0.0.1
```
Looking into the packages, I can see that there is no 'msamp_adamw' package (or 'msamp_arithmetic' for that matter), just 'msamp'. Note that I am using ssh into the Linux system from a Mac, hence the interface.
So I am very confused - are these libraries not installed?
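One quick way to answer that (a diagnostic sketch, not an MS-AMP tool): the compiled extensions are top-level modules separate from the pure-Python `msamp` package, so you can ask importlib where, if anywhere, it would load them from:

```python
import importlib.util

# The compiled extensions need their own .so files on sys.path /
# in site-packages; they are not submodules of 'msamp'.
results = {}
for name in ("msamp_arithmetic", "msamp_adamw"):
    spec = importlib.util.find_spec(name)
    results[name] = spec
    if spec is None:
        print(name, ": NOT importable")
    else:
        print(name, ": found at", spec.origin)
```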
I can get training in FP8 to work in a limited way with Transformer Engine, but I'd really like to use MS-AMP. I just don't see a feasible way to do so.