fzmushko opened this issue 1 week ago
I have also tried a new environment with torch==2.3.0, which was used in the original paper (install.sh script). In this case I encounter another error:
Traceback (most recent call last):
File "/extra_disk_1/zmushko-fa/test_ist.py", line 2, in <module>
from ista_daslab_optimizers import MicroAdam
File "/home/zmushko-fa/miniconda3/envs/microadam/lib/python3.9/site-packages/ista_daslab_optimizers/__init__.py", line 2, in <module>
from .micro_adam import *
File "/home/zmushko-fa/miniconda3/envs/microadam/lib/python3.9/site-packages/ista_daslab_optimizers/micro_adam/__init__.py", line 1, in <module>
from .micro_adam import MicroAdam
File "/home/zmushko-fa/miniconda3/envs/microadam/lib/python3.9/site-packages/ista_daslab_optimizers/micro_adam/micro_adam.py", line 7, in <module>
from ..tools import get_first_device, get_gpu_mem_usage, block_split, CopyDirection
File "/home/zmushko-fa/miniconda3/envs/microadam/lib/python3.9/site-packages/ista_daslab_optimizers/tools.py", line 6, in <module>
import ista_daslab_tools
ImportError: /home/zmushko-fa/miniconda3/envs/microadam/lib/python3.9/site-packages/ista_daslab_tools.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c1021throwNullDataPtrErrorEv
I will also duplicate this issue in the ista-daslab repository.
Hi! Thank you for reaching out! I will have a deeper look at this starting next week and will try to solve it as soon as possible.
> I have also tried a new environment with torch==2.3.0, which was used in the original paper (install.sh script). In this case I encounter another error: [...] ImportError: [...] undefined symbol: _ZN3c1021throwNullDataPtrErrorEv
In your initial message I see you are using CUDA 12.1. In our development we used CUDA 12.2. If you have access to a cluster, please try running module load cuda/12.2 or module load cuda/12.4 and then activate your environment. This error happens because the CUDA kernels for MicroAdam were built using CUDA 12.2, which includes some fixes over CUDA 12.1.
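To compare versions quickly, you can print the CUDA toolkit your installed torch wheel was built against (standard PyTorch attributes, nothing MicroAdam-specific):

```python
import torch

print(torch.__version__)       # e.g. '2.4.1+cu121' -- the wheel's CUDA build tag
print(torch.version.cuda)      # CUDA toolkit version torch was compiled with
print(torch.cuda.is_available())
```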
> To sum up:
> 1) Looks like the optimizer doesn't work with float32 master weights. Is it supposed to be so?
> 2) Looks like there is a bug when using a device other than cuda:0.

- MicroAdam currently supports bfloat16 because our main focus was to reduce memory usage, as stated in the paper. Please use our optimizer with dtype=torch.bfloat16.
- I suggest setting your CUDA_VISIBLE_DEVICES accordingly. For example, if your system has 8 GPUs and you want to use GPU 1, please run your program using CUDA_VISIBLE_DEVICES=1 python main.py and keep device=cuda:0 (see the sketch below). I believe this should be a quick fix before I try to reproduce the error.

Thank you for opening this issue and please let me know how it works! I am happy to help!
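To illustrate the CUDA_VISIBLE_DEVICES suggestion: with the mask set, the one visible physical GPU is renumbered to cuda:0 inside the process. A quick check using only standard PyTorch calls (the file name check_devices.py is just an example):

```python
# Run as: CUDA_VISIBLE_DEVICES=1 python check_devices.py
import torch

print(torch.cuda.device_count())      # 1 -- only physical GPU 1 is visible
print(torch.cuda.get_device_name(0))  # reports the name of physical GPU 1
x = torch.ones(4, device='cuda:0')    # allocated on physical GPU 1
print(x.device)                       # cuda:0
```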
> MicroAdam currently supports bfloat16 because our main focus was to reduce memory usage, as stated in the paper. Please use our optimizer with dtype=torch.bfloat16.
Ah, I see. I just didn't notice that mentioned in either the article or the README (though perhaps I wasn't paying enough attention). I assumed that float32 support would be present, because pure bf16 training sometimes leads to worse results without advanced techniques such as stochastic rounding.

However, I tried running the same code with float32 weights and manually casting the gradients to bf16 after loss.backward(). In this case the code doesn't break, and I suppose this approach might make sense: before the final bf16 update is added to the fp32 weights, it is first cast back to fp32, which could keep the weight update accurate enough.
> I suggest setting your CUDA_VISIBLE_DEVICES accordingly. For example, if your system has 8 GPUs and you want to use GPU 1, please run your program using CUDA_VISIBLE_DEVICES=1 python main.py and keep device=cuda:0. I believe this should be a quick fix before I try to reproduce the error.
It works, thank you.
I agree, manually casting the gradient to bfloat16 works. We did this in the FFCV repository, where we had some issues with mixed precision. I will mention the bfloat16 format in the README; I think we missed that. Thank you for pointing this out! Please let me know whether there are any other issues I can help with.
I am trying to use the MicroAdam optimizer, but I face crashes when trying to perform optimizer.step().

Setup: empty conda environment, only torch and ista-daslab-optimizers are installed via pip install torch ista-daslab-optimizers. torch 2.4.1, ista-daslab-optimizers 1.1.3, CUDA 12.1, Python 3.9, A100-SXM4-80GB.

I run the following code and receive the following error:
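A minimal sketch of the failing setup (the toy model and shapes are illustrative and the MicroAdam hyperparameters follow the README example, not my exact script):

```python
import torch
from ista_daslab_optimizers import MicroAdam

dtype = torch.float32   # switching to torch.bfloat16 avoids the crash
device = 'cuda:0'       # switching to 'cuda:1' triggers the second error

model = torch.nn.Linear(512, 512).to(device=device, dtype=dtype)
optimizer = MicroAdam(model.parameters(), m=10, lr=1e-5,
                      quant_block_size=100_000, k_init=0.01)

x = torch.randn(32, 512, device=device, dtype=dtype)
loss = model(x).square().mean()
loss.backward()
optimizer.step()        # crashes here with float32 weights
```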
The same error occurs when I try to run it in mixed precision with torch.autocast.

Since it says something about BFloat16, I have set dtype = torch.bfloat16 and this code works. However, if I change the device from cuda:0 to cuda:1, I again encounter an error:

Running with CUDA_LAUNCH_BLOCKING=1 returns:
To sum up:
1) Looks like the optimizer doesn't work with float32 master weights. Is it supposed to be so?
2) Looks like there is a bug when using a device other than cuda:0.