Open. generalsvr opened this issue 11 months ago.
This is a known issue at the moment. I also ran into it when attempting regular LoRA.
I would recommend opening an issue on peft.
I'm hitting the same error. I can also confirm that qLoRA works fine.
The only thing I can get working is the provided 4-bit QLoRA config. Anything else, whether LoRA, QLoRA (8-bit, etc.), or fp16, results in OOM or some other error. I am working on a 5-node cluster where each node has 8 x H100.
I was able to fix this error:
RuntimeError: Output 0 of MatMul8bitLtBackward is a view and is being modified inplace.
with the change in https://github.com/huggingface/peft/pull/1372/files.
After that, I ran into a different error:
Traceback (most recent call last):
  File "/home/demo.py", line 43, in <module>
    loss.backward()
  File "/home/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/home/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 485, in backward
    .mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 0
Has anyone successfully run 8-bit LoRA fine-tuning for Mixtral with 80GB? Thank you.
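For anyone reading the traceback above: the failing line is an in-place multiply whose two operands disagree at dimension 0 (32 vs 8). Below is a minimal pure-Python sketch of the standard broadcasting rule that PyTorch applies, just to show why that combination is rejected. Only the 32 and the 8 come from the error message. The hidden size of 16, the helper names, and the assumption that state.SCB is one-dimensional are made up for illustration, and nothing here calls bitsandbytes or reproduces its internals.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError.

    Shapes are aligned from the trailing dimension. Each pair of sizes
    must be equal, or one of them must be 1.
    """
    result = []
    for dim_a, dim_b in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if dim_a != dim_b and 1 not in (dim_a, dim_b):
            raise ValueError(f"size {dim_a} does not match size {dim_b}")
        # A size of 1 stretches to match the other operand.
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))


def inplace_mul_ok(target, other):
    """An in-place op such as mul_ must leave the target's shape unchanged."""
    try:
        return broadcast_shape(target, other) == tuple(target)
    except ValueError:
        return False


# Hypothetical shapes: only the 32 and the 8 are taken from the error.
HIDDEN = 16
scb_unsqueezed = (8, 1)  # assumed shape of state.SCB.unsqueeze(1)

print(inplace_mul_ok((8, HIDDEN), scb_unsqueezed))   # True: row counts agree
print(inplace_mul_ok((32, HIDDEN), scb_unsqueezed))  # False: 32 vs 8 at dimension 0
```

So the scale tensor seems to carry 8 entries while the gradient-side tensor has 32 rows. The sketch does not say which side is wrong, only that the two were not built for the same matrix.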
  File "/XXXXXX/miniconda3/envs/lora/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 485, in backward
    .mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 0
I encountered the same issue.
Expected Behavior
LoRA fine-tuning should run on 2x A100 GPUs.
Current behaviour
Steps to reproduce
Run the axolotl Docker image on Runpod with 2x A100 GPUs using 8-bit LoRA. QLoRA works.
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10.13
axolotl branch-commit
main