Open jaywongs opened 7 months ago
Try pip install -U deepspeed
.
This solved a similar problem with mistral 7b
@jaywongs , did the above solve it for you? I find this issue dependent on the machine. It may also be a bitsandbytes issue.
Yes, that solved it for me!
Apologies for the delayed response. I have tried using the latest version of deepspeed, but the error persists.
@jaywongs , did upgrading deepspeed work for you?
It didn't work for me; I'm using deepspeed 0.14.2.
Hello, have you solved it? I also encountered the same problem.
Unfortunately, I was unable to solve it in the end.
Same error here.
Error invalid configuration argument at line 218 in file /src/csrc/ops.cu
I used the winglian/axolotl:main-latest
docker image, and my configuration is shown below:
**** Axolotl Dependency Versions *****
accelerate: 0.33.0
peft: 0.12.0
transformers: 4.44.0
trl: 0.9.6
torch: 2.3.1+cu121
bitsandbytes: 0.43.3
****************************************
deepspeed: 0.15.0
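For anyone comparing environments, a dependency dump like the one above can be reproduced with a small helper. This is a hedged sketch of my own (the function name and package list are not an axolotl utility):

```python
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages):
    """Return a dict mapping package name -> installed version, or None if absent."""
    out = {}
    for name in packages:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = None
    return out

# Packages relevant to this issue; adjust to taste.
print(report_versions(
    ["accelerate", "peft", "transformers", "trl", "torch",
     "bitsandbytes", "deepspeed"]
))
```

Running this inside the container on an affected node and on a working node makes version drift easy to spot.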
Hey everyone, apologies for taking so long to circle back to this. Unfortunately, I could not reproduce this issue on runpod nodes. I used winglian/axolotl-cloud:main-latest
on 2x A40 and did not encounter this issue with qlora configs.
Are these all from local systems or from cloud systems? If the latter, have you tried provisioning another node? Secondly, does it only happen with certain configs (large vs. small models, full fine-tune vs. adapter)?
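Since this looks machine-dependent, one quick check before re-provisioning is whether the container actually exposes the GPUs at all. A minimal diagnostic sketch (the helper name is mine, and it assumes the NVIDIA driver exposes `nvidia-smi`):

```python
import shutil
import subprocess

def gpu_visible():
    """Best-effort check that this environment can see NVIDIA GPUs.

    Returns the `nvidia-smi -L` listing if the tool is available,
    otherwise a short note. Purely a diagnostic sketch, not part of axolotl.
    """
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found: no NVIDIA driver visible in this environment"
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    # On success stdout lists one line per GPU; on failure stderr explains why.
    return result.stdout or result.stderr

print(gpu_visible())
```

If the listing differs between a failing and a working node (missing GPUs, mismatched driver), that would support the "dependent on machine" theory.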
Please check that this issue hasn't been reported before.
Expected Behavior
The training task should start without errors.
Current behaviour
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Steps to reproduce
I trained the Codellama-70b model using 8 A100 80G GPUs. I performed a full fine-tune and used the following shell to start the training process:
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11.5
axolotl branch-commit
main/132eb740f036eff0fa8b239ddaf0b7a359ed1732
Acknowledgements