Closed by hobpond 11 months ago
I'm in the same situation using the sample deepspeed command: the process is killed and returns -9. How can I fix this?
Please check your package versions and whether you have enough memory.
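A quick way to run both checks suggested above is a small script like the following. It is only a sketch: the package names in the usage line (torch, transformers, deepspeed, accelerate) are an assumption about what the fine-tuning script depends on, and the RAM lookup relies on POSIX `os.sysconf`, so it returns `None` on platforms that lack it.

```python
import os
from importlib import metadata


def package_versions(names):
    """Return {name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def total_system_ram_gib():
    """Total physical RAM in GiB, or None where os.sysconf is unavailable."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    return page_size * page_count / 2**30


if __name__ == "__main__":
    # Assumed package list; adjust to match your requirements file.
    for name, version in package_versions(
        ["torch", "transformers", "deepspeed", "accelerate"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
    print(f"Total system RAM (GiB): {total_system_ram_gib()}")
```

This reports total system RAM, not free RAM, so it only tells you whether the machine could plausibly hold the offloaded state at all.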
Thanks for the quick response!
You are right. It looks like an OOM, but in system RAM rather than VRAM. All 170GB of system RAM was used just before the python process died. I changed pin_memory to false in configs/ds_config_zero3.json for both offload_optimizer and offload_param, and that got it to start fine-tuning. (I also changed the deepspeed parameter --per_device_train_batch_size to 1 instead of 16.)
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": false
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": false
    },
    ...
Not sure if I should tune sub_group_size as well, but it is fine-tuning now, so I will report back if a better config is found for a single A100 80GB.
Thanks again for the help!
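For anyone who wants to apply the same pin_memory change without editing the file by hand, here is a minimal sketch. It assumes the config is plain JSON at configs/ds_config_zero3.json (the path mentioned above); the helper names `disable_pin_memory` and `patch_file` are made up for illustration and are not part of DeepSpeed.

```python
import copy
import json


def disable_pin_memory(ds_config):
    """Return a copy of a DeepSpeed config dict with pin_memory set to False
    in whichever ZeRO offload sections are present."""
    patched = copy.deepcopy(ds_config)
    zero = patched.get("zero_optimization", {})
    for section in ("offload_optimizer", "offload_param"):
        if isinstance(zero.get(section), dict):
            zero[section]["pin_memory"] = False
    return patched


def patch_file(path):
    """Rewrite the JSON config at `path` in place with pin_memory disabled."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(disable_pin_memory(config), f, indent=4)


# Example (path taken from the comment above):
# patch_file("configs/ds_config_zero3.json")
```

Sections that are absent from the config are left absent, so the script does not turn on offloading where it was not already configured.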
Monitoring VRAM with nvidia-smi showed that my per_device_train_batch_size was way too small.
Hey! Can you tell me the minimum requirements (GPU VRAM, system RAM, etc.) for fine-tuning with the edits you made in config.json? I am getting the same -9 error. Have you found a better config.json?
Thank you for the handy fine-tuning guide, but I am not able to get started.
I tried using the default settings as a POC but it ends up erroring out.
This is the output I get when using the sample deepspeed command in the README.md
I tried to run the finetune_deepseekcoder.py script directly to see what the actual error is and it outputted