treya-lin opened 1 year ago
A temporary update:
I just got it running on 7×A100-40G with the following batch size settings
(a batch_size of 64 throws OOM):
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16
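For reference, the effective global batch size under these settings (assuming all 7 GPUs participate in data parallelism) works out as:

```python
# Effective global batch size = per-device batch * number of GPUs * accumulation steps
per_device_train_batch_size = 4
num_gpus = 7
gradient_accumulation_steps = 16

effective_batch_size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 448
```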
Now the GPU status looks like this. It is a bit concerning because I only have 40G on each card. Will usage keep going up?
| 0 N/A N/A 190117 C /opt/conda/bin/python 39321MiB |
| 1 N/A N/A 190118 C /opt/conda/bin/python 39875MiB |
| 2 N/A N/A 190119 C /opt/conda/bin/python 39875MiB |
| 3 N/A N/A 190120 C /opt/conda/bin/python 39875MiB |
| 4 N/A N/A 190121 C /opt/conda/bin/python 39875MiB |
| 5 N/A N/A 190122 C /opt/conda/bin/python 39875MiB |
| 6 N/A N/A 190123 C /opt/conda/bin/python 39587MiB |
I will see if it proceeds properly. But I am still surprised that this 6B model has such demanding memory requirements for full-parameter fine-tuning. Is this normal in your local environment too? I just realized it may be because no offloading to CPU was used in this case.
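A rough back-of-the-envelope estimate (my own sketch, assuming mixed-precision training with Adam and no ZeRO partitioning or offload) suggests the memory demand is expected for full-parameter fine-tuning:

```python
# Approximate per-parameter memory for mixed-precision Adam training:
#   fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights (4 B)
#   + fp32 Adam momentum (4 B) + fp32 Adam variance (4 B) = 16 bytes/param
params = 6e9  # chatglm-6b
bytes_per_param = 2 + 2 + 4 + 4 + 4

total_gib = params * bytes_per_param / 1024**3
print(round(total_gib))  # ~89 GiB for model states alone, before activations
```

ZeRO stages 2/3 shard these states across GPUs (and optionally offload them to CPU), which is why they matter so much here.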
Would it be possible to add more guidance on how to use DeepSpeed's ZeRO-3 and CPU offload techniques to reduce the required GPU memory? (Something like what this project does: https://github.com/CVI-SZU/Linly/wiki/%E5%A2%9E%E9%87%8F%E8%AE%AD%E7%BB%83)
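For what it's worth, a minimal ZeRO-3 + CPU offload config sketch looks roughly like this (field names as in the DeepSpeed configuration docs; the values here are placeholders to adapt, not a tested recipe for this repo):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Offloading the optimizer states alone already removes the largest chunk of GPU memory, at the cost of slower steps due to CPU↔GPU transfers.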
Is there an existing issue for this?
Current Behavior
Hi, I am trying to use `ds_train_finetune.sh` to fine-tune chatglm-6b with my dialogue data. I prepared my data as README.md suggests and edited the shell script, adding the `--history_column` argument. I have a few A100-40G GPUs, but it still threw a CUDA OOM error. Does anyone know how to get it to work? It is so strange. I also wonder if the developers could share a successful log from their local environment. Could we have a more detailed document describing under what setup (GPU count and GPU memory) and what config training succeeds, and how long it takes, so we can make a better estimate before starting?

Configuration in my `ds_train_finetune_chat.sh` (based on the official `ds_train_finetune.sh`):

The error log:
Expected Behavior
Make fine-tuning work.
Steps To Reproduce
Edit `ds_train_finetune.sh`, changing or adding the arguments above, then run `bash ds_train_finetune.sh`.
Environment
Anything else?
No response