avemio-digital opened this issue 1 month ago
It's quite memory intensive - what hardware are you running on? We're developing a way to compute the teacher logits beforehand which should help with this issue.
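To illustrate the idea of computing teacher logits beforehand, here is a minimal, framework-agnostic sketch (this is **not** DistillKit's actual implementation; the function names, the top-k truncation, and the `.npz` cache format are all illustrative assumptions). Caching only the top-k logits per token keeps the cache small while preserving most of the distillation signal, and it means the teacher never has to sit in VRAM during student training:

```python
import numpy as np

def cache_teacher_topk(teacher_fn, batches, k=64, path="teacher_logits.npz"):
    """Run the teacher once over the dataset and save only its top-k
    logits per token, so training no longer needs the teacher in VRAM.

    teacher_fn: callable mapping a (batch, seq) int array to
                (batch, seq, vocab) float logits -- a stand-in here
                for a real teacher model's forward pass
    batches:    iterable of tokenized input batches
    """
    all_vals, all_idx = [], []
    for batch in batches:
        logits = teacher_fn(batch)                        # (B, T, V)
        # indices of the k largest logits along the vocab axis (unsorted)
        idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
        vals = np.take_along_axis(logits, idx, axis=-1)
        all_vals.append(vals.astype(np.float16))          # halve cache size
        all_idx.append(idx.astype(np.int32))
    np.savez(path,
             values=np.concatenate(all_vals),
             indices=np.concatenate(all_idx))
    return path
```

At distillation time the KD loss would then read the cached `(values, indices)` pairs instead of running a teacher forward pass, which is where the VRAM saving comes from.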
I am using an A100 with 80 GB VRAM, without DeepSpeed. The teacher model is Arcee Spark and the student model is Qwen 1.5B, but training fails to start with an OOM error.
@avemio-digital It should work... do you have a replication notebook or some code for us?
After running `!accelerate launch distil_logits.py` I get the error below:
```
Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:280: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': max_seq_length. Will not be supported from version '1.0.0'.
You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:408: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:408: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
  0%|          | 0/1687 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/workspace/DistillKit/distil_logits.py", line 187, in
```
(traceback truncated in the original paste)
Can you share your notebook/code? And please, if you could format your paste before posting, it would help a lot in following up on your issue.
@fernando-neto-ai I'm facing the same OOM issue even with 8×A100 (40 GB), and I'm not using DeepSpeed. Teacher model: Solar 10.7B, student model: Qwen 1.5B. Attaching the script for reference.
You should reduce your `max_length` until the run fits in VRAM. If you still hit OOM, use DDP or FSDP. Another suggestion is to save the teacher model's logits offline, which frees a lot of VRAM during training.
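A hedged sketch of the first suggestion, which also addresses the warnings in the log above (the exact field names depend on your `trl` version; `model`, `tokenizer`, and `dataset` here are assumed to be defined elsewhere in your script, and the numeric values are starting points, not recommendations):

```python
from trl import SFTConfig, SFTTrainer

# Silences the half-precision padding warning from sft_trainer.py:408
tokenizer.padding_side = "right"

config = SFTConfig(
    output_dir="./results",
    max_seq_length=1024,             # lower this first if you hit OOM
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keeps the effective batch size up
    gradient_checkpointing=True,     # trades recompute for memory
    bf16=True,
)

# Passing max_seq_length via SFTConfig (not as a positional SFTTrainer
# argument) also avoids the deprecation warnings shown earlier.
trainer = SFTTrainer(model=model, args=config, train_dataset=dataset,
                     tokenizer=tokenizer)
```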
@avemio-digital Hi, I'm also hitting this problem — how did you solve it? @fernando-neto-ai
The distil_logits.py script raises a CUDA OOM error as soon as training starts.