arcee-ai / DistillKit

An Open Source Toolkit For LLM Distillation
GNU Affero General Public License v3.0

CUDA Out of memory issue #4

Open · avemio-digital opened this issue 1 month ago

avemio-digital commented 1 month ago

The distil_logits.py script runs out of CUDA memory (OOM) as soon as training starts.

Crystalcareai commented 1 month ago

It's quite memory-intensive; what hardware are you running on? We're developing a way to compute the teacher logits beforehand, which should help with this issue.
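For anyone who wants to try that idea before it lands in the toolkit, here is a minimal sketch of precomputing teacher logits offline, so the teacher never has to share GPU memory with the student during training. It assumes a plain Hugging Face causal LM and a list of raw texts; the model id, top-k value, file name, and helper function are illustrative, not part of DistillKit.

```python
# Sketch only: dump top-k teacher logits to disk once, so the teacher model
# does not occupy GPU memory while the student trains. The model id, TOP_K,
# and output path are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_ID = "arcee-ai/Arcee-Spark"   # teacher mentioned in this thread; exact id may differ
TOP_K = 64                            # keep only the top-k logits per token position

tokenizer = AutoTokenizer.from_pretrained(TEACHER_ID)
teacher = AutoModelForCausalLM.from_pretrained(
    TEACHER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
teacher.eval()

@torch.no_grad()
def precompute_topk_logits(texts, max_length=2048, out_path="teacher_topk.pt"):
    records = []
    for text in texts:
        enc = tokenizer(text, truncation=True, max_length=max_length,
                        return_tensors="pt").to(teacher.device)
        logits = teacher(**enc).logits[0]             # (seq_len, vocab_size)
        values, indices = logits.topk(TOP_K, dim=-1)  # store a sparse slice only
        records.append({
            "input_ids": enc["input_ids"][0].cpu(),
            "topk_values": values.cpu(),
            "topk_indices": indices.cpu(),
        })
    torch.save(records, out_path)
```

During training, the distillation loss can then be evaluated only over those k teacher entries per position, which removes both the teacher forward pass and the full-vocabulary teacher logit tensor from GPU memory.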

avemio-digital commented 1 month ago

I am using an A100 with 80 GB of VRAM, without DeepSpeed. The teacher model is Arcee Spark and the student model is Qwen 1.5, but training fails to start because of the OOM error.

Jacobsolawetz commented 1 month ago

@avemio-digital That should work... do you have a replication notebook or some code for us?

avemio-digital commented 1 month ago

After running !accelerate launch distil_logits.py I get the error below:

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:280: UserWarning: You passed a max_seq_length argument to the SFTTrainer, the value you passed will override the one in the SFTConfig.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': max_seq_length. Will not be supported from version '1.0.0'.
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:408: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
  warnings.warn(
(the same warnings are printed again by the second rank)

  0%| | 0/1687 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/DistillKit/distil_logits.py", line 187, in <module>
    trainer.train(resume_from_checkpoint=config["training"]["resume_from_checkpoint"])
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 451, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1948, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2289, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3328, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/DistillKit/distil_logits.py", line 150, in compute_loss
    custom_loss = self.distillation_loss(student_outputs.logits, teacher_outputs.logits, inputs, student_outputs.loss)
  File "/workspace/DistillKit/distil_logits.py", line 159, in distillation_loss
    loss_kd = F.kl_div(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2955, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum, log_target=log_target)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.64 GiB. GPU 0 has a total capacty of 79.14 GiB of which 134.75 MiB is free. Process 258614 has 79.00 GiB memory in use. Of the allocated memory 77.58 GiB is allocated by PyTorch, and 314.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(rank 1 raises the identical traceback, ending with:)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.64 GiB. GPU 1 has a total capacty of 79.14 GiB of which 192.75 MiB is free. Process 258615 has 78.94 GiB memory in use. Of the allocated memory 77.46 GiB is allocated by PyTorch, and 373.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%| | 0/1687 [00:05<?, ?it/s]

[2024-08-09 00:09:43,268] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1214 closing signal SIGTERM
[2024-08-09 00:09:43,683] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1213) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
distil_logits.py FAILED
------------------------------------------------------------
Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-09_00:09:43
  host      : 7dd6740525d6
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1213)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
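The OOM message above ends with a hint about max_split_size_mb and PYTORCH_CUDA_ALLOC_CONF. Setting it is only a partial mitigation: it helps when a lot of memory is "reserved but unallocated" due to fragmentation, but it cannot rescue a run that genuinely needs more VRAM than the card has (here the failing 4.64 GiB allocation happens inside F.kl_div over full-vocabulary logits). A minimal, hedged way to try the hint, with 128 as an illustrative split size:

```python
# Put this at the very top of distil_logits.py (or export the variable in the
# shell before `accelerate launch`). It must run before torch allocates any
# CUDA memory; the 128 MiB value is an assumption, not a recommendation.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported only after the environment variable is set
```

If the error persists with this set, the real fix is to shrink the logit tensors themselves: shorter sequences, a smaller per-device batch, or teacher logits computed offline as sketched earlier.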
fernando-neto-ai commented 3 weeks ago

Can you share your notebook/code? And please, if you could format it before pasting, that would help a lot in following up on your issue.

sukkritsharmaofficial commented 2 weeks ago

@fernando-neto-ai I'm facing the same OOM issue even with 8x A100 (40 GB). I'm not using DeepSpeed; the teacher model is Solar 10.7B and the student model is Qwen 1.5B. Attaching the script for reference.

train_distill.txt

fernando-neto-ai commented 2 weeks ago

You should lower your max_length so it fits your VRAM. If you still hit OOM, use DDP or FSDP. Another suggestion would be to save the logits offline from the teacher model, which spares a lot of VRAM.
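To make the first suggestion concrete, here is a hedged sketch of the settings that usually decide whether distil_logits.py fits in memory. The key names only approximately mirror the config dict at the top of the script; check your copy, since the exact layout may differ.

```python
# Illustrative overrides only; the config structure and exact key names are
# assumptions based on the script's config dict, not a verified API.
memory_overrides = {
    "tokenizer": {
        "max_length": 1024,  # shorter sequences shrink the (batch, seq, vocab) logit tensors
    },
    "training": {
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 8,  # preserve the effective batch size
        "gradient_checkpointing": True,    # trade recompute for activation memory
        "bf16": True,
    },
}
```

Note that DDP alone replicates both models on every GPU, so it does not lower per-device memory; FSDP (selected via accelerate config) shards the student's parameters and optimizer state, while a frozen teacher typically still costs full memory per rank unless its logits are precomputed offline.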

JinYu1998 commented 2 weeks ago

@avemio-digital Hi, I'm also running into this problem. How did you solve it? @fernando-neto-ai