THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs
Apache License 2.0
5.17k stars 429 forks

GLM-4V-9B: resuming fine-tuning fails with an out-of-memory error #408

Closed HouYueJie closed 3 months ago

HouYueJie commented 3 months ago

System Info

torch 2.3.0
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.555.43
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu12 12.1.105
deepspeed 0.14.4
bitsandbytes 0.43.1
transformers 4.42.4
datasets 2.20.0

Who can help?

No response

Information

Reproduction

The error output is as follows (traceback from rank 5, one frame per entry, with the failing line marked by ❱):

/home/WCzhou/workspace/40_backup/VLM_paper/GLM-4/finetune_demo/finetune_vision.py:535 in main
  ❱ 535  trainer.train(resume_from_checkpoint=checkpoint_directory)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/trainer.py:1932 in train
  ❱ 1932  return inner_training_loop(
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/trainer.py:2268 in _inner_training_loop
  ❱ 2268  tr_loss_step = self.training_step(model, inputs)
/home/WCzhou/workspace/40_backup/VLM_paper/GLM-4/finetune_demo/finetune_vision.py:67 in training_step
  ❱ 67  loss = self.compute_loss(model, inputs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/trainer.py:3338 in compute_loss
  ❱ 3338  outputs = model(**inputs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/parallel/distributed.py:1593 in forward
  ❱ 1593  else self._run_ddp_forward(*inputs, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/parallel/distributed.py:1411 in _run_ddp_forward
  ❱ 1411  return self.module(*inputs, **kwargs)  # type: ignore[index]
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/peft/peft_model.py:1430 in forward
  ❱ 1430  return self.base_model(
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:179 in forward
  ❱ 179  return self.model.forward(*args, **kwargs)
/home/WCzhou/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py:1179 in forward
  ❱ 1179  transformer_outputs = self.transformer(
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py:973 in forward
  ❱ 973  images_features = self.vision(images)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py:164 in forward
  ❱ 164  x = self.transformer(x)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py:126 in forward
  ❱ 126  hidden_states = layer_module(hidden_states)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py:109 in forward
  ❱ 109  attention_output = self.input_layernorm(self.attention(attention_input))
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/.cache/huggingface/modules/transformers_modules/glm-4v-9b/visual.py:66 in forward
  ❱ 66  qkv = self.query_key_value(x)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  ❱ 1532  return self._call_impl(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  ❱ 1541  return forward_call(*args, **kwargs)
/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/peft/tuners/lora/layer.py:569 in forward
  ❱ 569  result = result + lora_B(lora_A(dropout(x))) * scaling

rank5: OutOfMemoryError: CUDA out of memory. Tried to allocate 66.00 MiB. GPU has a total capacity of 79.14 GiB of which 25.69 MiB is free. Process 8674 has 28.09 GiB memory in use. Including non-PyTorch memory, this process has 50.95 GiB memory in use. Of the allocated memory 49.39 GiB is allocated by PyTorch, and 825.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

W0729 14:27:21.232000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 409509 closing signal SIGTERM
W0729 14:27:21.235000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 409510 closing signal SIGTERM
W0729 14:27:21.248000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 409511 closing signal SIGTERM
W0729 14:27:21.253000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 409512 closing signal SIGTERM
W0729 14:27:21.264000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 409513 closing signal SIGTERM
W0729 14:27:21.301000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 409515 closing signal SIGTERM
E0729 14:27:26.256000 139661001613952 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 5 (pid: 409514) of binary: /home/WCzhou/anaconda3/envs/GLM/bin/python
Traceback (most recent call last):
  File "/home/WCzhou/anaconda3/envs/GLM/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/WCzhou/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/WCzhou/workspace/40_backup/VLM_paper/GLM-4/finetune_demo/finetune_vision.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-29_14:27:21
  host : h3c
  rank : 5 (local_rank: 5)
  exitcode : 1 (pid: 409514)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Looking at nvitop, memory appears to pile up on GPU 0, while the other GPUs are only partially occupied.

![image](https://github.com/user-attachments/assets/eb55b7d6-6190-4167-9a91-258da9524322)

Expected behavior

In theory the load should be spread evenly across all GPUs. What might be going on here?

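The figures in the rank 5 OutOfMemoryError can be tallied to see who is holding the memory. The sketch below is only an illustration: it parses the message text quoted in this report with regular expressions written for this one message format (`summarize` and `to_gib` are hypothetical helper names), and it assumes nothing about the training script itself.

```python
import re

# Text copied from the rank 5 error message in this report.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 66.00 MiB. "
    "GPU has a total capacity of 79.14 GiB of which 25.69 MiB is free. "
    "Process 8674 has 28.09 GiB memory in use. "
    "Including non-PyTorch memory, this process has 50.95 GiB memory in use."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value: str, unit: str) -> float:
    """Convert a '<number> <unit>' pair from the message into GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def summarize(message: str) -> dict:
    """Extract the per-process memory figures from one OOM message."""
    size = r"([\d.]+) (MiB|GiB)"
    total = re.search(rf"total capacity of {size}", message)
    free = re.search(rf"of which {size} is free", message)
    this = re.search(rf"this process has {size}", message)
    # Other processes on the same GPU are reported as "Process <pid> has ...".
    others = re.findall(rf"Process \d+ has {size}", message)
    return {
        "total_gib": to_gib(*total.groups()),
        "free_gib": to_gib(*free.groups()),
        "this_process_gib": to_gib(*this.groups()),
        "other_processes_gib": sum(to_gib(v, u) for v, u in others),
    }


summary = summarize(OOM_MESSAGE)
used = summary["this_process_gib"] + summary["other_processes_gib"]
print(f"in use: {used:.2f} GiB of {summary['total_gib']:.2f} GiB")
```

By this tally the failing process holds 50.95 GiB and a second process (PID 8674) holds another 28.09 GiB on the same 79.14 GiB card, so the GPU is shared. Whether that second process is another training rank or an unrelated job is not stated in the report.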
zRzRzRzRzRzRzR commented 3 months ago

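What do you mean by "continuing fine-tuning"? Are you saying that the initial fine-tuning run with this script did not run out of GPU memory?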

elesun2018 commented 2 months ago

GLM-4.0731.gitraw, writer_batch_size=1, batch_size=1

GLM-4/finetune_demo# CUDA_VISIBLE_DEVICES=1 python finetune_vision.py

It still fails with:

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.22 GiB (GPU 0; 47.54 GiB total capacity; 44.83 GiB already allocated; 1.07 GiB free; 46.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
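The error message suggests trying max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of preparing the environment for such a launch follows; `build_launch_env` is a hypothetical helper, and the value 128 is an arbitrary illustration, not a recommendation from the maintainers.

```python
import os


def build_launch_env(visible_devices: str, alloc_conf: str, base=None) -> dict:
    """Return a copy of the environment with the two CUDA-related variables set."""
    env = dict(os.environ if base is None else base)
    # With CUDA_VISIBLE_DEVICES=1, physical GPU 1 is the only visible device,
    # which is why PyTorch reports it as "GPU 0" in the error message.
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    # Allocator setting named in the error message; 128 is an assumed value.
    env["PYTORCH_CUDA_ALLOC_CONF"] = alloc_conf
    return env


env = build_launch_env("1", "max_split_size_mb:128", base={})
command = ["python", "finetune_vision.py"]
# To actually launch: import subprocess; subprocess.run(command, env=env, check=True)
print(env)
```

In the message above, reserved memory (46.13 GiB) exceeds allocated memory (44.83 GiB) by only about 1.3 GiB, so fragmentation is probably not the main problem here and allocator tuning alone may not be enough.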

elesun2018 commented 2 months ago

Regarding the OutOfMemoryError: CUDA out of memory problem: do I need to update the code, or the model files?

elesun2018 commented 2 months ago

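As of 08/21, updating the code and the model files did not help; I still get OutOfMemoryError. What is the minimum GPU memory required for fine-tuning, and how should it be configured? Where else should I look to debug this (environment versions? the code at the point where memory runs out)?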

elesun2018 commented 3 weeks ago

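[image]

Would it be possible to use DeepSpeed with 2×A100 to split the roughly 75 GB of training memory across the two cards, and then use ZeRO-3 for further memory optimization? Is that feasible? Is there a version of finetune_vision.py that supports DeepSpeed ZeRO-3?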
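For reference, a ZeRO-3 configuration of the kind asked about usually looks like the sketch below. It assumes the script passes a DeepSpeed config to the Hugging Face Trainer, which is what lets the "auto" values be filled in from the training arguments. Whether finetune_vision.py works with it unmodified is not confirmed anywhere in this thread, and `zero3_config` is a hypothetical helper name.

```python
import json


def zero3_config(offload_to_cpu: bool = False) -> dict:
    """Build a DeepSpeed ZeRO stage 3 config as a plain dict."""
    zero = {
        "stage": 3,  # partition parameters, gradients, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_to_cpu:
        # Optional: trade speed for GPU memory by keeping state in host RAM.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "bf16": {"enabled": "auto"},
        "zero_optimization": zero,
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


config_text = json.dumps(zero3_config(), indent=2)
print(config_text)
```

ZeRO-3 shards parameters, gradients, and optimizer states across the ranks, but each GPU still holds the activations for its own micro-batch, so the 75 GB figure would not simply halve on two cards.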

elesun2018 commented 1 week ago

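After modification, finetune_vision.py supports ZeRO-2/3 optimization, but training is too slow. With the new finetune_vision.py, which freezes the ViT, training needs only 28-35 GB of GPU memory. I just don't know how good the results are after fine-tuning.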
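The modified script is not shown in this thread, so the following is only a guess at what "freezing the ViT" could look like. It assumes the vision tower's parameter names contain `transformer.vision.`, inferred from the `self.transformer(...)` and `self.vision(images)` calls in the traceback above. `freeze_matching` and the example parameter names are hypothetical, and stand-in objects replace real tensors so the sketch runs without PyTorch.

```python
from types import SimpleNamespace


def freeze_matching(named_parameters, marker: str = "transformer.vision.") -> list:
    """Disable gradients for every parameter whose name contains `marker`."""
    frozen = []
    for name, param in named_parameters:
        if marker in name:
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Stand-in parameters (hypothetical names) mimicking a PEFT-wrapped model.
fake_parameters = [
    (
        "base_model.model.transformer.vision.transformer.layers.0."
        "attention.query_key_value.lora_A.default.weight",
        SimpleNamespace(requires_grad=True),
    ),
    (
        "base_model.model.transformer.encoder.layers.0."
        "self_attention.query_key_value.lora_A.default.weight",
        SimpleNamespace(requires_grad=True),
    ),
]

frozen_names = freeze_matching(fake_parameters)
print(frozen_names)
```

With a real model the call would be `freeze_matching(model.named_parameters())`, made before the optimizer is created. Since the traceback shows the OOM inside a LoRA layer in visual.py, leaving the vision modules out of the LoRA target list might have a similar effect, but that is untested here.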