OpenBMB / VisCPM

[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | 基于CPM基础模型的中英双语多模态大模型系列

torch.cuda.OutOfMemoryError #23

Closed zxc351200 closed 7 months ago

zxc351200 commented 1 year ago

Traceback (most recent call last):
  File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in <module>
    main()
  File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main
    train(model, args)
  File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 87, in train
    vllm_engine, vllm_optim, _, _ = deepspeed.initialize(
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1398, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 485, in __init__
    self.initialize_optimizer_states()
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 614, in initialize_optimizer_states
    self.optimizer.step()
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 129, in step
    state['exp_avg_sq'] = torch.zeros_like(p.data)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.22 GiB (GPU 0; 79.21 GiB total capacity; 76.87 GiB already allocated; 45.56 MiB free; 77.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I set the batch size to 1 on an A100 80G and still hit the error above. How can I fix this? Has anyone managed to finetune the model successfully?
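
One thing worth trying first is the allocator hint from the error message itself. A minimal sketch, assuming it is applied at the very top of train_viscpm_chat.py (the value 128 is only an example, and this only mitigates fragmentation; it cannot create memory that is not there):

```python
# Hedged sketch: configure the CUDA caching allocator via the env var the error
# message points to. 128 MiB is an arbitrary example value for max_split_size_mb.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # the allocator reads the variable on the first CUDA allocation,
              # so setting it before importing torch keeps things safe
```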

Cuiunbo commented 1 year ago

Are you running on a single A100? I have only run it on 8 GPUs. When I first tried on 2 GPUs I hit the same problem as you; scaling up to 8 GPUs solved it completely.

zxc351200 commented 1 year ago

Are you running on a single A100? I have only run it on 8 GPUs. When I first tried on 2 GPUs I hit the same problem as you; scaling up to 8 GPUs solved it completely.

I tried on 8 GPUs and still hit this problem. Not sure whether it is a bug in the code.

rover5056 commented 1 year ago

OOM also shows up with 8 GPUs and batch size 1... @Cuiunbo did you change the DeepSpeed config?

Cuiunbo commented 1 year ago

Are you running on a single A100? I have only run it on 8 GPUs. When I first tried on 2 GPUs I hit the same problem as you; scaling up to 8 GPUs solved it completely.

I tried on 8 GPUs and still hit this problem. Not sure whether it is a bug in the code.

Could you post the error log from the 8-GPU run? Does it also fail during DeepSpeed initialization?

zxc351200 commented 1 year ago

File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in main() File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main train(model, args) File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 134, in train vllm_engine.backward(loss) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, kwargs) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1796, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1873, in backward buf_0 = torch.empty(int(self.reduce_bucket_size), torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 7; 79.21 GiB total capacity; 49.63 GiB already allocated; 844.62 MiB free; 50.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last): File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in main() File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main train(model, args) File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 134, in train vllm_engine.backward(loss) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, *kwargs) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1796, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1873, in backward buf_0 = torch.empty(int(self.reduce_bucket_size), torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 1; 79.21 GiB total capacity; 49.63 GiB already allocated; 268.62 MiB free; 50.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last):█████████▎ | 105M/168M [00:09<00:05, 12.2MB/s] File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in main() File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main train(model, args) File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 134, in train vllm_engine.backward(loss) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, kwargs) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1796, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1873, in backward buf_0 = torch.empty(int(self.reduce_bucket_size), torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 0; 79.21 GiB total capacity; 49.63 GiB already allocated; 844.62 MiB free; 50.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last): File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in main() File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main train(model, args) File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 134, in train vllm_engine.backward(loss) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, kwargs) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1796, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1873, in backward buf_0 = torch.empty(int(self.reduce_bucket_size), torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 2; 79.21 GiB total capacity; 49.63 GiB already allocated; 268.62 MiB free; 50.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last): File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in main() File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main train(model, args) File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 134, in train vllm_engine.backward(loss) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, *kwargs) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1796, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1873, in backward buf_0 = torch.empty(int(self.reduce_bucket_size), torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 4; 79.21 GiB total capacity; 49.63 GiB already allocated; 268.62 MiB free; 50.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2023-08-30 17:58:31,980] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31576] [2023-08-30 17:58:32,306] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31577 [2023-08-30 17:58:32,327] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31578 [2023-08-30 17:58:32,480] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31579 Traceback (most recent call last): File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 206, in main() File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 202, in main train(model, args) File "/data/CV/caidaigang/model/VisCPM/./finetune/ft_viscpm_chat/train_viscpm_chat.py", line 134, in train vllm_engine.backward(loss) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, kwargs) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1796, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/data/CV/caidaigang/anaconda3/envs/viscpm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1873, in backward buf_0 = torch.empty(int(self.reduce_bucket_size), torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 954.00 MiB (GPU 6; 79.21 GiB total capacity; 49.63 GiB already allocated; 268.62 MiB free; 50.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2023-08-30 17:58:33,194] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31580 [2023-08-30 17:58:33,215] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31581 [2023-08-30 17:58:33,938] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31582 [2023-08-30 17:58:34,412] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31583 [2023-08-30 17:58:34,412] [ERROR] [launch.py:434:sigkill_handler] ['/data/CV/caidaigang/anaconda3/envs/viscpm/bin/python', '-u', './finetune/ft_viscpm_chat/train_viscpm_chat.py', '--local_rank=7', '--image_path', '/data/CV/datasets/Maths/metis-kantumiaoshu-boxes-1.0', '--text_path', '/data/CV/datasets/Maths/metis-kantumiaoshu-boxes-1.0/new_train_vl.json', '--llm_path', './config/cpm-bee-10b.json', '--exp_name', 'ft_viscpm_chat', '--model_dir', '/data/CV/caidaigang/model/VisCPM/pretrained/VisCPM-Chat/pytorch_model.v1.bin', '--query_num', '64', '--max_len', '1024', '--batch_size', '1', '--save_step', '500', '--epochs', '1', '--deepspeed_config', './finetune/ft_viscpm_chat/config/deepspeed/viscpm_chat_ft.json', '--sft', '--tune_llm', '--tune_vision'] exits with return code = 1

Cuiunbo commented 1 year ago

@zxc351200 I have run llava150k on 8 GPUs / 512 seqlen / batch size 1 without any problem, but this error does look like genuinely running out of GPU memory. You could try that setting.

Cuiunbo commented 1 year ago

@rover5056 I did not change the config.
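
For anyone who does want to experiment with the config: below is a hedged sketch of ZeRO-2 settings that trade speed for optimizer-state memory, written as the Python dict form that deepspeed.initialize also accepts. The actual contents of viscpm_chat_ft.json are not shown in this thread, so every value here is an assumption, not the project's configuration.

```python
# Illustrative only: none of these values come from viscpm_chat_ft.json.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # keep Adam states in host RAM
        "reduce_bucket_size": 5e7,       # smaller bucket -> smaller buf_0 in stage_1_and_2.py
        "allgather_bucket_size": 5e7,
    },
}

# Could be passed in place of the JSON file, e.g.:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```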

zxc351200 commented 1 year ago

After I reduced max_len to 512, the OOM on 8 GPUs went away. However, finetuning still has some problem and I cannot tell where it comes from; inference works fine. The sum of the logits produced in every training batch is NaN. The problem may be in the position embedding, self.relative_attention_bias = torch.nn.parameter.Parameter(torch.empty(num_buckets + num_segment_bucket, num_heads, dtype=dtype)), but after I changed the initialization the output was still wrong. Comparing training and inference, the position_bias values in cpmbee.py differ quite a lot, so I am not sure whether that is where the problem is.
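
A small diagnostic that might help narrow the NaN down: check whether any parameter (for example the relative_attention_bias created with torch.empty) was left uninitialized after loading the checkpoint. This is only a sketch; the attribute path in the commented-out re-initialization is a guess, not the actual VisCPM module layout.

```python
import torch

def report_nonfinite_params(model):
    """Print every parameter containing NaN or Inf, e.g. a tensor created with
    torch.empty() whose checkpoint key never matched during load_state_dict."""
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite values in {name}, shape {tuple(p.shape)}")

report_nonfinite_params(model)  # run right after loading pytorch_model.v1.bin

# If the bias really is uninitialized, one option is to re-initialize it explicitly
# (the attribute path below is hypothetical):
# torch.nn.init.normal_(model.llm.position_bias.relative_attention_bias, std=0.02)
```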

zhaoyukoon commented 12 months ago

Did any of you hit the following problem when finetuning?

  File "/data1/miniconda3/envs/viscpm/lib/python3.10/site-packages/VisCPM-0.0.0-py3.10.egg/finetune/dataset/__init__.py", line 1, in <module>
    from VisCPM.dataset.itembuilder import CPMBeeImageTextBuilder
ModuleNotFoundError: No module named 'VisCPM.dataset'
zxc351200 commented 12 months ago

from VisCPM.dataset.itembuilder import CPMBeeImageTextBuilder

from finetune.dataset.itembuilder import CPMBeeImageTextBuilder. The import in __init__.py probably needs to be changed to this, because itembuilder lives under finetune.
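
In other words, finetune/dataset/__init__.py would contain the corrected import described above:

```python
# finetune/dataset/__init__.py
from finetune.dataset.itembuilder import CPMBeeImageTextBuilder
```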

zhaoyukoon commented 12 months ago

from VisCPM.dataset.itembuilder import CPMBeeImageTextBuilder / from finetune.dataset.itembuilder import CPMBeeImageTextBuilder. The import in __init__.py probably needs to be changed to this, because itembuilder lives under finetune.

That works now, but when I run bash finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh it gets stuck at

[2023-09-04,18:20:44][I][43282-__main__-train_viscpm_chat.py:93]- rank=0 load model successful
load raw data from: DatasetDict({
    train: Dataset({
        features: ['image', 'id', 'conversations'],
        num_rows: 157712
    })
})

and then nothing happens. Have you run into this?

rover5056 commented 12 months ago

from VisCPM.dataset.itembuilder import CPMBeeImageTextBuilder / from finetune.dataset.itembuilder import CPMBeeImageTextBuilder. The import in __init__.py probably needs to be changed to this, because itembuilder lives under finetune.

That works now, but when I run bash finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh it gets stuck at

[2023-09-04,18:20:44][I][43282-__main__-train_viscpm_chat.py:93]- rank=0 load model successful
load raw data from: DatasetDict({
    train: Dataset({
        features: ['image', 'id', 'conversations'],
        num_rows: 157712
    })
})

and then nothing happens. Have you run into this?

The reader loads all the images into memory up front; you can modify the reader to avoid that.
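
A minimal sketch of such a lazy reader, assuming each record carries an image path; the class and the build_item call are illustrative, not the actual API of the reader in finetune/dataset:

```python
from PIL import Image
from torch.utils.data import Dataset

class LazyImageTextDataset(Dataset):
    """Keeps only file paths and metadata in memory; images are decoded on demand."""

    def __init__(self, records, builder):
        self.records = records   # e.g. the rows of the loaded DatasetDict split
        self.builder = builder   # e.g. a CPMBeeImageTextBuilder instance

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")  # decode here, not at startup
        # build_item is a placeholder for however the builder packs image + text
        return self.builder.build_item(image, rec["conversations"])
```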