THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model
Apache License 2.0

LoRA fine-tuning error: fp16 is not supported #348

Closed: zousss closed this issue 6 months ago

zousss commented 6 months ago

Running `bash finetune/finetune_visualglm.sh` on a P100 with 16 GB of VRAM fails with `fp16 is not supported`. Is this because the P100 GPU does not support the fp16 data format, and if so, what should I change? I have also tried int8 and fp32, and both fail immediately with errors about those parameters.

```
[2024-03-28 10:03:33,213] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.0, git-hash=unknown, git-branch=unknown
[2024-03-28 10:03:33,214] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
Traceback (most recent call last):
  File "/kaggle/tmp/VisualGLM-6B/finetune_visualglm.py", line 194, in <module>
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 98, in training_main
    model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
  File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 174, in setup_model_untrainable_params_and_optimizer
    model, optimizer, _, _ = deepspeed.initialize(
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 240, in __init__
    self._do_sanity_check()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1040, in _do_sanity_check
    raise ValueError("Type fp16 is not supported.")
ValueError: Type fp16 is not supported.
b1e7502f3ce7:545:579 [0] NCCL INFO [Service thread] Connection closed by localRank 0
b1e7502f3ce7:545:545 [0] NCCL INFO comm 0x5a08a22b4710 rank 0 nranks 1 cudaDev 0 busId 40 - Abort COMPLETE
[2024-03-28 10:03:36,205] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 545
[2024-03-28 10:03:36,206] [ERROR] [launch.py:322:sigkill_handler] ['/opt/conda/bin/python3.10', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '150', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '100', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '2', '--eval-batch-size', '2', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '2', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
```
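For context: the traceback shows the error coming from DeepSpeed's `_do_sanity_check`, which rejects fp16 training when the accelerator is not reported as fp16-capable, and the P100 is a Pascal-generation GPU (compute capability 6.0) without Tensor Cores. A minimal sketch for checking what your GPU reports, assuming a standard PyTorch CUDA install:

```python
import torch

assert torch.cuda.is_available(), "No CUDA device visible"

# The P100 reports compute capability 6.0 (Pascal). Tensor Cores, and
# broadly supported fp16 training, start at 7.0 (Volta).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# Whether bf16 is a viable alternative (typically Ampere, compute 8.0+,
# so not on a P100 either).
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")

if major < 7:
    print("Pre-Volta GPU: fp16 training is likely to be rejected or slow; "
          "consider removing --fp16 and training in fp32.")
```

If this prints a capability below 7.0, the practical options are typically training in fp32 (remove `--fp16` from finetune_visualglm.sh and disable fp16 in the DeepSpeed config) or moving to a Volta-or-newer GPU such as a V100 or T4.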

zousss commented 6 months ago

Resolved by adding and fixing the DeepSpeed config, e.g. using ZeRO stage 3 and changing `"train_batch_size": "auto"` to `"train_batch_size": 1`.
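For anyone hitting the same error, a minimal sketch of the kind of config change described above; the keys are standard DeepSpeed config options, but the concrete values are illustrative, not a verified recipe for the P100:

```python
# Illustrative DeepSpeed config reflecting the fix described above; it can
# be passed to deepspeed.initialize as a dict or serialized to a JSON file.
ds_config = {
    "train_batch_size": 1,           # was "auto"; a fixed value avoids auto-resolution
    "fp16": {"enabled": False},      # the P100 fails DeepSpeed's fp16 sanity check
    "zero_optimization": {
        "stage": 3,                  # ZeRO stage 3, as suggested above
        "offload_optimizer": {       # replaces the deprecated cpu_offload flag
            "device": "cpu"
        },
    },
}
```

Note that DeepSpeed requires `train_batch_size` to equal micro-batch size × gradient-accumulation steps × number of GPUs, so any fixed value has to stay consistent with the script's `--batch-size` argument.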