Open zhanghang-official opened 1 month ago
Very early in training the loss suddenly jumps to 0, and lowering the learning rate does not fix it.
The config file is as follows:

```yaml
model:
  arch: st_llm_hf
  model_type: instructblip_vicuna0
  use_grad_checkpoint: True
  max_txt_len: 256
  end_sym: "###"
  prompt_path: "prompts/alignment.txt"
  prompt_template: '###Human: {} ###Assistant: '
  llama_model: '/root/qfs/lmm/weights/stllm/pretrained/vicuna-7b-v1.1/'
  ckpt: '/root/qfs/lmm/weights/stllm/pretrained/instruct_blip_vicuna7b_trimmed.pth'
  q_former_model: '/root/qfs/lmm/weights/stllm/pretrained/instruct_blip_vicuna7b_trimmed.pth'
  qformer_text_input: True
  freeze_LLM: False
  video_input: "residual"
  residual_size: 16
  use_mask: True
  mvm_decode: True

datasets:
  caption_体育240402_en:
    num_frames: 64

run:
  task: video_text_it
  bf16: True
  tf32: False
  output_dir: "./output/instructblipbase_stllm_conversation"
  num_train_epochs: 4
  dataloader_num_workers: 2
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
  gradient_accumulation_steps: 1
  evaluation_strategy: "no"
  learning_rate: 1e-10
  weight_decay: 0.
  warmup_ratio: 0.3
  lr_scheduler_type: 'cosine'
  logging_steps: 1
  model_max_length: 1024
  save_steps: 3000
  save_total_limit: 10
  deepspeed: 'stllm/train/zero2.json'
```
The training machine is 8× A100 40G.
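For reference, the schedule that the `run:` settings above produce (`warmup_ratio: 0.3` with `lr_scheduler_type: 'cosine'`) can be sketched as follows. This is a standalone re-implementation of the usual linear-warmup + cosine-decay shape for illustration, not the exact HF Trainer code:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to 0
    (the common HF-style 'cosine' schedule, re-implemented here)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # linear ramp from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Note that with `learning_rate: 1e-10` the peak step size is effectively zero, so if the loss still collapses to exactly 0 at that setting, the cause is unlikely to be optimizer instability.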
Hi, could you check whether the visual encoder, the Q-Former, or the LLM initialization went wrong?
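A minimal way to act on the suggestion above is to audit the loaded weights for obviously broken values right after building the model. The sketch below is a pure-Python stand-in where each parameter block is given as a flat list of floats (an assumption for illustration; in practice you would iterate `model.named_parameters()` with torch). It flags blocks that are non-finite or identically zero:

```python
import math

def audit_params(named_params):
    """Return (name, reason) pairs for parameter blocks that look broken:
    NaN/Inf values, or an all-zero block (often a sign the checkpoint
    never actually loaded into that module)."""
    bad = []
    for name, values in named_params.items():
        if any(not math.isfinite(v) for v in values):
            bad.append((name, "non-finite"))
        elif values and all(v == 0.0 for v in values):
            bad.append((name, "all-zero"))
    return bad
```

Running a check like this once after loading `ckpt` / `q_former_model` and before the first step would immediately narrow the search to the visual encoder, the Q-Former, or the LLM.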
The same config was posted again with only the following lines changed; these variants were tried as well:

```yaml
learning_rate: 2e-5
warmup_ratio: 0.03
save_strategy: "epoch"
deepspeed: 'stllm/train/zero3.json'
deepspeed: 'stllm/train/zero3_offload.json'
```
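Besides initialization, a loss of exactly 0 is worth cross-checking against the batches themselves. Assuming this codebase uses HF-style label masking (the usual `-100` ignore index, which is an assumption here), a batch whose targets are entirely masked carries no supervision and yields a degenerate loss. A quick per-batch check, sketched over plain token-id lists:

```python
def supervised_fraction(labels, ignore_index=-100):
    """Fraction of label tokens that actually contribute to the loss.
    Log this each step; 0.0 means the batch has no supervised tokens
    (e.g. every target was masked out by prompt truncation)."""
    flat = [tok for seq in labels for tok in seq]
    kept = sum(1 for tok in flat if tok != ignore_index)
    return kept / max(1, len(flat))
```

With `max_txt_len: 256` and `model_max_length: 1024`, logging this alongside the loss (`logging_steps: 1` is already set) would show whether the zero-loss steps coincide with fully masked batches.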