Video data loading problem during fine-tuning

During stage-2 fine-tuning, messages about problematic videos are printed:

Error loading data at index 23971: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_IrTqW6Qn8mI.mp4
Error loading data at index 71068: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_aV5DMcsNMmk.mp4
Error loading data at index 51648: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_MlbM7Mew0Ys.mp4
Error loading data at index 80768: 'video'
Error loading data at index 29235: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_AA1wvSZ4Mno.mp4
Error loading data at index 81059: 'video'
Error loading data at index 99616: 'video'
Error loading data at index 80812: 'video'
Error loading data at index 91812: 'video'
Error loading data at index 20455: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_J_SD_hhGET8.mp4
......
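
The log seems to mix two kinds of failure: "Video not found" for a path, and a bare 'video', which looks like the string form of a Python KeyError raised when a sample has no "video" field. A minimal sketch for separating the two cases before training is given here. It assumes the annotations are a list of dicts whose "video" value is a file name relative to the video directory; the function name `find_bad_samples` and the parameter `video_root` are hypothetical and not part of the repository.

```python
import os


def find_bad_samples(annotations, video_root):
    """Return (indices without a 'video' key, indices whose video file is absent).

    Assumption: each annotation is a dict and, when present, its 'video'
    value is a file name relative to video_root.
    """
    missing_key, missing_file = [], []
    for idx, sample in enumerate(annotations):
        if "video" not in sample:
            # Indexing sample["video"] here would raise KeyError('video').
            missing_key.append(idx)
        elif not os.path.isfile(os.path.join(video_root, sample["video"])):
            missing_file.append(idx)
    return missing_key, missing_file
```

The annotation list can be loaded with `json.load` from the dataset's annotation file and passed in together with the activitynet_videos directory. If the first list is non-empty, those samples may belong to a different dataset (for example, image-only entries) rather than being broken videos.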

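My analysis shows that no image frames are extracted from some videos: `ret` comes back as False at this line. This happens only for some videos; the others return True as expected, and the files behind the failing paths do exist in the dataset: https://github.com/Coobiw/MPP-LLaVA/blob/cfd419c3a156f747fe25871e6a1eeb4beeb9fe0c/lavis/datasets/datasets/video_instructions.py#L43

That in turn triggers the message printed here: https://github.com/Coobiw/MPP-LLaVA/blob/cfd419c3a156f747fe25871e6a1eeb4beeb9fe0c/lavis/datasets/datasets/video_instructions.py#L55

So no frame was read from these videos, yet when I downloaded the videos named in the errors, they turned out to be fine. I currently don't know where the problem is.

Config file sft.yaml attached: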
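
To narrow this down, it may help to probe one failing file in the same environment that runs the training. The sketch below assumes that frames are decoded with OpenCV's `cv2.VideoCapture` (suggested by the `ret` return value, but not confirmed here); `probe_video` is a hypothetical helper, not a function from the repository.

```python
import os


def probe_video(path):
    """Collect basic facts about one video file to help explain a failed read."""
    report = {"path": path, "exists": os.path.isfile(path)}
    if not report["exists"]:
        return report
    report["size_bytes"] = os.path.getsize(path)

    # Assumption: the dataset decodes frames with OpenCV. The import sits
    # here so that the existence check above works without OpenCV installed.
    import cv2

    cap = cv2.VideoCapture(path)
    try:
        report["opened"] = cap.isOpened()
        # Frame count as reported by the container; it can be 0 or inaccurate.
        report["frame_count"] = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        ret, _ = cap.read()
        report["first_frame_ok"] = bool(ret)
    finally:
        cap.release()
    return report
```

Calling `probe_video` on a path copied from the log, and again on the downloaded copy of the same video, gives two reports to compare. A different `size_bytes` would hint at an incomplete copy under /dev/shm, while `opened` being False on an intact file might point to a codec that the installed OpenCV build cannot decode. Both are only possibilities to check, not confirmed causes.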

model:
  arch: minigpt4qwen
  model_type: qwen7b_chat
  load_finetuned: True
  load_pretrained: True

  # pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/blip2_pretrained_flant5xxl.pth"
  pretrained: "ckpt/blip2/blip2_pretrained_flant5xxl.pth"
  finetuned: "/dev/shm/vlm/MiniGPT4Qwen/lavis/output/pp_7b_video/pretrain/global_step295/model.pth"

  # vit encoder
  vit_model: "eva_clip_g"
  image_size: 224
  drop_path_rate: 0
  use_grad_checkpoint: True
  vit_precision: "fp16"  # to unfreeze and train the ViT, change this to fp32; otherwise AMP mixed-precision training fails (error at the scaler, because no fp16 AdamW is implemented)
  freeze_vit: True
  unfreeze_pos_embed: False

  # Q-Former
  num_query_token: 32
  qformer_text_input: False
  freeze_qformer: True
  freeze_queries: True

  # projection
  freeze_proj: False

  # path to Vicuna checkpoint
  llm_model: "/dev/shm/vlm/MiniGPT4Qwen/cache/ckpt/Qwen-7B-Chat"

  # unfreeze LLM for better chat
  freeze_llm: False

  # lora config
  get_lora: False
  lora_alpha: 32
  lora_r: 8
  lora_dropout: 0.05

  # text length when training
  max_txt_len: 1536 # 512

  # enable autocast of vit
  enable_autocast: False

datasets:
  llava_instruct_156k: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 200

  videochatgpt_100k: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 200

run:
  output_dir: "lavis/output/pp_7b_video/sft_video/"

  task: deepspeed_image_text_pretrain

  num_workers: 4

  seed: 42

  world_size: 1
  dist_url: "env://"
  distributed: True

  max_epoch: 1
  log_freq: 10

  lr_sched: "linear_warmup_cosine_lr_step-wise"
  warmup_lr: 0
  init_lr: 2e-5
  min_lr: 0
  warmup_ratio: 0.1

  deepspeed_config:
    # global batch = 128 = n_ranks * grad_acc_steps * micro_batch_size = (4//2) * 64 * 1
    # 8 x 3090
    # pp=8 dp=1 nproc=pp*dp=8 
    gradient_accumulation_steps: 128 # 128 // dp(=1) // bs_per_gpu(=1) = 128
    train_micro_batch_size_per_gpu: 1

    gradient_clipping: 1.
    steps_per_print: 10
    wall_clock_breakdown: false
    dump_state: False

    fp16:
        enabled: false
        loss_scale: 0
        loss_scale_window: 1000
        initial_scale_power: 16
        hysteresis: 2
        min_loss_scale: 1

    bf16:
        enabled: true

    optimizer:
        type: "AdamW"
        params:
            lr: 2e-5
            betas: [0.9,0.99]
            eps: 1e-7
            weight_decay: 0.

    zero_optimization:
        stage: 0
        # offload_optimizer:
        #   device: "cpu"
        #   pin_memory: true
        allgather_partitions: true
        allgather_bucket_size: 2e8
        overlap_comm: true
        reduce_scatter: true
        reduce_bucket_size: 2e8
        contiguous_gradients: true

Coobiw / MPP-LLaVA

Video data loading problem during fine-tuning #30