IVGSZ / Flash-VStream

This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
https://invinciblewyq.github.io/vstream-page/
Apache License 2.0

Question About Training #6

Open SaMMyCHoo opened 1 month ago

SaMMyCHoo commented 1 month ago

Hi there, I'm very interested in your work, and I am trying to train the model from scratch following the instructions. However, I'm having trouble running the code. It fails with "AttributeError: 'VStreamConfig' object has no attribute 'mm_hidden_size'. Did you mean: 'hidden_size'?"

After hitting this error, I followed the suggestion and changed mm_hidden_size to hidden_size. Now I get: "TypeError: build_vision_projector() missing 1 required positional argument: 'input_dim'". I have no idea how to solve this.

I'm wondering if you could provide some help. I'd appreciate it if you could reply as soon as possible. Best regards.

zhang9302002 commented 1 month ago

Hello! Thanks for your attention. mm_hidden_size is the hidden size of the visual encoder (CLIP ViT), while hidden_size is the hidden size of the LLM (Vicuna), so the two are not interchangeable. Please add these lines to config.json in the pretrained LLM folder:

  "mm_hidden_size": 1024,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "./ckpt/clip-vit-large-patch14",

(You can set mm_vision_tower to any appropriate local path or URL.)
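
For anyone who prefers to script this edit rather than change config.json by hand, here is a minimal sketch. It assumes config.json is a plain JSON object, and it only merges the keys listed above into it. The helper name add_mm_settings and the example path in the comment are hypothetical and are not part of the Flash-VStream code.

```python
import json
from pathlib import Path

# Keys and values copied from the reply above (JSON null/false become None/False).
MM_SETTINGS = {
    "mm_hidden_size": 1024,
    "mm_projector_type": "mlp2x_gelu",
    "mm_resampler_type": None,
    "mm_use_im_patch_token": False,
    "mm_use_im_start_end": False,
    "mm_vision_select_feature": "patch",
    "mm_vision_select_layer": -2,
    "mm_vision_tower": "./ckpt/clip-vit-large-patch14",
}


def add_mm_settings(config_path, settings=MM_SETTINGS):
    """Merge the multimodal keys into an existing config.json and write it back.

    Hypothetical helper: existing keys such as hidden_size are left untouched,
    and any mm_* key already present is overwritten.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.update(settings)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (the path is an assumption; point it at your own LLM folder):
# add_mm_settings("./ckpt/your-llm-folder/config.json")
```

Back up the original config.json before running something like this, since the file is rewritten in place.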

SaMMyCHoo commented 1 month ago

Thanks!