dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Why is `build_vision_tower` called twice? #42

Closed dragen1860 closed 5 months ago

dragen1860 commented 6 months ago
  1. When the model class is created, `build_vision_tower` is called the first time.
  2. After that, `initialize_vision_modules` is called from `train()` in train.py:
```python
    if model_args.vision_tower is not None:
        model.get_model().initialize_vision_modules(
            model_args=model_args,
            fsdp=training_args.fsdp,
            max_token=training_args.model_max_length
        )
```

So it looks like the vision tower is built twice. Is my understanding correct?

yanwei-li commented 5 months ago

Hi, we use LLaVA as our pipeline, and this function comes from LLaVA. Please refer to the LLaVA repo for this issue.
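
For readers hitting the same question: LLaVA-style code typically guards the second call so an already-built vision tower is reused rather than rebuilt. Below is a minimal sketch of that guard pattern under that assumption; all class, function, and argument names here are illustrative stand-ins, not the actual LLaMA-VID/LLaVA source.

```python
# Counts how many times the (stand-in) builder actually runs.
build_calls = 0

def build_vision_tower(model_args):
    """Stand-in for the real builder; returns a placeholder tower object."""
    global build_calls
    build_calls += 1
    return object()  # placeholder for e.g. a CLIP vision tower

class VisionModel:
    """Illustrative model that may build a tower at construction time."""

    def __init__(self, model_args=None):
        # Step 1 from the issue: first build at class-creation time
        # (only if the config already specifies a vision tower).
        self.vision_tower = build_vision_tower(model_args) if model_args else None

    def initialize_vision_modules(self, model_args):
        # Step 2 from the issue: called again from train().
        # Guard: only build if no tower exists yet; otherwise reuse it.
        if self.vision_tower is None:
            self.vision_tower = build_vision_tower(model_args)
        return self.vision_tower

# With the guard, the builder runs once even though both code paths execute.
model = VisionModel(model_args={"vision_tower": "clip-vit"})
model.initialize_vision_modules(model_args={"vision_tower": "clip-vit"})
```

With this guard in place, the second call is effectively a no-op for the tower itself (in practice it may still load weights or update config), so seeing `build_vision_tower` referenced in two places does not necessarily mean the tower is constructed twice.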