dvlab-research / LLaMA-VID

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Apache License 2.0

why delay_load in build_vision_tower(config, delay_load=True)? #47

Closed dragen1860 closed 8 months ago

dragen1860 commented 9 months ago

Dear author, Q1: I see that the vision tower can be created at the init stage, but delay_load=True is confusing. Why create the EVA network without loading its weights?


Q2: I notice that both mm_vision_tower and vision_tower appear as hyperparameters in the config, and I am confused about why there are two. Are they the same?

yanwei-li commented 8 months ago

Hi, our pipeline is built on LLaVA, and this function comes from LLaVA. Please refer to LLaVA for this issue.
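
For readers following that pointer, here is a minimal sketch of the two patterns the questions touch on. It is not the LLaVA or LLaMA-VID code: LazyVisionTower, resolve_tower_name, build_tower, and the "eva-placeholder" name are all hypothetical, and the "weights" are a plain string. As far as I understand LLaVA, delay_load=True builds the tower object without reading pretrained weights, so that a later step (such as the checkpoint loader) can call load_model() explicitly. The tower name is looked up under mm_vision_tower first, with vision_tower as a fallback, so the two keys normally refer to the same model.

```python
from types import SimpleNamespace


class LazyVisionTower:
    """Hypothetical stand-in for a vision tower (not a real LLaVA class)."""

    def __init__(self, name, delay_load=False):
        self.name = name
        self.is_loaded = False
        self.weights = None
        # With delay_load=True only the lightweight object is created here.
        if not delay_load:
            self.load_model()

    def load_model(self):
        # Idempotent: calling it twice does not reload anything.
        if self.is_loaded:
            return
        # A real tower would read pretrained weights from disk at this point.
        self.weights = f"weights-for-{self.name}"
        self.is_loaded = True


def resolve_tower_name(config):
    # Prefer mm_vision_tower; fall back to vision_tower; else None.
    return getattr(config, "mm_vision_tower",
                   getattr(config, "vision_tower", None))


def build_tower(config, delay_load=False):
    return LazyVisionTower(resolve_tower_name(config), delay_load=delay_load)


# A config that only carries the mm_vision_tower key (assumed layout).
saved_cfg = SimpleNamespace(mm_vision_tower="eva-placeholder")
tower = build_tower(saved_cfg, delay_load=True)
loaded_before = tower.is_loaded   # no weights read yet
tower.load_model()                # explicit, deferred load
loaded_after = tower.is_loaded
print(loaded_before, loaded_after)
```

Deferring the load this way keeps model construction cheap and leaves the decision of when to read the vision weights to the caller.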