YasserdahouML opened this issue 1 month ago
Both the 4096 x 64 and 64 x 4096 projections together are more efficient than a single 4096 x 4096 one, because of the simple inequality 4096 x 64 + 64 x 4096 < 4096 x 4096.
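Concretely, a quick back-of-the-envelope sketch of the parameter and FLOP count behind that inequality (d = 4096 and r = 64 as above; the variable names are illustrative):

```python
d, r = 4096, 64

dense_params = d * d            # a single 4096 x 4096 projection
lowrank_params = d * r + r * d  # 4096 x 64 followed by 64 x 4096

print(f"dense:    {dense_params:,}")    # 16,777,216
print(f"low-rank: {lowrank_params:,}")  # 524,288  (32x fewer)

# Per-token matmul FLOPs scale the same way: x @ W costs ~d*d
# multiply-adds, while (x @ A) @ B costs ~(d*r + r*d).
```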
Yes, I did try that, but directly using the aux outputs often embeds wrong information, because computer vision models are not perfect and sometimes produce wrong answers.
I am referring specifically to this: https://github.com/ByungKwanLee/MoAI/blob/a7728a8d1c8df27d3221708a4ca4366e271f51c8/moai/arch/build_mlp.py#L211
This line does not seem to be in InternLM, but in internlm-xcomposer2-7b. Does that mean you start from a VLM and adapt it, rather than from an LLM only?
Based on InternLM2, we tried to employ image-part adaptation so the VLM knows where the image tokens are, and the results show that it somewhat improves performance. Therefore, we train it jointly with the MoAI components. We will update the details of both this part and the vision projector training. Nonetheless, we observed that MoAI provided quite good performance compared with LLaVA-7B or any other baselines even without it.
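To illustrate what image-part adaptation means here, below is a minimal PLoRA-style sketch (my paraphrase of the idea from InternLM-XComposer2, not the actual MoAI code; the class and argument names are made up, assuming a 4096-dim hidden size and rank 64):

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Sketch of a PLoRA-style layer: the frozen base projection is applied
    to every token, while the low-rank update touches only image tokens."""

    def __init__(self, d_model: int = 4096, rank: int = 64):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        self.base.weight.requires_grad = False            # frozen pretrained weight
        self.lora_a = nn.Linear(d_model, rank, bias=False)   # 4096 -> 64
        self.lora_b = nn.Linear(rank, d_model, bias=False)   # 64 -> 4096

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); image_mask: (batch, seq) bool,
        # True where a token belongs to the image part of the sequence.
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x))
        return out + delta * image_mask.unsqueeze(-1).to(delta.dtype)
```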
Do you mean you use PLoRA at the SFT stage when training the MoAI modules, or do you use the PLoRA weights provided by InternLM-XComposer2? And why not show results without the PLoRA part (MoAI modules only)?
This is the bit of code I am referring to: https://huggingface.co/internlm/internlm-xcomposer2-7b/blob/main/build_mlp.py#L205
I mean I used PLoRA at the SFT stage when training the MoAI modules. And I will ablate it with and without PLoRA, but it gives a 2~3% margin.
Can you point me to which LLM weights you used, please? I compared the weights of your model to the ones in InternLM2-7B and they are different; does this suggest that you trained the LLM? In fact, they are close to the internlm-xcomposer2 ones, and the same holds for the CLIP weights, which resemble internlm-xcomposer2's rather than OpenAI's. Can you share more details about the pretraining stage, and what kind of weights you started from for the SFT stage?
Hello, I saw in the paper that the low-rank decomposition (r = 64) is presented as saving computation, but in the code, at https://github.com/ByungKwanLee/MoAI/blob/a7728a8d1c8df27d3221708a4ca4366e271f51c8/moai/arch/expert_module.py#L143, attention is still performed at d = 4096 instead of r = 64. How does this save compute?
Also, did you try just concatenating the aux info to the LLM inputs along with the image features and the prompt? We see the answer in the repo, but did you actually train it this way?
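To be clear about what I mean, something like this minimal sketch (made-up tensor names; it assumes the aux CV outputs are already embedded into the LLM's input space):

```python
import torch

def naive_concat_inputs(aux_embeds, image_embeds, prompt_embeds):
    """Prepend the embedded auxiliary CV outputs to the usual
    (image, prompt) sequence and let the LLM attend over all of it.

    All three tensors are assumed to be (batch, seq_i, hidden) and
    already projected into the LLM's embedding space.
    """
    return torch.cat([aux_embeds, image_embeds, prompt_embeds], dim=1)
```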