ByungKwanLee / MoAI

Official PyTorch implementation code for realizing the technical part of Mixture of All Intelligence (MoAI) to improve performance of numerous zero-shot vision language tasks. (Under Review)

The training process detail #12

Open lucasjinreal opened 3 months ago

lucasjinreal commented 3 months ago

Hi, did you first train the projector and then train the projector + LLM? What are the details of these stages?

ByungKwanLee commented 3 months ago

Oh! Before using the external CV models, we briefly trained the projector alone for about 1 percent of the total batches, and then we froze it. I think this does not affect performance much, which is also evaluated in Apple's MM1 paper.

We will add this minor training procedure to the manuscript. Thanks a lot!

lucasjinreal commented 3 months ago

So first you train the projector only, then freeze the projector and train the ViT and LLM? Why freeze the projector in the second stage?

ByungKwanLee commented 3 months ago

This is because I did not observe any benefit from training the projector in the second stage, at least on our model. Training the MoAI-Mixer modules seemed more effective.

Actually, there is also no hard reason not to train the vision encoder of MoAI. LLaVA-1.6 (Microsoft) and MM1 (Apple) trained the vision encoder, but MoAI did not adopt vision encoder training or projector training.
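
For readers following along, here is a minimal PyTorch sketch of the freezing scheme described above (projector warmed up briefly, then frozen together with the vision encoder while the MoAI-Mixer modules train). The attribute names vision_encoder, projector, moai_mixer, and llm are placeholders and may not match the actual MoAI code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool):
    # Toggle gradient computation for every parameter of a module.
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 (first ~1% of batches): warm up the projector only.
def configure_stage1(model):
    set_trainable(model.vision_encoder, False)
    set_trainable(model.llm, False)
    set_trainable(model.projector, True)   # only the bridge MLP learns

# Stage 2: freeze the projector as well; train the MoAI-Mixer modules
# (the backbone LLM is handled via QLoRA adapters, discussed later in this thread).
def configure_stage2(model):
    set_trainable(model.vision_encoder, False)
    set_trainable(model.projector, False)  # frozen after the short warm-up
    set_trainable(model.moai_mixer, True)
```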

lucasjinreal commented 3 months ago

Most of the time, unfreezing the vision encoder for training gives worse results. Do you think you could get better results if you trained them all?

ByungKwanLee commented 3 months ago

I am not convinced that training the vision encoder always gives worse results. I think it depends on the model setup or training setup. Recently, I have read more papers in which training the vision encoder leads to a performance gain.

I would recommend reading Figure 10(c) of the MM1 paper [link].

In addition, the LLaVA-1.6 blog [link] describes full-model training at the second stage.

Thanks for the great discussion of the training details!

lucasjinreal commented 3 months ago

The MM1 paper claims that an unfrozen ViT only works better than a frozen one when there are many image tokens.

It might depend on the LLM and projector design.

ByungKwanLee commented 3 months ago

Yes, recent papers use lots of image tokens through dynamic image resolution. As I said, it depends on the model design. It makes sense that more image features can represent richer information for VL tasks.

cassiaaaaaa commented 3 months ago

Sorry to ask a possibly naive question. What does the "projector" you mentioned mean? Is this MLP the projector: "Two linear layers with GELU activation function serve as the bridge connector between vision and language components, denoted by 'MLP'"?

ByungKwanLee commented 3 months ago

The projector means a few MLP layers whose role is to serve as a bridge connector from the vision encoder to the backbone multimodal LLM.
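
As a rough illustration of the quoted description ("two linear layers with GELU"), a bridge connector might look like the sketch below; the dimensions are placeholders, not the actual MoAI configuration.

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Bridge connector sketch: two linear layers with a GELU in between,
    following the description quoted above. Dimensions are placeholders."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):
        # (batch, num_image_tokens, vision_dim) -> (batch, num_image_tokens, llm_dim)
        return self.mlp(vision_feats)
```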

cassiaaaaaa commented 3 months ago

Thanks for your reply. I noticed that you used QLoRA in training. Did you use QLoRA when training the projector, or only in the later steps?

ByungKwanLee commented 3 months ago

There is no reason to quantize the projector. Thus, QLoRA is used for the backbone LLM only.
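
For context, a hedged sketch of applying QLoRA to the backbone LLM only (the projector stays in full precision), using Hugging Face transformers and peft. The LoRA hyperparameters and target_modules below are placeholders and may differ from the actual MoAI training setup.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization for the backbone LLM only.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

llm = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
llm = prepare_model_for_kbit_training(llm)

# LoRA adapters on the quantized LLM; target_modules are placeholders,
# adjust them to the actual InternLM2 layer names.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["wqkv", "wo"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
```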

cassiaaaaaa commented 3 months ago

Thanks very much. Sorry to bother you again. Another question: did you use InternLM-7B, InternLM2-7B, or InternLM2-base-7B as the base model? And did you base the training code on "InternLM/lmdeploy"?

ByungKwanLee commented 3 months ago

We used https://huggingface.co/internlm/internlm2-7b, which has the highest likes.

cassiaaaaaa commented 3 months ago

Thanks very much.

cassiaaaaaa commented 3 months ago

Sorry to bother you again. I noticed that you wrote that all training is based on LLaVA-Instruct-655K filtered by ShareGPT4V. When you train the vision projector, you mentioned using 1/10 of the dataset; are those samples randomly selected from LLaVA-Instruct-655K? Why didn't you use the LLaVA pretraining dataset, LCS-558K?

ByungKwanLee commented 3 months ago

Yes, we used randomly selected data through the PyTorch dataloader. It does not affect performance much. LLaVA trained only the projector in the pretraining stage with the pretraining data.

The MM1 Apple paper has also shown that no matter which type of projector we choose, there is no performance difference.

Combining these results, we can conclude that the pretraining dataset is not necessary, because LLaVA only trained the projector with the pretraining dataset.

Technically, I also observed that there is no need to pretrain in terms of performance. The reason may be that the only difference between the pretraining and instruction datasets is the answer length; both consist of instruction samples.

Instead, my experience is that the most important factor for improving performance is not just the number of data samples (assuming we already have enough samples; it is easy to misread this as saying a small amount of data can be performant) but the injected external knowledge.
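
A minimal sketch of the random sub-sampling with a plain PyTorch dataloader, as mentioned above; the fraction, batch size, and dataset object are placeholders.

```python
import torch
from torch.utils.data import DataLoader, Subset

def make_subset_loader(dataset, fraction=0.1, batch_size=32, seed=0):
    # Randomly pick a fraction of the full instruction dataset
    # (e.g. for the short projector warm-up stage).
    g = torch.Generator().manual_seed(seed)
    num_samples = int(len(dataset) * fraction)
    indices = torch.randperm(len(dataset), generator=g)[:num_samples].tolist()
    return DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=True)
```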

cassiaaaaaa commented 2 months ago

Sorry to bother you again. I have more questions about the training. How many GPUs did you use, and how long was the training time for the two stages?

ByungKwanLee commented 2 months ago

We use approximately 665K training samples, and each training stage takes two or three days on 6 x A6000 GPUs.
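
As a back-of-the-envelope illustration only: with the 665K samples and 6 GPUs stated above, and a hypothetical per-GPU batch size and gradient-accumulation setting (neither is given in this thread), the number of optimizer steps per epoch works out as follows.

```python
num_samples = 665_000   # from the reply above
num_gpus = 6            # 6 x A6000
per_gpu_batch = 4       # hypothetical; not stated in the thread
grad_accum = 8          # hypothetical; not stated in the thread

effective_batch = num_gpus * per_gpu_batch * grad_accum   # 192
steps_per_epoch = num_samples // effective_batch          # about 3463 optimizer steps
print(effective_batch, steps_per_epoch)
```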

cassiaaaaaa commented 1 month ago

Dear author, thanks a lot for your help! I have another question. In the inference code, you build the prompt with `prompt = " [UNUSED_TOKEN_146]user\n" + prompt + "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"`, which seems a little different from LLaVA. In training, is the prompt processed in the same way as in inference?

ByungKwanLee commented 1 month ago

The input prompt is important for instruction tuning, and it really depends on the language model. However, at inference the prompt might not be too sensitive for generating the desired answers.
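
For reference, the quoted inference snippet corresponds to something like the helper below; the function name is made up here, but the special-token strings are taken verbatim from the snippet quoted above, and the assistant turn is left open for generation.

```python
def build_inference_prompt(user_text: str) -> str:
    # Wrap the user message with the special tokens used in the quoted inference code.
    return (
        " [UNUSED_TOKEN_146]user\n"
        + user_text
        + "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
    )

# Example:
# prompt = build_inference_prompt("Describe the image.")
```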

cassiaaaaaa commented 1 month ago

Thanks a lot for your kind reply! That's very useful. So, in training, you changed the system prompt to "AI assistant should give helpful and detailed answers to user after fully understanding an image." and kept the rest of the conversation setup the same as LLaVA?

ByungKwanLee commented 1 month ago

Yes. In my experience, however, the content of the system prompt did not affect performance much. It is just a format.

cassiaaaaaa commented 1 month ago

That is quite reasonable. I am having trouble reproducing the training code. At inference, the CV models do not seem very fast; it takes several seconds to run all the CV models for one image. How did you achieve high speed in training? By the way, is your training code based on LLaVA, InternLM, or InternLM-XComposer?