LinWeizheDragon / FLMR

The huggingface implementation of Fine-grained Late-interaction Multi-modal Retriever.

Something about pretraining #30

Open Baeksweety opened 1 month ago

Baeksweety commented 1 month ago

Hi, I have some questions about pretraining. It seems that there is no code for the Stage 1 pretraining. I would like to know more details about this stage. Thanks!

LinWeizheDragon commented 1 month ago

In stage 1, we only trained the mapping structure (including the query_vision_encoder_linear and mapping transformers). This can be achieved by simply masking out the query features corresponding to the text encoder, i.e., reducing the feature vector from [text encoder output, mapping structure output] to [0, mapping structure output] when calculating the late interaction scores.
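A minimal sketch of the masking trick described above (not the repo's exact code): zero out the text-encoder part of the query representation before computing the late-interaction (MaxSim) score, so only the mapping structure receives gradients. All tensor shapes are illustrative placeholders.

```python
import torch

def late_interaction_score(query_embeds: torch.Tensor, doc_embeds: torch.Tensor) -> torch.Tensor:
    # query_embeds: (num_query_tokens, dim), doc_embeds: (num_doc_tokens, dim)
    # MaxSim: for each query token, take the max similarity over doc tokens, then sum.
    sim = query_embeds @ doc_embeds.T
    return sim.max(dim=-1).values.sum()

dim = 128
text_query_embeds = torch.randn(32, dim)     # stands in for the text encoder output
mapping_query_embeds = torch.randn(32, dim)  # stands in for the mapping structure output
doc_embeds = torch.randn(180, dim)

# Query features = [text encoder output, mapping structure output]
full_query = torch.cat([text_query_embeds, mapping_query_embeds], dim=0)

# Stage 1: reduce [text, mapping] to [0, mapping] by masking the text part.
masked_query = full_query.clone()
masked_query[: text_query_embeds.shape[0]] = 0.0

score = late_interaction_score(masked_query, doc_embeds)
```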

Baeksweety commented 1 month ago

If you don't mind, could you share the relevant code? Thanks!

LinWeizheDragon commented 1 month ago

Hi, I forgot that I do have some relevant code for that. You can set concat_output_from_text_encoder=False when using the query() function at inference, and set query_concat_output_from_text_encoder=False when using the forward() function in training, to mask out the text outputs.
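For reference, a hedged sketch of how these two flags might be passed. The flag names come from the reply above; the other argument names and the overall call pattern are assumptions, so check them against the actual query()/forward() signatures in this repo.

```python
def encode_query_without_text(model, batch):
    # Inference: drop the text-encoder part of the query representation.
    return model.query(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        pixel_values=batch["pixel_values"],
        concat_output_from_text_encoder=False,  # mask out text outputs
    )

def training_step_without_text(model, batch):
    # Training: mask out text outputs inside the forward pass.
    return model(
        query_input_ids=batch["input_ids"],
        query_attention_mask=batch["attention_mask"],
        query_pixel_values=batch["pixel_values"],
        context_input_ids=batch["doc_input_ids"],
        context_attention_mask=batch["doc_attention_mask"],
        query_concat_output_from_text_encoder=False,  # mask out text outputs
    )
```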

Baeksweety commented 1 month ago

Also, I find that the transformer_mapping_network() function has the argument text_encoder_hidden_states; does this cause gradients to backpropagate into the text encoder during training?

LinWeizheDragon commented 1 month ago

Yes, the transformer mapping network takes in the features of the second-to-last layer of the ViT and the last layer of the text encoder. We freeze both the text encoder and the ViT encoder during Phase 1 pretraining.
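A minimal freezing sketch for Phase 1, so gradients only flow into the mapping structure. The submodule attribute names (query_text_encoder, query_vision_encoder) are assumptions; use whatever the actual FLMR model exposes.

```python
def freeze_encoders_for_stage1(model):
    # Assumed attribute names; adapt to the real FLMR model definition.
    for module in (model.query_text_encoder, model.query_vision_encoder):
        module.eval()  # also disables dropout updates in these parts
        for p in module.parameters():
            p.requires_grad_(False)

    # Only the mapping structure (linear projection + mapping transformers) is trained.
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"Trainable parameter tensors: {len(trainable)}")
```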

Baeksweety commented 4 weeks ago

Sorry to bother you again, but I want to ask how the data used for Stage 1 pre-training is organized, considering that it uses so many datasets.

LinWeizheDragon commented 4 weeks ago

You can refer to the appendix of the paper for the data used in Stage 1. To merge the data, you can use huggingface's datasets.concatenate_datasets to combine the different subsets, and have the program pick one entry at a time at random. Note that negative docs should be drawn from the same subset's passage pool. This should be fairly easy. Another approach is to write a new dataloader that dynamically loads all subsets and chooses a subset according to some rule. This is slightly more complicated but offers great flexibility in controlling the ratio of each subset.
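A hedged sketch of the first approach: merge subsets with datasets.concatenate_datasets while tagging each example with its source subset, so negatives can still be drawn from the same subset's passage pool. The field names ("subset_name", "negative_passages") and the per-subset passage pools are illustrative assumptions, not this repo's actual schema.

```python
import random
from datasets import Dataset, concatenate_datasets

def merge_subsets(subsets: dict[str, Dataset]) -> Dataset:
    tagged = []
    for name, ds in subsets.items():
        # Remember the source subset so negative sampling stays within it.
        tagged.append(ds.add_column("subset_name", [name] * len(ds)))
    return concatenate_datasets(tagged)

def sample_negatives(example: dict, passage_pools: dict[str, list[str]], k: int = 4) -> dict:
    # Draw negatives only from the passage pool of the example's own subset.
    pool = passage_pools[example["subset_name"]]
    return {**example, "negative_passages": random.sample(pool, k)}
```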

Baeksweety commented 3 weeks ago

How are the initialization parameters of the query vision encoder obtained during the pre-training phase? I only saw random initialization in the code and would like to know how the specific initialization parameters are chosen. Thanks!

LinWeizheDragon commented 3 weeks ago

You can manually load pre-trained weights from ViT and ColBERTv2 respectively using pytorch's load function. The mapping network is randomly initialised.
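A hedged sketch of initialising the encoders before Stage 1. The checkpoint names, file path, and model attribute names below are assumptions; the point is only that the vision/text encoders start from pretrained ViT / ColBERTv2 weights while the mapping network keeps its random initialisation.

```python
import torch
from transformers import CLIPVisionModel

def init_encoders_from_pretrained(model):
    # Vision side: copy weights from a pretrained CLIP ViT (assumed checkpoint name).
    pretrained_vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
    model.query_vision_encoder.load_state_dict(pretrained_vit.state_dict(), strict=False)

    # Text side: load a ColBERTv2 state dict (assumed local path) into the text encoder.
    colbert_state = torch.load("colbertv2.0/pytorch_model.bin", map_location="cpu")
    model.query_text_encoder.load_state_dict(colbert_state, strict=False)

    # The mapping structure (query_vision_encoder_linear + mapping transformers)
    # stays randomly initialised, as described above.
    return model
```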

Baeksweety commented 2 weeks ago

Hello, based on the amount of data shown in Table 9 of your article and the training settings of Stage 1, I calculated that the required number of training epochs is about 2. Is this calculation correct? Thank you for your answer!

LinWeizheDragon commented 2 weeks ago

Sorry for the late reply; I have been busy. I noticed that Table 9 is slightly wrong. As stated in the main paper, the KB-VQA datasets (EVQA, Infoseek, OKVQA) are not used in Stage 1. Therefore, the total number of training examples is ~4,550k, and the total number of epochs for Stage 1 should be around 12.