In stage 1, we only trained the mapping structure (including the `query_vision_encoder_linear` and mapping transformers). This can be achieved by simply masking out the query features corresponding to the text encoder, i.e., reducing the feature vector from [text encoder output, mapping structure output] to [0, mapping structure output] when calculating the late interaction scores.
If you don't mind, could you share the relevant code? Thanks!
Hi, I forgot to mention that I do have some relevant code for this. You can set `concat_output_from_text_encoder=False` when calling the `query()` function at inference, and set `query_concat_output_from_text_encoder=False` when calling the `forward()` function during training, to mask out the text-encoder outputs.
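For reference, here is a minimal, self-contained sketch (not the repo's code) of what that masking amounts to when the late-interaction score is computed. The tensor shapes and names are made up for illustration; the flags above presumably perform the equivalent operation inside `query()`/`forward()`.

```python
import torch

def late_interaction_score(query_embeds: torch.Tensor, doc_embeds: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: for each query token, take the maximum
    similarity over all document tokens, then sum over query tokens."""
    sim = query_embeds @ doc_embeds.T          # (n_query_tokens, n_doc_tokens)
    return sim.max(dim=-1).values.sum()

# Hypothetical sizes: the query matrix is the concatenation
# [text-encoder outputs (n_text tokens), mapping-structure outputs (n_map tokens)].
n_text, n_map, dim = 32, 32, 128
query_embeds = torch.randn(n_text + n_map, dim)
doc_embeds = torch.randn(100, dim)

# Stage-1 masking: [text, mapping] -> [0, mapping]. A zero query row yields an
# all-zero similarity row, whose max is 0, so only the mapping-structure tokens
# contribute to the late-interaction score.
masked_query = query_embeds.clone()
masked_query[:n_text] = 0.0

score = late_interaction_score(masked_query, doc_embeds)
print(score)
```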
Also, I noticed that the `transformer_mapping_network()` function has a `text_encoder_hidden_states` argument. Does this cause gradients to flow back into the text encoder during training?
Yes, the transformer mapping network takes in the features from the second-to-last layer of the ViT and the last layer of the text encoder. We freeze both the text encoder and the ViT during Phase 1 pretraining.
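On the gradient question, here is a minimal sketch of what freezing the two encoders looks like in PyTorch. `TinyModel` and its attribute names are stand-ins for illustration, not the repo's actual classes.

```python
import torch
from torch import nn

# Stand-in modules; in the real model these would be the ColBERTv2 text encoder,
# the ViT, and the transformer mapping network.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(16, 16)
        self.vision_encoder = nn.Linear(16, 16)
        self.transformer_mapping_network = nn.Linear(16, 16)

model = TinyModel()

# Freeze the text encoder and the ViT: their parameters receive no updates in
# Stage 1, even though their activations still flow through the mapping network.
for module in (model.text_encoder, model.vision_encoder):
    module.eval()
    for p in module.parameters():
        p.requires_grad_(False)

# Only the remaining (mapping) parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # hypothetical learning rate
```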
I'm sorry to bother you again, but how is the data for Stage 1 pre-training organized, given that it uses many datasets?
You can refer to the appendix of the paper for the data used in Stage 1. To merge the data, you can use Hugging Face's `datasets.concatenate_datasets` to combine the different subsets and then randomly sample one entry at a time; note that the negative docs should be chosen from the same subset's passage pool. This should be fairly easy. Another approach is to write a new dataloader that dynamically loads all subsets and chooses a subset according to a rule. This is slightly more complicated but offers great flexibility in controlling the ratio of each subset.
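A minimal sketch of the `concatenate_datasets` approach, with made-up column names and toy data. The key point is tagging each row with its subset so negatives are drawn from the same subset's passage pool.

```python
import random
from datasets import Dataset, concatenate_datasets

# Toy subsets standing in for the Stage-1 pre-training sources; column names
# ("question", "pos_passage") are made up for illustration.
subset_a = Dataset.from_dict({
    "question": ["q1", "q2"], "pos_passage": ["pa1", "pa2"],
}).add_column("subset", ["A", "A"])
subset_b = Dataset.from_dict({
    "question": ["q3", "q4"], "pos_passage": ["pb1", "pb2"],
}).add_column("subset", ["B", "B"])

# Per-subset passage pools, so negatives never come from another subset.
passage_pools = {"A": ["pa1", "pa2", "pa3"], "B": ["pb1", "pb2", "pb3"]}

# Merge the subsets and shuffle so entries are drawn randomly across subsets.
merged = concatenate_datasets([subset_a, subset_b]).shuffle(seed=42)

def sample_negatives(example, k=2):
    pool = [p for p in passage_pools[example["subset"]]
            if p != example["pos_passage"]]
    example["neg_passages"] = random.sample(pool, k=min(k, len(pool)))
    return example

merged = merged.map(sample_negatives)
print(merged[0])
```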
How are the initial parameters of the query vision encoder obtained for the pre-training phase? I only saw random initialization in the code and would like to know how the specific initial parameters are chosen. Thanks!
You can manually load the pre-trained weights from the ViT and from ColBERTv2 using PyTorch's load function. The mapping network is randomly initialised.
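A minimal sketch of that loading step, with placeholder checkpoint paths and module names (the repo's real attribute names may differ).

```python
import torch
from torch import nn

# Placeholder module structure; the attribute names are assumptions.
class QueryEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)            # stands in for the ViT
        self.text_encoder = nn.Linear(8, 8)              # stands in for ColBERTv2's encoder
        self.transformer_mapping_network = nn.Linear(8, 8)

model = QueryEncoder()

# Replace these placeholder paths with the actual checkpoint files.
vit_state = torch.load("path/to/vit_checkpoint.pt", map_location="cpu")
colbert_state = torch.load("path/to/colbertv2_checkpoint.pt", map_location="cpu")

# strict=False tolerates keys that exist on only one side (e.g. heads you drop).
model.vision_encoder.load_state_dict(vit_state, strict=False)
model.text_encoder.load_state_dict(colbert_state, strict=False)

# The mapping network keeps its default (random) initialisation.
```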
Hello, based on the amount of data shown in Table 9 of your paper and the Stage 1 training settings, I calculated that the number of training epochs is about 2. Is this calculation correct? Thank you for your answer!
Sorry for the late reply; I have been busy. I noticed that Table 9 is slightly wrong. As stated in the main paper, the KB-VQA datasets (EVQA, Infoseek, OKVQA) are not used in Stage 1. Therefore, the total number of training examples is ~4550k, and the total number of epochs should be around 12 for Stage 1.
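For anyone re-doing the arithmetic: the epoch count is simply the number of example passes seen during training divided by the training-set size. The batch size and step count are not quoted in this thread, so the sketch below only restates that relationship using the figures given above.

```python
# Epochs = (training steps * global batch size) / number of training examples.
total_examples = 4_550_000          # Stage-1 total after excluding the KB-VQA sets
epochs = 12                          # figure quoted in the reply above
examples_seen = epochs * total_examples
print(f"~{examples_seen / 1e6:.1f}M example passes over Stage 1")  # ~54.6M
```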
Hi, I have some questions about pretraining. It seems there is no code for the Stage 1 pretraining in the repo. Could you share more details about this stage? Thanks!