Hi,
I think I just located the issue. The Infoseek subset we uploaded seems to have the instruction already included in the question column, and the example code does not account for this, leading to a duplicated instruction in the query.
I commented out the code that adds a random instruction, and here is my reproduction:
2 A100 GPUs for indexing and 1 GPU for searching
ViT-B
Total number of questions: 4708
Recall@1: 0.24171622769753612
Recall@5: 0.49405267629566696
Recall@10: 0.6057774001699235
Recall@20: 0.7104927782497876
Recall@50: 0.81053525913339
Recall@100: 0.8602378929481733
ViT-G
Total number of questions: 4708
Recall@1: 0.3016142735768904
Recall@5: 0.5630841121495327
Recall@10: 0.6690739167374682
Recall@20: 0.7550977060322854
Recall@50: 0.8494052676295667
Recall@100: 0.8893372982158029
which is close to the reported values. There is still a small discrepancy; we found that it comes from the conversion of the old PyTorch model to the new HF model implementation. In fact, this has been observed in some of our fine-tuning experiments: fine-tuning the old PyTorch model class behaved slightly differently from fine-tuning the HF version, even though the model parameters are the same. This seems to be a common issue with conversions :(
Thanks for raising this issue. We plan to release an updated version of M2KR to fix these issues. The official Infoseek data has also changed; we will try to make a patch for that as well. We also plan to generate a new benchmark result table with the converted HF models. But of course, feel free to get the numbers with your own code, as long as the environment is kept consistent. In any case, fine-tuning the model on downstream tasks is recommended.
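For reference, the workaround described above (skipping the random-instruction code when the question column already contains an instruction) amounts to something like the following minimal sketch; the function and field names are placeholders, not the actual code in example_use_preflmr.py:

```python
import random

# Hypothetical sketch: build the retrieval query without prepending another
# instruction, since the M2KR Infoseek "question" column already contains one.
def build_query(example, instructions, add_random_instruction=False):
    question = example["question"]
    if add_random_instruction:
        # Original behaviour in the example script: prepend a sampled instruction.
        question = random.choice(instructions) + " " + question
    return question

# For the current Infoseek subset, call with add_random_instruction=False
# to avoid a duplicated instruction in the query.
```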
Thanks for the quick response and the effort to check! Yes, after I removed the random_instruction in example_use_preflmr.py, the results are very close to yours!
=============================
Inference summary:
=============================
Total number of questions: 4708
Recall@1: 0.308411214953271
Recall@5: 0.5635089209855565
Recall@10: 0.6709855564995751
Recall@20: 0.7646559048428208
Recall@50: 0.8525913338997451
Recall@100: 0.8927357689039932
=============================
Done! Program exiting...
My purpose is to reproduce the Infoseek results in Table 7 of the PreFLMR paper, where RA-VQAv2 w/ PreFLMR is 30.65. So I wonder if you could tell me about:
Deeply sorry for raising so many questions, and much appreciation for your generous help!
You don't have to fine-tune PreFLMR if you just want to train the RAG part. You can either use the HF model to get retrieved documents or simply use the files that I sent to you, which were generated by the PyTorch version of PreFLMR-G.
The BLIP2 with FLMR is an example of training BLIP2 with RAG. Unfortunately, the Infoseek config I had does not apply to the new M2KR dataset. If you want to use the same framework as OKVQA, you can write a config to replace OKVQA with Infoseek. Or you can write your own code with any framework you like to fine-tune a RAG-version BLIP2 (examples of which you should be able to find on the Internet) by passing in the retrieved documents.
We performed a naive RAG training in our experiments: Question: xxxx Knowledge: [retrieved doc k] Answer: xxxx. Other details (like how to perform inference) can be found in our BLIP2 script. The fine-tuning parameters can be found in Appendix B.2.
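As a concrete illustration of this prompt format, here is a minimal sketch; the split between input and target is an assumption for an encoder-decoder reader such as BLIP2 Flan-T5, and the field values are made up:

```python
# Hypothetical sketch of the naive RAG format described above:
#   Question: xxxx Knowledge: [retrieved doc k] Answer: xxxx
def build_rag_example(question: str, retrieved_doc: str, answer: str):
    source = f"Question: {question} Knowledge: {retrieved_doc} Answer:"
    target = answer  # decoder target for an encoder-decoder reader
    return source, target

source, target = build_rag_example(
    question="Which country is this building located in?",
    retrieved_doc="title: Eiffel Tower content: The Eiffel Tower is located in Paris, France.",
    answer="France",
)
```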
Thanks for the timely reply :)
For fine-tuning PreFLMR, I still want to fine-tune it on the M2KR Infoseek train split to reproduce the results in the paper. I found that the code excerpt at "Training with contrastive learning" does return a loss. If there is no training script yet, I can manually craft one with the HF Trainer, but I still need the hyperparameters. After carefully reading Appendix B.2 of the PreFLMR paper (page 17, bottom, the "Single-task Downstream Finetuning" section), I still need important hyperparameters such as the learning rate, number of training examples, number of epochs, and scheduler settings. I wonder if you could generously help.
For fine-tuning the BLIP2 Flan-T5 reader, I could manually craft scripts to achieve that. However, it also requires those important hyperparameters (learning rate, number of training examples, number of epochs, and scheduler), which the "VQA Finetuning" section on that page does not seem to show.
And, just to confirm: for each question in a training batch, the top 5 relevant documents were pre-extracted using the retriever, and 3 out of the 5 were randomly selected, so it is not guaranteed that there is at least one positive document, right? And one training example looks like Question: {question} Knowledge: title: {title1} content: {content1} title: {title2} content: {content2} title: {title3} content: {content3} Answer: {answer},
where the loss is only calculated on {answer}.
heartfelt appreciation for all your help!
The forward function returns both a loss and an in-batch negative loss; use the in-batch negative loss. The learning rate is kept at 1e-5 for all parameters, and the training set is the corresponding training set of Infoseek, with 4 negative examples per positive example. As for the scheduler, we just use Adam (either is fine) with the default settings. Note that Infoseek's training set and validation set have completely different distributions, which means the model will overfit very quickly. So on my side, 500-1000 steps were sufficient (in fact, I don't suggest fine-tuning on Infoseek to evaluate any retrieval model, since a more powerful model overfits even faster...).
Just to mention, the converted HF checkpoint may behave slightly differently in fine-tuning, so you may want to adjust the parameters on your own. My experiments were done with the old PyTorch model. I have tried to fine-tune the HF model on other retrieval datasets and the fine-tuning worked, so it should be fine.
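Putting the settings above together (Adam at 1e-5 for all parameters, the in-batch negative loss returned by the forward pass, and only a few hundred to a thousand steps), a training loop might look like the following minimal sketch; the batch construction and the name of the loss attribute are assumptions rather than the exact FLMR API:

```python
import torch

def finetune_preflmr(model, train_loader, max_steps=1000, lr=1e-5):
    """Hypothetical sketch of fine-tuning PreFLMR on the M2KR Infoseek train split.

    `train_loader` is assumed to yield batches that already contain the
    query/positive/negative inputs expected by the model's forward()
    (4 negatives per positive, as described above).
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # default Adam settings
    model.train()
    step = 0
    for batch in train_loader:
        outputs = model(**batch)
        # forward() returns both a loss and an in-batch negative loss; use the
        # latter. The attribute name below is a placeholder -- check the output class.
        loss = outputs.in_batch_negative_loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        # Infoseek overfits very quickly, so stop after 500-1000 steps.
        if step >= max_steps:
            return
```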
For the BLIP2 training, the learning rate is kept at 1e-5, and the training set is the corresponding training set of Infoseek; the model starts to overfit after 2k-3k training steps. The scheduler is again just Adam with the default settings. In a batch, 5 documents are extracted from the static retrieval results and 3 are selected; if any of the 5 documents contains a pseudo-gold answer, we ensure that the 3 selected documents contain at least one pseudo-gold document (this, to some extent, avoids training the model with only irrelevant documents, while still keeping its ability to distinguish wrong documents). Then 3 sequences (Question: {question} Knowledge: title: {title1} content: {content1}, Question: {question} Knowledge: title: {title2} content: {content2}, ...) are passed to the model to generate the answer, cross-entropy is computed, and the model is trained. (Note that the 3 documents are not concatenated into a single sequence.)
Of course, if you have enough GPU memory, feel free to use all 5 extracted documents without selection; the selection is only there to reduce GPU memory.
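A minimal sketch of the document selection described above; the pseudo-gold check and the document fields are placeholders, not the actual training code:

```python
import random

def select_documents(retrieved_docs, answers, k=3):
    """Pick k of the (typically 5) statically retrieved documents for a question.

    If any retrieved document contains a pseudo-gold answer, ensure at least
    one such document is among the k selected; the remaining slots are
    sampled from the other documents.
    """
    def is_pseudo_gold(doc):
        text = (doc.get("title", "") + " " + doc.get("content", "")).lower()
        return any(str(ans).lower() in text for ans in answers)

    gold = [d for d in retrieved_docs if is_pseudo_gold(d)]
    if gold:
        first = random.choice(gold)
        pool = [d for d in retrieved_docs if d is not first]
        return [first] + random.sample(pool, k - 1)
    return random.sample(retrieved_docs, k)

# Each selected document becomes its own sequence,
# "Question: {question} Knowledge: title: {title} content: {content}",
# i.e. the k documents are NOT concatenated into one input.
```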
Thank you very much for your assistance, I now have a better grasp of the details!
I have one last small question: will there be a PyTorch-format model of PreFLMR released? And if so, will it be compatible with the RAVQA or FLMR codebase?
I just happened to find that the Infoseek train subset of M2KR has 676441 examples (both in the README.md and by manually summing the counts), while the official dataset has 934048. May I ask the reason for that? If it has been filtered, may I ask about the criteria?
Thanks again!
Our group is also moving to the new HF format now. Future models will be trained with the HF implementation to ensure that there is no potential conversion loss.
The set has been filtered. Although the original authors said in the paper that they filtered the examples to ensure that the answers (either float or str) appear in the document, we still found a significant number of samples where the answers could not be found. Therefore, we followed their paper and further filtered the data to make sure that at least one of the answers to each question appears in the ground-truth document (if a value range is provided, we search for all valid numbers and check whether any number is within the provided range).
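A rough sketch of that filtering rule, assuming plain-string answers and an optional numeric range; the actual implementation and field names are not shown in this thread:

```python
import re

def keep_example(answers, doc_text, value_range=None):
    """Keep an Infoseek training example only if an answer is found in the
    ground-truth document.

    If a value range (lo, hi) is provided, extract every number from the
    document and keep the example if any of them falls within the range.
    """
    if value_range is not None:
        lo, hi = value_range
        numbers = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", doc_text)]
        return any(lo <= n <= hi for n in numbers)
    return any(str(ans).lower() in doc_text.lower() for ans in answers)
```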
thanks for all your generous help!
Hi Lin, sorry to bother you. I am trying to reproduce the Infoseek results using the PreFLMR model, but the numbers are not close.
According to Table 2 of the PreFLMR paper, the reported zero-shot R@5 of PreFLMR (the G B-v2 1.96B variant) on Infoseek is 59.6. However, my results are as follows, obtained by following the "Reproduce PreFLMR results" guide. My reproduction script is almost identical to the given one, except that I pulled the models and datasets beforehand due to network issues.
In example_use_preflmr.py, I set n_ranks=8 in the index_custom_collection call (line 48) to accelerate index building, and deleted num_proc=16 from ds.map(tokenize_inputs) (line 240) because it got stuck at Map: 0/4708, but I believe neither change affects the results.
Besides, based on the retrieval results you sent me before (for which I am very grateful) in issue #37 of RA-VQA, I calculated R@5 manually and got 40.57, which is similar to my reproduced results but not close to the numbers reported in the paper.
In the infoseek_blip2.zip you sent me, the files are as follows:
Here is how I calculated it:
It returns:
Did I miss something? Could you please give me a hint? Thanks!
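(For context only: a minimal sketch of one way R@5 can be computed from static retrieval results. This is not the snippet referenced above, and the data layout of infoseek_blip2.zip is an assumption.)

```python
def recall_at_k(retrieved, gold, k=5):
    """retrieved: {question_id: ranked list of passage_ids},
    gold: {question_id: set of relevant passage_ids}."""
    hits = sum(
        1 for qid, ranked in retrieved.items()
        if gold.get(qid) and set(ranked[:k]) & gold[qid]
    )
    return hits / len(retrieved)
```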