LinWeizheDragon / FLMR

The huggingface implementation of Fine-grained Late-interaction Multi-modal Retriever.

fail to reproduce m2kr infoseek subset results using preflmr model #9

Closed Maxlinn closed 2 months ago

Maxlinn commented 2 months ago

Hi Lin, sorry to bother you. I am trying to reproduce the Infoseek results using the PreFLMR model, but my numbers are not close.

According to Table 2 of the PreFLMR paper, the reported zero-shot R@5 of the PreFLMR (G, B-v2, 1.96B) variant on Infoseek is 59.6. However, my results, obtained by following the Reproduce PreFLMR results guide, are as follows:

Map: 100%|██████████| 4708/4708 [06:31<00:00, 12.03 examples/s]
=============================
Inference summary:
=============================
Total number of questions: 4708
Recall@1:        0.22854715378079865
Recall@5:        0.42247238742565846
Recall@10:       0.5218776550552251
Recall@20:       0.6076890399320306
Recall@50:       0.7051826677994902
Recall@100:      0.764231096006797
=============================
Done! Program exiting...

My reproduction script is almost identical to the given one, except that I pull the models and datasets beforehand due to network issues.

In example_use_preflmr.py I set index_custom_collection(n_ranks=8) on line 48 to speed up index building, and removed num_proc=16 from ds.map(tokenize_inputs) on line 240 because it got stuck at Map: 0/4708. I believe neither change affects the results.

python FLMR/examples/example_use_preflmr.py \
    --use_gpu --run_indexing \
    --index_root_path "./preflmr_index" \
    --index_name Infoseek_PreFLMR_ViT-G \
    --experiment_name Infoseek \
    --indexing_batch_size 64 \
    --image_root_dir "./Infoseek/val_images" \
    --dataset_hf_path "./data/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR" \
    --dataset Infoseek \
    --use_split test \
    --nbits 8 \
    --Ks 1 5 10 20 50 100 \
    --checkpoint_path "./pretrained_models/PreFLMR_ViT-G" \
    --image_processor_name "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k" \
    --query_batch_size 8 

Besides, based on the retrieval results you sent me before (for which I am very grateful) in issue #37 of RA-VQA, I calculated R@5 manually and got 40.57, similar to my reproduced results but not close to the numbers reported in the paper.

In the infoseek_blip2.zip you sent me, the files are as follows

generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_0.pkl
generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_1.pkl
generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_2.pkl
generate_test_index_test_InfoseekDatasetForDPR.valid_predictions_rank_3.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_0.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_1.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_2.pkl
generate_train_index_test_InfoseekDatasetForDPR.train_predictions_rank_3.pkl
model_step_2000.ckpt

Here is how I calculated it:

from glob import glob
import pickle
import pandas as pd

## load results
test_results = []
for path in glob('./unzipped/generate_test_index_test_InfoseekDatasetForDPR*'):
    test_results.extend(pickle.load(open(path, 'rb'))['output'])
print('len(test_results)', len(test_results))

## load test passages (identical to the train passages)
test_passages = pd.read_parquet('./data/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR/Infoseek_passages/test_passages-00000-of-00001.parquet')
test_passages.set_index('passage_id', inplace=True)

## load test set (contains `pos_item_ids`, the gold labels)
test = pd.read_parquet('./data/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR/Infoseek_data/test-00000-of-00001.parquet')
test.set_index('question_id', inplace=True)

## calculate result
topk = 5

n_hit = 0
for result in test_results:
    question_id = result['question_id']
    pos_item_ids = test.loc[question_id]['pos_item_ids']

    top_ranking_passage_ids = [ obj['passage_id'] for obj in result['top_ranking_passages']]

    if any(pos_item_id in top_ranking_passage_ids[:topk] 
           for pos_item_id in pos_item_ids):
        n_hit += 1

r_at_n = n_hit / len(test_results)
print(f'R@{topk}:', r_at_n)

It returns:

len(test_results) 4708
R@5: 0.4056924384027188

Did I miss something? Could you please give me a hint? Thanks!

LinWeizheDragon commented 2 months ago

Hi,

I think I just located the issue. The Infoseek subset we uploaded seems to already have the instruction in the question column, and the example code does not account for this, leading to a duplicated instruction in the query. I commented out the code that adds a random instruction, and here is my reproduction:

2 A100 GPUs for indexing and 1 GPU for searching

ViT-B

Total number of questions: 4708
Recall@1:        0.24171622769753612
Recall@5:        0.49405267629566696
Recall@10:       0.6057774001699235
Recall@20:       0.7104927782497876
Recall@50:       0.81053525913339
Recall@100:      0.8602378929481733

ViT-G

Total number of questions: 4708
Recall@1:        0.3016142735768904
Recall@5:        0.5630841121495327
Recall@10:       0.6690739167374682
Recall@20:       0.7550977060322854
Recall@50:       0.8494052676295667
Recall@100:      0.8893372982158029

which is close to the reported values. There is still a small discrepancy, which we traced to the conversion from the old PyTorch model to the new HF model implementation. In fact, this has been observed in some of our finetuning experiments: finetuning the old PyTorch model class behaved slightly differently from finetuning the HF version, even though the model parameters are the same. This seems to be a common issue with conversions :(

Thanks for raising this issue. We plan to release an updated version of M2KR to fix these issues. The official Infoseek data have also changed, and we will try to make a patch for that as well. We also plan to generate a new benchmark result table with the converted HF models. But of course, feel free to get the numbers with your own code, as long as the environment is kept consistent. In any case, finetuning the model on downstream tasks is recommended.
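For reference, the change described above amounts to skipping the random-instruction prepending in the example script, since the uploaded Infoseek questions already begin with an instruction. A minimal sketch, with illustrative names and an illustrative instruction string rather than the exact code in example_use_preflmr.py:

import random

# Hypothetical sketch of the query construction in the example script:
instructions = [
    "Using the provided image, obtain documents that address the subsequent question: ",
]

def build_query(question: str, prepend_instruction: bool = False) -> str:
    # The question column of the uploaded Infoseek subset already begins with an
    # instruction, so prepending another one duplicates it and degrades retrieval.
    if prepend_instruction:
        return random.choice(instructions) + question
    return question  # the fix: use the question as-is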

Maxlinn commented 2 months ago

Thanks for the quick response and for taking the time to check! Yes, after I removed the random_instruction in example_use_preflmr.py the results are very close to yours:

=============================
Inference summary:
=============================
Total number of questions: 4708
Recall@1:    0.308411214953271
Recall@5:    0.5635089209855565
Recall@10:   0.6709855564995751
Recall@20:   0.7646559048428208
Recall@50:   0.8525913338997451
Recall@100:  0.8927357689039932
=============================
Done! Program exiting...

My goal is to reproduce the Infoseek results in Table 7 of the PreFLMR paper, where RA-VQAv2 w/ PreFLMR reaches 30.65, so I wonder if you could tell me:

  1. How can I continue finetuning the PreFLMR models? The Training with contrastive learning section shows an excerpt of model.forward; is there anything more, such as a finetuning script with hyperparameters?
  2. How can I finetune RA-VQAv2 to obtain the final results? I carefully checked the RA-VQAv2 codebase (BLIP2 with FLMR), and there does not seem to be a config for Infoseek; could you please share one if you have it? Or, since retrieval and QA are separate procedures, could I just finetune an official Salesforce/blip2-flan-t5-xl to generate the answer? In that case, could you illustrate how the training examples are built (prompts with multiple retrieved documents, each with a title and content) and the finetuning hyperparameters? I found some details in the PreFLMR paper, Appendix B.2, VQA Finetuning, but I still need the learning rate, number of examples, number of epochs, etc.

Deeply sorry for raising so many questions, and much appreciation for your generous help!

LinWeizheDragon commented 2 months ago

You don't have to finetune PreFLMR if you just want to train the RAG part. You can either use the HF model to get retrieved documents, or simply use the files I sent you, which were generated by the PyTorch version of PreFLMR-G.

The BLIP2 with FLMR is an example of training BLIP2 with RAG. Unfortunately, the Infoseek config I had does not apply to the new M2KR dataset. If you want to use the same framework as OKVQA, you can write a config that replaces OKVQA with Infoseek. Or you can write your own code with any framework you like to fine-tune a RAG-version BLIP2 (you should be able to find examples on the Internet) by passing in the retrieved documents.

We performed naive RAG training in our experiments: Question: xxxx Knowledge: [retrieved doc k] Answer: xxxx. Other details (like how to perform inference) can be found in our BLIP2 script. The fine-tuning parameters can be found in Appendix B.2.
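Roughly, each training sequence follows the format above. A minimal sketch of how one example could be built, with placeholder names (the exact strings live in the BLIP2 script):

def build_rag_example(question: str, retrieved_doc: str, answer: str):
    # One retrieved document per sequence; the answer is the generation target
    # used for the cross-entropy loss.
    model_input = f"Question: {question} Knowledge: {retrieved_doc} Answer:"
    target = answer
    return model_input, target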

Maxlinn commented 2 months ago

Thanks for the timely reply :)

For finetuning PreFLMR, I still want to finetune it on the M2KR Infoseek train split to reproduce the results in the paper. I found that the code excerpt in Training with contrastive learning does return a loss. If there is no training script yet, I can write one with the HF Trainer, but I still need the hyperparameters. After carefully reading Appendix B.2 of the PreFLMR paper (page 17, bottom, "Single-task Downstream Finetuning"), I still need important hyperparameters such as the learning rate, number of training examples, number of epochs, and scheduler settings. I wonder if you could kindly help with these.

For finetuning the BLIP2 flan-t5 reader, I could write scripts myself, but that also requires those important hyperparameters (learning rate, number of training examples, number of epochs, and scheduler), which the "VQA Finetuning" section on that page does not seem to give. And, just to confirm: for each question in a training batch, the top 5 relevant documents were pre-extracted using the retriever, and 3 out of the 5 were randomly selected, so there is no guarantee that at least one positive document is present, right? And one training example looks like Question: {question} Knowledge: title: {title1} content: {content1} title: {title2} content: {content2} title: {title3} content: {content3} Answer: {answer}, where the loss is only calculated on {answer}.

Heartfelt appreciation for all your help!

LinWeizheDragon commented 2 months ago

The forward function returns both a loss and an in-batch negative loss; use the in-batch negative loss. The learning rate is kept at 1e-5 for all parameters, the training set is the corresponding training set of Infoseek, and there are 4 negative examples per positive example. The optimizer is just Adam (either is fine) with the default settings. Note that Infoseek's training set and validation set have completely different distributions, which means the model will overfit very quickly. On my side, 500-1000 steps were sufficient (in fact, I don't suggest fine-tuning on Infoseek to evaluate any retrieval model, since a more powerful model overfits even faster...).
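A minimal sketch of a fine-tuning loop along these lines, assuming batches are prepared as in the Training with contrastive learning example (one positive and 4 negative passages per query) and that the forward output exposes the in-batch negative loss under the attribute name used below (check the FLMR implementation for the exact name):

import torch

# `model` is assumed to be an FLMRModelForRetrieval loaded from the PreFLMR checkpoint;
# `train_dataloader` is assumed to yield batches prepared as in the contrastive-learning example.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr 1e-5 for all parameters, default Adam settings

model.train()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.in_batch_negative_loss  # use the in-batch negative loss, as suggested above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= 1000:  # Infoseek overfits very quickly; ~500-1000 steps are usually enough
        break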

Just to mention, the converted HF checkpoint may behave slightly differently in fine-tuning, so you may want to adjust the parameters on your own. My experiments were done with the old PyTorch model. I have tried fine-tuning the HF model on other retrieval datasets and it worked, so it should be fine.

For the BLIP2 training, the learning rate is kept at 1e-5, and the training set is the corresponding training set of Infoseek; after 2k-3k training steps the model starts to overfit. The optimizer is Adam. In a batch, 5 documents are extracted from the static retrieval results and 3 of them are selected, but if any of the 5 documents contains a pseudo-gold answer, we ensure that the 3 selected documents contain at least one pseudo-gold document (this partly avoids training the model with only irrelevant documents, while keeping its ability to distinguish wrong documents). Then 3 sequences (Question: {question} Knowledge: title: {title1} content: {content1}, Question: {question} Knowledge: title: {title2} content: {content2}, ...) are passed to the model to generate the answer; we compute the cross-entropy and train the model. (Note that the 3 documents are not concatenated into a single sequence.)
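A minimal sketch of the document selection described above, with illustrative helper names (the actual implementation is in the BLIP2 script):

import random

def select_documents(retrieved_docs, pseudo_gold_answers, k=3):
    # retrieved_docs: the top-5 passages from the static retrieval results (plain strings here).
    # A document counts as pseudo-gold if it contains any pseudo-gold answer string.
    gold = [d for d in retrieved_docs
            if any(str(ans).lower() in d.lower() for ans in pseudo_gold_answers)]
    if gold:
        # keep at least one pseudo-gold document, fill the rest randomly from the remaining top-5
        keep = random.choice(gold)
        rest = [d for d in retrieved_docs if d is not keep]
        picked = [keep] + random.sample(rest, k - 1)
    else:
        picked = random.sample(retrieved_docs, k)
    random.shuffle(picked)
    return picked  # each selected document becomes its own "Question ... Knowledge ... Answer" sequence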

Of course, if you have enough GPU memory, feel free to use all 5 extracted documents without selection; the selection is only for reducing GPU memory.

Maxlinn commented 2 months ago

Thank you very much for your assistance, I now have a better grasp of the details!

I have one last small question: will a PyTorch-format model of PreFLMR be released? And if so, will it be compatible with the RAVQA or FLMR codebase?

I also happened to find that the Infoseek train subset of M2KR has 676441 examples (both in the README.md and by manually summing the counts), while the official release has 934048. May I ask the reason for that? If it has been filtered, may I ask for the criteria?

Thanks again!

LinWeizheDragon commented 2 months ago

Our group is also moving to the new HF format now. Future models will be trained with the HF implementation to ensure that there is no potential conversion loss.

The set has been filtered. Although the original authors said in their paper that they filtered the examples to ensure the answers (either float or string) appear in the document, we still found a significant number of samples where the answers could not be found. Therefore, we followed their paper and filtered the data further to make sure at least one of the answers to each question appears in the ground-truth document (if a value range is provided, we search for all valid numbers and check whether any of them falls within the provided range).
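For illustration, the filtering rule amounts to roughly the following per-example check (function and field names are hypothetical):

import re

def passes_filter(answers, answer_range, document_text):
    # Keep the example if at least one answer string appears in the ground-truth document.
    if any(str(ans).lower() in document_text.lower() for ans in answers):
        return True
    # If a value range is provided, keep the example when any number found in the
    # document falls within that range.
    if answer_range is not None:
        low, high = answer_range
        for match in re.findall(r"-?\d+(?:\.\d+)?", document_text):
            if low <= float(match) <= high:
                return True
    return False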

Maxlinn commented 2 months ago

Thanks for all your generous help!