LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval Augmented Visual Question Answering
GNU General Public License v3.0

Where can I find the speed comparison between DPR and FLMR, PREFLMR #38

Closed qwqwq1445 closed 4 months ago

qwqwq1445 commented 5 months ago

Thanks for your awesome work! I just want to know the speed difference between DPR and FLMR (or PreFLMR). Looking forward to your reply!

LinWeizheDragon commented 5 months ago

Hi,

In practice, the speed bottleneck is query encoding. For example, with PreFLMR-ViT_G, it takes around 1s to encode the query and 0.2s to retrieve 100 documents from 400K documents. If using DPR, retrieval may take much less time (with FAISS, this could be 0.01s), but the encoding time is not reducible. FLMR takes more time in encoding since multiple regions of interest need to be encoded by the ViT encoder. Of course, batching all ROIs can make the encoding time similar to that of PreFLMR.

You can find a demo on the project page.

qwqwq1445 commented 5 months ago

In the PreFLMR paper, it is stated that FLMR utilizes the [CLS] representation of ViT. However, I cannot find any relevant information in the FLMR (RA-VQAv2) paper. I would appreciate it if you could kindly explain it to me.

LinWeizheDragon commented 5 months ago

The ViT model takes the input sequence [CLS] patch1 patch2 ...... FLMR takes the last-layer output embedding corresponding to [CLS] as the visual representation of the input image. This vector is then passed through a lightweight MLP to obtain the visual tokens used in late interaction.
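
For concreteness, here is a minimal sketch (not the repository's code) of extracting the [CLS] embedding from a Hugging Face ViT and mapping it to visual tokens; the checkpoint name, MLP sizes, and token count are illustrative assumptions.

    import torch
    import torch.nn as nn
    from transformers import ViTImageProcessor, ViTModel
    from PIL import Image

    vit_name = "google/vit-base-patch16-224-in21k"   # illustrative; not the checkpoint used in FLMR
    processor = ViTImageProcessor.from_pretrained(vit_name)
    vit = ViTModel.from_pretrained(vit_name)

    image = Image.open("example.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        cls_embedding = vit(**inputs).last_hidden_state[:, 0]   # [1, hidden]: the [CLS] token output

    # Light-weight MLP mapping the single [CLS] vector to several visual tokens for late interaction.
    num_visual_tokens, late_dim = 32, 128                        # assumed sizes, not the paper's values
    mlp = nn.Sequential(
        nn.Linear(vit.config.hidden_size, 1024),
        nn.ReLU(),
        nn.Linear(1024, num_visual_tokens * late_dim),
    )
    visual_tokens = mlp(cls_embedding).view(-1, num_visual_tokens, late_dim)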

qwqwq1445 commented 5 months ago

Thanks for your reply. I notice that you also report retrieval results of PreFLMR on WIT. Have you ever tried testing your model on OKVQA with only the WIT subset for retrieval?

LinWeizheDragon commented 5 months ago

We did that in the FLMR paper, but not in PreFLMR. You can find the relevant information in the appendix of the FLMR paper.

qwqwq1445 commented 5 months ago

> We did that in the FLMR paper, but not in PreFLMR. You can find the relevant information in the appendix of the FLMR paper.

Sorry, I made a mistake. The retrieval performance is great. What I really meant is: have you ever tried testing RA-VQAv2 (equipped with FLMR) on OKVQA with only WIT as the knowledge source?

LinWeizheDragon commented 5 months ago

No, we haven't done that in our experiments.

qwqwq1445 commented 5 months ago

Excuse me, could you please tell me how you selected the subset of WIT?

LinWeizheDragon commented 5 months ago

It is the same as the Wikipedia corpus. Common concepts, in practice, were the visual concepts from VinVL object detection (all objects in the VinVL classifier vocabulary), such as dog, cat, umbrella, etc.

qwqwq1445 commented 5 months ago

> It is the same as the Wikipedia corpus. Common concepts, in practice, were the visual concepts from VinVL object detection (all objects in the VinVL classifier vocabulary), such as dog, cat, umbrella, etc.

Hi, would you kindly release the WIT subset filtered by VinVL?

qwqwq1445 commented 5 months ago

By the way, I am trying to use PreFLMR with the FLMR repository. How can I build the retrieval index with the passage contents as well as the pictures (just like WIT)? It seems that in example_use_preflmr.py the retrieval index is built with the passage contents only.

LinWeizheDragon commented 5 months ago

> It is the same as the Wikipedia corpus. Common concepts, in practice, were the visual concepts from VinVL object detection (all objects in the VinVL classifier vocabulary), such as dog, cat, umbrella, etc.
>
> Hi, would you kindly release the WIT subset filtered by VinVL?

Hi, it has been a long while since I ran this experiment, and I can no longer find the prepared corpus (it was an intermediate cache file generated by the framework).

LinWeizheDragon commented 5 months ago

> By the way, I am trying to use PreFLMR with the FLMR repository. How can I build the retrieval index with the passage contents as well as the pictures (just like WIT)? It seems that in example_use_preflmr.py the retrieval index is built with the passage contents only.

https://github.com/LinWeizheDragon/FLMR?tab=readme-ov-file#index-a-custom-document-collection

The commented code shows how to index a multi-modal corpus. Note that the vision embedding is extracted by the vision model and the linear projection layer of the query encoder. If you want better performance, you should fine-tune the model with multi-modal queries as well as multi-modal documents.
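
As a rough illustration only (attribute names such as doc_text_encoder, vision_encoder, and vision_projection are hypothetical; see the linked README section for the actual API), a multi-modal document embedding can be formed by concatenating the passage's text token embeddings with visual tokens obtained from the vision model plus the query encoder's linear projection:

    import torch

    def embed_multimodal_document(model, text_inputs, image_inputs):
        # Late-interaction token embeddings for the passage text.
        text_embeds = model.doc_text_encoder(**text_inputs)                      # [1, seq_len, dim]
        # Vision embedding: vision backbone output projected by the query encoder's linear layer.
        vision_feat = model.vision_encoder(**image_inputs).last_hidden_state[:, 0]
        vision_embeds = model.vision_projection(vision_feat).view(1, -1, text_embeds.shape[-1])
        # The document representation used for indexing is the concatenation of both token sets.
        return torch.cat([text_embeds, vision_embeds], dim=1)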

qwqwq1445 commented 4 months ago

Hi, I have another question: when conducting inference with the Google Search corpus, does RA-VQAv2 feed the whole paragraph of each retrieved document into BLIP2 or T5-large for the answer? Do you do any segmentation on the documents?

LinWeizheDragon commented 4 months ago

No segmentation is done. The whole passage is used until it reaches the maximum number of tokens allowed. Since these two models are both encoder-decoder, this does not affect generation on the decoder side. This is to align with previous work, which did not report using segmentation.
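
As a small sketch of this (the tokenizer name, prompt format, and length limit below are illustrative, not the exact ones used in the codebase):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

    query = "question: What sport can you do at this place? caption: a sandy beach with waves"
    passage = "Surfing is a surface water sport in which a person rides a breaking wave ..."  # used whole

    inputs = tokenizer(
        query + " knowledge: " + passage,
        truncation=True,      # the passage is simply cut off once the encoder's token limit is reached
        max_length=512,
        return_tensors="pt",
    )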

qwqwq1445 commented 4 months ago

During inference, each retrieved document is put into the answer generator, and we get n answers. How do you select one answer from these candidates?

LinWeizheDragon commented 4 months ago

Hi, the one with the highest confidence is chosen.
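
In other words, out of the n candidate answers (one per retrieved document), the one whose generator confidence is highest is returned. A toy sketch with made-up numbers:

    # One candidate answer per retrieved document, scored by the generator's sequence log-probability.
    candidates = [
        {"doc_id": "d1", "answer": "surfing",    "log_prob": -0.21},
        {"doc_id": "d2", "answer": "skateboard", "log_prob": -1.35},
        {"doc_id": "d3", "answer": "surfing",    "log_prob": -0.48},
    ]
    best = max(candidates, key=lambda c: c["log_prob"])
    print(best["answer"])   # -> "surfing"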

qwqwq1445 commented 4 months ago

> It is the same as the Wikipedia corpus. Common concepts, in practice, were the visual concepts from VinVL object detection (all objects in the VinVL classifier vocabulary), such as dog, cat, umbrella, etc.

Could you please provide the VinVL classifier vocabulary? I am having difficulty finding it on the Internet.

LinWeizheDragon commented 4 months ago

VG-SGG-dicts-danfeiX-clipped.json
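
A small sketch of pulling the object class list out of that file, assuming the usual VG-SGG dictionary layout with an "idx_to_label" mapping (check the keys of the file you download):

    import json

    with open("VG-SGG-dicts-danfeiX-clipped.json") as f:
        vg_dicts = json.load(f)

    vinvl_concepts = sorted(set(vg_dicts["idx_to_label"].values()))
    print(len(vinvl_concepts), vinvl_concepts[:5])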

qwqwq1445 commented 4 months ago

Excuse me; I have questions about the filtering process for the WIT dataset. I first downloaded the WIT subset from your released retrieval dataset M2KR; the training split of WIT_passages has about 4.2M entries, which is much smaller than the original WIT. Then I tried to filter the WIT dataset with your provided VinVL classifier list: if any keyword is found in the 'passage_content' column of a WIT item, the item is kept. However, in each parquet file of the M2KR WIT training set (about 150K items per file), roughly 65k items pass the filter, so filtering the whole training set yields approximately 65k * 27 items. The number of filtered WIT passages mentioned in the paper is only 87k. Is there anything wrong with my filtering process?

LinWeizheDragon commented 4 months ago

The WIT sets used in FLMR and PreFLMR are different. The one in FLMR (which you showed in the figure) used the first 5 training splits of the original WIT dataset. The filtering process has been discussed previously. One thing you did wrong was to match against the passage content - the classifier list contains many commonly seen objects, so it is more reasonable to match only the title to ensure that the relevant concepts are included.

The one in PreFLMR (released through M2KR) is the large-scale training set for training PreFLMR. It comes from all splits, and the preprocessing steps are outlined in the appendix of the PreFLMR paper. VinVL concepts are not used there.
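
For the FLMR-style filtering described above (matching VinVL concepts against the page title only), here is a minimal pandas sketch on one official WIT training shard; the three concepts are stand-ins for the full VinVL class list, and column names follow the WIT TSV schema:

    import pandas as pd

    vinvl_concepts = {"dog", "cat", "umbrella"}   # replace with the full VinVL classifier vocabulary

    shard = pd.read_csv("wit_v1.train.all-00000-of-00010.tsv.gz", sep="\t")
    shard = shard[(shard["language"] == "en") & (shard["is_main_image"])]

    def title_matches(title) -> bool:
        title = str(title).lower()
        return any(concept in title for concept in vinvl_concepts)

    filtered = shard[shard["page_title"].apply(title_matches)]
    print(len(filtered))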

qwqwq1445 commented 4 months ago

Thanks for your reply. I am trying to download the WIT dataset from this link (https://github.com/google-research-datasets/wit/blob/main/DATA.md), and I want to confirm that you also downloaded the shards of the first five training splits from this link. In addition, the downloaded TSV files contain many different languages. I need to first keep only the records whose language is "en" and then filter them by whether "page_title" contains the keywords. Is the procedure I described correct?

LinWeizheDragon commented 4 months ago

Hi, I have to say it will be very difficult to reproduce it exactly, but it should work if you just want to approximate the set I derived. Unfortunately, the WIT data used in FLMR was not saved (since it was just a side experiment). I just read my code again. Here is the general process:

  1. download the first 5 training splits, 1 valid split, and 1 test split from the official repo
  2. download images - this is where the discrepancy may happen - I had to drop passages whose images could not be downloaded.
  3. create passage content from the data

    def process_example(item):
        passage_content = f"title: {item['page_title']}"
        if item['section_title'] is not None:
            passage_content += f" section title: {item['section_title']}"
        if item['hierarchical_section_title'] is not None:
            passage_content += f" hierarchical section title: {item['hierarchical_section_title']}"
        if item['caption_reference_description'] is not None:
            passage_content += f" caption reference description: {item['caption_reference_description']}"
        if item['caption_attribution_description'] is not None:
            passage_content += f" caption attribution description: {item['caption_attribution_description']}"
        if item['caption_alt_text_description'] is not None:
            passage_content += f" caption alt text description: {item['caption_alt_text_description']}"

        passage_content += f" content: {item['context_page_description']}"

        item['passage_content'] = passage_content

        return item
  4. use only English passages and main images. ds = ds.filter(lambda x: x["language"] == 'en' and x['is_main_image'] == True)
  5. create an Elasticsearch index using the passage contents (a minimal indexing sketch is given after this list)
  6. load OKVQA questions and answers. Use elastic search to mark pseudo labels

@register_transform_functor
class PrepareWITPassageAnnotations(BaseTransform):
    def setup(self, *args, **kwargs):
        self.module_config = EasyDict(kwargs)
        self.data = EasyDict()
        self.config = self.global_config
    
    def _call(self, inputs, *args, **kwargs):
        """
        This function prepares Wikipedia passage annotations (pseudo labels)
        {
            "annotations_path": {
                "train": "..",
                "valid": "..",
                "test": "..",
            },
        },
        """
        for input_data in inputs:
            self.data.update(input_data)
    
        module_config = self.module_config
    
        ######################
        #  Get weak supervision annotations
        ######################
        self.data.okvqa_data_with_dpr_output = EasyDict({
            'train': {},
            'valid': {},
            'test': {},
            'lookup': {},
        })
        self.data.passages.annotations = EasyDict({})
    
        # Prepare ElasticSearch
        from elasticsearch import Elasticsearch, helpers
    
        # Password for the 'elastic' user generated by Elasticsearch
        ELASTIC_PASSWORD = os.environ["ELASTIC_PASSWORD"]
    
        es = Elasticsearch(
            "https://localhost:9200",
            ca_certs=os.environ["ELASTIC_CA_CERTS"],
            basic_auth=("elastic", ELASTIC_PASSWORD)
        )
    
        # Successful response!
        es.info()
    
        ds = self.data.passages.dataset
        index_name = module_config.index_name
    
        def search_for_a_string(query):
            resp = es.search(index=index_name, query={
                "multi_match" : {
                    "query": query,
                    "fields": ["title", "text"],
                    "type": "phrase",
                }
            }, timeout="60s")
            return resp
    
        from thefuzz import fuzz
        from thefuzz import process
    
        available_documents = {}
    
        for data_split in ['train', 'valid', 'test']:
    
            for item in tqdm(self.data.okvqa_data[data_split].data_items):
                question_id = item.question_id
    
                # Search ES and return all passages containing answers
                passages_match_answer = []
    
                for answer in set(item.answers):
                    passages_match_answer.extend(
                        search_for_a_string(answer)['hits']['hits']
                    )
    
                # print("answers", item.answers)
    
                for i in passages_match_answer:
                    available_documents[str(i['_id'])] = 1
    
                # Rate passages according to query information (e.g. question, objects in the image)
                choices = {
                    i['_id']: i['_source']['text'] for i in passages_match_answer
                }
    
                element_string_in_query = f'{item.gold_answer} {item.gold_answer} {item.question} {item.img_caption["caption"]}'
    
                for obj in item.objects:
                    element_string_in_query += f" {obj['class'].strip().lower()}"
    
                res = process.extract(element_string_in_query, choices, limit=10, scorer=fuzz.token_set_ratio)
                # print("rating", choices, 'according to', item.question)
                # drop lowest score item to further filter down the annotations
                if len(res) > 0:
                    lowest_score = res[-1][1]
                    res = [i for i in res if i[1] > lowest_score]
                else:
                    res = []
    
                knowledge_collection = [
                    i[2] for i in res
                ]
                self.data.passages.annotations[str(question_id)] = {
                    'passages': knowledge_collection,
                }
                # print(f"question {question_id} has {len(knowledge_collection)} passages")
    
        print(f"total #docs {len(ds)}")
        print(f"total #docs with answers {len(available_documents)}")
    
        self.data.passages.available_documents = available_documents
    
        for data_split in ['train', 'valid', 'test']:
            self.data.okvqa_data_with_dpr_output[data_split] = EasyDict({})
            self.data.okvqa_data_with_dpr_output[data_split].data_items = []
    
            missing_entries = []
            missing_data = []
    
            for item in self.data.okvqa_data[data_split].data_items:
                question_id = item['question_id']
                annotation = self.data.passages.annotations.get(str(question_id), None)
    
                if annotation is None:
                    missing_entries.append(str(question_id))
                    # logger.warning("question {} (split {}) not found in knowledge.".format(str(question_id), data_split))
                    if self.config.mode == 'train':
                        continue
                    else:
                        # in testing mode, all samples must be used
                        related_knowledge = [1]
                else: 
                    related_knowledge = annotation['passages']
                    if len(related_knowledge) == 0:
                        missing_data.append(str(question_id))
                        # logger.warning("question {} (split {}) has no related knowledge in annotations.".format(str(question_id), data_split))
                        # related_knowledge = [1]
                        if self.config.mode == 'train':
                            continue
                        else:
                            # in testing mode, all samples must be used
                            related_knowledge = [1]
    
                knowledge_item = EasyDict(dict(item))
                knowledge_item['pos_item_ids'] = related_knowledge
                # knowledge_item['pos_item_contents'] = [
                #     self.data.passages.id2doc[str(passage_id)] for passage_id in related_knowledge
                # ]
                self.data.okvqa_data_with_dpr_output[data_split].data_items.append(knowledge_item)
    
            if len(missing_entries) > 0:
                logger.warning(f"{len(missing_entries)} questions (split {data_split}) not found in knowledge. \n {missing_entries}")
            if len(missing_data) > 0:
                logger.warning(f"{len(missing_data)} questions (split {data_split}) has no annotations. \n {missing_data}")
    
            # Load item data into lookup with question_id as index
            logger.info('Indexing data items...')
    
            for item in tqdm(self.data.okvqa_data_with_dpr_output[data_split].data_items):
                question_id = item['question_id']
                self.data.okvqa_data_with_dpr_output.lookup[str(question_id)] = item
    
            # Report statistics
            logger.info('[Data statistics] loaded with knowledge data split: {}  entries: {}'.format(
                data_split,
                len(self.data.okvqa_data_with_dpr_output[data_split].data_items)))
    
        output_data = EasyDict(
            okvqa_data_with_dpr_output = self.data.okvqa_data_with_dpr_output,
            passages=self.data.passages,
        )
        return output_data
  7. drop questions (of course not test questions) which do not have pseudo documents annotated.
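
For step 5, a minimal indexing sketch (the index name and mapping are illustrative; `es` is the Elasticsearch client and `ds` the passage dataset set up as in the step-6 snippet above):

    from elasticsearch import helpers

    index_name = "wit_passages"
    es.indices.create(
        index=index_name,
        mappings={"properties": {"title": {"type": "text"}, "text": {"type": "text"}}},
    )

    actions = (
        {
            "_index": index_name,
            "_id": str(i),
            "_source": {"title": item["page_title"], "text": item["passage_content"]},
        }
        for i, item in enumerate(ds)
    )
    helpers.bulk(es, actions)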

I hope this may help your research. If you are able to reproduce the set closely, please let me know!

qwqwq1445 commented 4 months ago

In RA-VQAv2, the retriever FLMR is trained along with the answer generator. I would like to use PreFLMR as the retriever. I would appreciate it if you could tell me whether you have ever trained the whole pipeline consisting of PreFLMR and BLIP2 in an end-to-end manner. Do you think it is practical and easy to do so? Looking forward to your reply.

LinWeizheDragon commented 4 months ago

I did not try training PreFLMR jointly with the generator. Joint training is possible (you can refer to the BLIP2 model), but it requires a large amount of GPU memory. I tried to load PreFLMR-G in joint training, but a 40G A100 was not sufficient. That's why I added support for static retrieval. You can also try putting the retriever on GPU 0 and the generator on GPU 1 to distribute the load.
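
A trivial sketch of that device split (the module names are placeholders for whatever retriever/generator objects you instantiate):

    import torch

    def split_across_gpus(retriever: torch.nn.Module, generator: torch.nn.Module):
        """Keep the retriever on GPU 0 and the generator on GPU 1 to share the memory load."""
        retriever.to("cuda:0")
        generator.to("cuda:1")
        return retriever, generator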

qwqwq1445 commented 3 months ago

As for the choice of answers according to the joint probability mentioned in RAVQA, I still have one question. I tried using PreFLMR and found that retrieval returns a score value, in my case around 30, which I got from the variable 'retrieved_docs' around line 95 of examples/example_use_preflmr.py. I would like to know what this score means and how it is calculated; I can't find a specific explanation in the code. How can I use this score to compute the output probability of the retriever?

LinWeizheDragon commented 3 months ago

The scores are the retrieval scores produced by the backend ColBERT engine. They are approximate retrieval scores. You can softmax the scores of the top 100 documents to approximate the retriever probability, under the assumption that documents outside the top 100 carry only a very small probability.
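
A quick sketch of that approximation (the scores below are made-up values standing in for the ColBERT scores returned in retrieved_docs):

    import torch

    scores = torch.tensor([31.2, 30.8, 29.5, 27.1])   # top-k ColBERT retrieval scores (illustrative)
    retriever_probs = torch.softmax(scores, dim=0)     # approximate P(doc | query) over the top k
    print(retriever_probs)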

qwqwq1445 commented 3 months ago

I have difficulties reproducing the results of blip2_t5 with and without knowledge. Could you please provide the hyperparameter settings for fine-tuning BLIP2? I also wonder whether you add additional information to the OKVQA dataset when fine-tuning without knowledge, such as the corresponding OCR info.

qwqwq1445 commented 3 months ago

I may not have made my question clear. Were the BLIP2 fine-tuning results without knowledge given in the RA-VQAv2 paper obtained using only the original OKVQA dataset? Or did you augment the questions in the dataset with additional information, like OCR, to get those fine-tuning results?

LinWeizheDragon commented 3 months ago

The query contains additional information (like captions) but not the retrieved passages. The only difference should be whether you have the retrieved documents or not, to enable comparison. The attached screenshot shows the test performance with BLIP2, no knowledge incorporated.

qwqwq1445 commented 3 months ago

Will you kindly release the checkpoint of the best BLIP2 in OKVQA?

qwqwq1445 commented 3 months ago

Hi, I can't find the static retrieval results of OKVQA -> Google Search in this repo. Do I need to conduct the retrieval process myself? I am afraid that if I do retrieval on my machine there will be some unavoidable differences between mine and your results, which will cause difficulties in reproducing your paper results. Could you please kindly release your static retrieved results? Thank you very much.

LinWeizheDragon commented 3 months ago

> Will you kindly release the checkpoint of the best BLIP2 in OKVQA?

I will check whether I have such a checkpoint. But you can easily train it with the codebase by setting K=1 and commenting out the part that appends retrieved documents to the query.

LinWeizheDragon commented 3 months ago

> Hi, I can't find the static retrieval results of OKVQA -> Google Search in this repo. Do I need to conduct the retrieval process myself? I am afraid that if I do retrieval on my machine there will be some unavoidable differences between mine and your results, which will cause difficulties in reproducing your paper results. Could you please kindly release your static retrieved results? Thank you very much.

I think you will need to generate them on your own; I did not keep them in the end. But with the codebase, you should be able to fully reproduce FLMR (w/ 10 ROIs), and the ultimate VQA performance should not be affected much since the retrieval performance is almost the same. You can also use the released FLMR model to run on the GS corpus: https://huggingface.co/LinWeizheDragon/FLMR. This should give you basically the same performance as reported in the paper.

qwqwq1445 commented 3 months ago

> Hi, I can't find the static retrieval results of OKVQA -> Google Search in this repo. Do I need to conduct the retrieval process myself? I am afraid that if I do retrieval on my machine there will be some unavoidable differences between mine and your results, which will cause difficulties in reproducing your paper results. Could you please kindly release your static retrieved results? Thank you very much.
>
> I think you will need to generate them on your own; I did not keep them in the end. But with the codebase, you should be able to fully reproduce FLMR (w/ 10 ROIs), and the ultimate VQA performance should not be affected much since the retrieval performance is almost the same. You can also use the released FLMR model to run on the GS corpus: https://huggingface.co/LinWeizheDragon/FLMR. This should give you basically the same performance as reported in the paper.

Which Google Search corpus should I use for retrieval? The "okvqa_train_clean.csv" in the official repo?

qwqwq1445 commented 3 months ago

Excuse me; I checked your M2KR repo and found a directory called "OKVQA_passages." The training, validation, and test files in this directory seem to be the same. Could you tell me the relation between this directory and the GS corpus?

LinWeizheDragon commented 3 months ago

The processed datasets are:

https://huggingface.co/datasets/BByrneLab/OKVQA_FLMR_preprocessed_data https://huggingface.co/datasets/BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages

These are the OKVQA dataset and the Google Search passage corpus.

The splits in M2KR are with the Wikipedia corpus.

The FLMR checkpoint was trained on the Google Search corpus. The PreFLMR checkpoints were trained on the Wikipedia corpus (and this is why the split in M2KR is with Wikipedia).

In your case, I suggest using the Google Search corpus to obtain the best VQA performance, as this corpus is dedicated to OK-VQA's VQA task.
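
If it helps, a short sketch of fetching these from the Hub (assuming the repos load directly with the datasets library; otherwise download the files from the dataset pages):

    from datasets import load_dataset

    okvqa_data = load_dataset("BByrneLab/OKVQA_FLMR_preprocessed_data")
    gs_passages = load_dataset("BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages")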

qwqwq1445 commented 3 months ago

I have difficulties reproducing the BLIP2 (Flan-T5 XL) results w/o knowledge & w/o text-based features. The result in your paper is 54.50 and mine is 50.25. Could you please provide the fine-tuning parameters? I found this in your repo.

qwqwq1445 commented 3 months ago

I found these training details in your RA-VQAv2 paper. I wonder which part you applied LoRA to for fine-tuning. Did you use LoRA to fine-tune the T5 model in BLIP2 (Flan-T5 XL)?

LinWeizheDragon commented 3 months ago

The finetuning hyperparameters you found are correct. Yes, we applied LoRA with the default huggingface-peft setting, which should be applied to the whole model.

qwqwq1445 commented 3 months ago

> The finetuning hyperparameters you found are correct. Yes, we applied LoRA with the default huggingface-peft setting, which should be applied to the whole model.

The whole model, including the Q-Former, ViT, and T5? They all have attention blocks.

LinWeizheDragon commented 3 months ago

Yes, unless the peft package handles the default setting differently for BLIP2. You can try loading it onto a BLIP2 model and checking the extra weights.

qwqwq1445 commented 3 months ago

> Yes, unless the peft package handles the default setting differently for BLIP2. You can try loading it onto a BLIP2 model and checking the extra weights.

I use the original LAVIS library to reproduce your results. Directly loading a BLIP2 model and applying LoRA to it doesn't work. When we talk about fine-tuning an MLLM, it usually means fine-tuning the Q-Former only. Could you please tell me exactly which part of BLIP2 you fine-tuned?

LinWeizheDragon commented 3 months ago
>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.14s/it]
>>> from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
>>> peft_config = LoraConfig(
...     task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
... )
>>> model = get_peft_model(model, peft_config)
>>> model.print_trainable_parameters()
trainable params: 6475776 || all params: 3948922368 || trainable%: 0.16398843523682055

>>> for name, param in model.named_parameters():
...     if "lora" in name:
...             print(name)
... 
base_model.model.vision_model.encoder.layers.0.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.0.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.1.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.1.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.2.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.2.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.3.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.3.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.4.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.4.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.5.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.5.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.6.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.6.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.7.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.7.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.8.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.8.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.9.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.9.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.10.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.10.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.11.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.11.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.12.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.12.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.13.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.13.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.14.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.14.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.15.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.15.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.16.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.16.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.17.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.17.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.18.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.18.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.19.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.19.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.20.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.20.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.21.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.21.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.22.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.22.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.23.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.23.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.24.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.24.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.25.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.25.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.26.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.26.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.27.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.27.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.28.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.28.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.29.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.29.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.30.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.30.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.31.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.31.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.32.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.32.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.33.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.33.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.34.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.34.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.35.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.35.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.36.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.36.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.37.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.37.self_attn.qkv.lora_B.default.weight
base_model.model.vision_model.encoder.layers.38.self_attn.qkv.lora_A.default.weight
base_model.model.vision_model.encoder.layers.38.self_attn.qkv.lora_B.default.weight
base_model.model.language_model.encoder.block.0.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.0.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.0.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.0.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.1.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.1.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.1.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.1.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.2.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.2.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.2.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.2.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.3.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.3.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.3.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.3.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.4.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.4.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.4.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.4.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.5.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.5.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.5.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.5.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.6.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.6.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.6.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.6.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.7.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.7.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.7.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.7.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.8.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.8.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.8.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.8.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.9.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.9.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.9.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.9.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.10.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.10.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.10.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.10.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.11.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.11.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.11.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.11.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.12.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.12.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.12.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.12.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.13.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.13.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.13.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.13.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.14.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.14.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.14.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.14.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.15.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.15.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.15.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.15.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.16.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.16.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.16.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.16.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.17.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.17.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.17.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.17.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.18.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.18.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.18.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.18.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.19.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.19.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.19.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.19.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.20.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.20.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.20.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.20.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.21.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.21.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.21.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.21.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.22.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.22.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.22.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.22.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.encoder.block.23.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.encoder.block.23.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.encoder.block.23.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.encoder.block.23.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.0.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.0.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.0.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.0.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.0.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.0.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.0.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.0.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.1.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.1.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.1.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.1.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.1.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.1.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.1.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.1.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.2.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.2.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.2.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.2.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.2.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.2.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.2.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.2.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.3.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.3.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.3.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.3.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.3.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.3.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.3.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.3.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.4.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.4.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.4.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.4.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.4.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.4.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.4.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.4.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.5.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.5.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.5.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.5.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.5.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.5.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.5.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.5.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.6.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.6.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.6.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.6.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.6.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.6.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.6.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.6.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.7.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.7.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.7.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.7.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.7.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.7.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.7.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.7.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.8.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.8.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.8.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.8.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.8.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.8.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.8.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.8.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.9.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.9.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.9.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.9.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.9.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.9.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.9.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.9.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.10.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.10.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.10.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.10.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.10.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.10.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.10.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.10.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.11.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.11.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.11.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.11.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.11.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.11.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.11.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.11.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.12.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.12.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.12.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.12.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.12.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.12.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.12.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.12.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.13.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.13.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.13.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.13.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.13.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.13.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.13.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.13.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.14.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.14.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.14.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.14.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.14.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.14.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.14.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.14.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.15.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.15.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.15.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.15.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.15.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.15.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.15.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.15.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.16.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.16.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.16.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.16.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.16.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.16.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.16.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.16.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.17.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.17.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.17.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.17.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.17.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.17.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.17.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.17.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.18.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.18.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.18.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.18.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.18.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.18.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.18.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.18.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.19.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.19.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.19.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.19.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.19.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.19.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.19.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.19.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.20.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.20.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.20.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.20.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.20.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.20.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.20.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.20.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.21.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.21.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.21.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.21.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.21.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.21.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.21.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.21.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.22.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.22.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.22.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.22.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.22.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.22.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.22.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.22.layer.1.EncDecAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.23.layer.0.SelfAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.23.layer.0.SelfAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.23.layer.0.SelfAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.23.layer.0.SelfAttention.v.lora_B.default.weight
base_model.model.language_model.decoder.block.23.layer.1.EncDecAttention.q.lora_A.default.weight
base_model.model.language_model.decoder.block.23.layer.1.EncDecAttention.q.lora_B.default.weight
base_model.model.language_model.decoder.block.23.layer.1.EncDecAttention.v.lora_A.default.weight
base_model.model.language_model.decoder.block.23.layer.1.EncDecAttention.v.lora_B.default.weight
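
Adapter names of this shape are what HuggingFace PEFT produces when LoRA is attached to the q and v projections of the T5 language model inside a BLIP-2-style model. A minimal sketch that reproduces names of this form (the checkpoint and the LoRA hyperparameters below are illustrative assumptions, not necessarily this repo's actual settings):

# Minimal sketch: attach LoRA to the q/v projections of the T5 inside BLIP-2.
# The checkpoint name, rank and alpha here are assumptions for illustration.
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q", "v"],  # matches the SelfAttention.q/v and EncDecAttention.q/v names above
)
model = get_peft_model(base, lora_config)

# Prints names of the form base_model.model.language_model. ... .lora_A.default.weight
for name, _ in model.named_parameters():
    if "lora_" in name:
        print(name)
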
qwqwq1445 commented 2 months ago

Will you kindly release the checkpoint of the best BLIP2 model on OKVQA?

I will check if I have such a checkpoint. But you can easily train it with the codebase by setting K=1 and commenting out the part that appends retrieved documents to the query.

Please, could you release the best checkpoint? I am really having difficulties reproducing the results (crying).

LinWeizheDragon commented 2 months ago

Sorry, I forgot to upload it after I downloaded the checkpoint a few days ago. The checkpoint is here: https://drive.google.com/file/d/1crK9raqB_zWeybLvewTUVxpqiC4p89kG/view?usp=sharing

I just ran a test on this checkpoint using the following script:

python src/main.py \
    --experiment_name "OKVQA_RAG_BLIP2(t5-xl)" \
    --config "configs/rag/okvqa/RAG_BLIP2_with_FLMR.jsonnet" \
    --modules static_retrieval ignore_knowledge_passages \
    --reset --override \
    --mode test \
    --test_suffix K0 \
    --opts test.trainer_paras.accelerator=auto \
             test.trainer_paras.devices=auto \
             test.trainer_paras.strategy=ddp_find_unused_parameters_true \
             test.trainer_paras.precision="bf16" \
             test.batch_size=16 \
             model_config.num_beams=2 \
             model_config.num_knowledge_passages=1 \
             model_config.num_knowledge_passages_in_training=1 \
             train.load_model_path="path/to/model_step_3755.ckpt"

The test results are:

wandb:   K0_test/OKVQADataset.test/accuracy_AnswerType_other 55.77
wandb: K0_test/OKVQADataset.test/accuracy_QuestionType_eight 57.79
wandb:  K0_test/OKVQADataset.test/accuracy_QuestionType_five 57.47
wandb:  K0_test/OKVQADataset.test/accuracy_QuestionType_four 56.54
wandb:  K0_test/OKVQADataset.test/accuracy_QuestionType_nine 49.76
wandb:   K0_test/OKVQADataset.test/accuracy_QuestionType_one 51.81
wandb: K0_test/OKVQADataset.test/accuracy_QuestionType_other 56.06
wandb: K0_test/OKVQADataset.test/accuracy_QuestionType_seven 55.09
wandb:   K0_test/OKVQADataset.test/accuracy_QuestionType_six 51.06
wandb:   K0_test/OKVQADataset.test/accuracy_QuestionType_ten 60.0
wandb: K0_test/OKVQADataset.test/accuracy_QuestionType_three 56.68
wandb:   K0_test/OKVQADataset.test/accuracy_QuestionType_two 55.81
wandb:            K0_test/OKVQADataset.test/accuracy_overall 55.77
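
For reference, these numbers are soft VQA accuracies. A minimal sketch of that metric (it skips the official answer normalisation and the leave-one-annotator-out averaging, so it approximates rather than reproduces the evaluation code used here):

def vqa_soft_accuracy(prediction, annotator_answers):
    # Soft VQA accuracy: full credit if at least 3 of the (usually 10)
    # annotators gave the predicted answer, partial credit otherwise.
    matches = sum(1 for ans in annotator_answers if ans == prediction)
    return min(matches / 3.0, 1.0)

# e.g. 2 of 10 annotators match exactly -> 0.67 credit for this question
print(vqa_soft_accuracy("winter", ["winter", "winter", "snow", "cold weather", "ski",
                                   "winter season", "snowy", "cold", "skiing", "snow"]))
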
qwqwq1445 commented 2 months ago

Are these the best results without GS knowledge and with text-based vision features?

LinWeizheDragon commented 2 months ago

Yes, I think so.

qwqwq1445 commented 1 month ago

When finetuning BLIP2, you concatenate additional information such as OCR results, object tags, and image captions with the question. Could you give me an example of how you put them together, e.g. "Knowledge: {} Question: {} OCR: {} Object Tags: {} Image Caption: {}"? I am especially confused about how to turn the object tag information into text. I am having difficulties reproducing the results in my own code, and I would really appreciate your help.

LinWeizheDragon commented 1 month ago

Hi, you can reopen this issue or create a new issue when posting a new question.

The format should be: "Question: [question] Caption: [image caption]. Objects: [obj1, obj2, obj3, ...] [ocr if exists]. Knowledge: [retrieved passage] Answer: ". The relevant code is here: https://github.com/LinWeizheDragon/Retrieval-Augmented-Visual-Question-Answering/blob/7b109279fe9022df4250e34f28de8fe549e84d24/src/models/rag/rag_model_blip.py#L602
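
For concreteness, a minimal sketch of assembling that string (the helper below and the example values are illustrative only; the actual implementation is in the linked rag_model_blip.py):

def build_blip2_input(question, caption, objects, ocr_text, passage):
    # Illustrative helper, not the repo's API: joins the fields in the
    # "Question ... Caption ... Objects ... Knowledge ... Answer:" format above.
    text = f"Question: {question} Caption: {caption}."
    text += " Objects: " + ", ".join(objects)  # object tags become a comma-separated list of class names
    if ocr_text:                               # OCR text is only appended when it exists
        text += f" {ocr_text}"
    text += "."
    return text + f" Knowledge: {passage} Answer: "

print(build_blip2_input(
    question="What season is shown here?",
    caption="a man skiing down a snowy slope",
    objects=["person", "skis", "snow"],
    ocr_text="",
    passage="Skiing is a recreational activity and competitive winter sport ...",
))

As the format suggests, the object tags are simply the detected class names joined with commas; no special encoding is needed.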