OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Question for v2 finetuning on coco caption #68

Open simplewhite9 opened 1 year ago

simplewhite9 commented 1 year ago

Hello, I am trying to reproduce llama-adapter v2 trained solely on COCO caption (referring to Table 3 in the paper) and I have a few questions regarding the reproduction.

  1. Are instructions such as "Generate a caption for this image" not used during training?
  2. Are LLaMA's norm, linear, and scale parameters trained along with the adapter and visual projection layers? Thanks.
csuhan commented 1 year ago

Thanks for your interest! The item in Tab. 3 should be LLaMA-Adapter V1. LLaMA-Adapter V2 does not perform well under traditional COCO caption metrics since it usually generates much longer captions. We will fix this in the revision.

  1. Yes. See our caption demo at https://huggingface.co/spaces/csuhan/LLaMA-Adapter/blob/48d8b02c0c335145b8b3d1ca7162ac42979bec93/app.py#L137
  2. Yes.

If you want to train LLaMA-Adapter V2 on COCO, please check our arxiv paper for more details. We will also release LLaMA-Adapter V2's training code soon.

yash-s20 commented 1 year ago

Hi! I've tried replicating the results by training the model under llama_adapter_v2_multimodal on train2014 (the training data for the COCO captioning task) for 150 epochs. I'm using exp/pretrain.sh, with BIAS-7B as the starting model. However, the quality of the outputs is simply not like that of the model in the demo available at http://llama-adapter.opengvlab.com/. For example, take this image (from the COCO captioning validation set): COCO_val2014_000000000143

For the prompt "Generate a caption for this image", the demo gives a detailed, high-quality caption (screenshot attached).

However, when prompted with the same question, the adapter model replies with only a short, simple caption (screenshot attached).

The demo model also does a much better job of answering specific questions about the image, like how many birds there are or what type of bird is in the image.

What am I missing in the training pipeline? Does the model need to be finetuned on alpaca_gpt4_data.json etc. for better-quality outputs? Is the model in the demo just using a bigger LLaMA model?

Also, is there a built-in script to score generated results on the test/val captions?
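
(For reference, I don't think the repo ships one; the standalone pycocoevalcap package is commonly used for this. A rough sketch, assuming the generated captions are saved in the COCO results format, i.e. a JSON list of {"image_id": ..., "caption": ...} entries, with placeholder paths:)

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth annotations and generated captions (paths are placeholders)
coco = COCO('/path/to/annotations/captions_val2014.json')
coco_res = coco.loadRes('/path/to/generated_captions.json')

coco_eval = COCOEvalCap(coco, coco_res)
# Only evaluate on the images we actually generated captions for
coco_eval.params['image_id'] = coco_res.getImgIds()
coco_eval.evaluate()

# Prints BLEU, METEOR, ROUGE_L, CIDEr (and SPICE if installed)
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.3f}')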

verigle commented 1 year ago

Could you please share the coco.csv file, or the code to generate coco.csv?

yash-s20 commented 1 year ago

Hi @verigle . The file is too big to attach here.

The code itself is simple: I generated a tab-separated CSV from the annotations file using the image id and caption. Run it with 2 arguments: the JSON file with the annotations, and the CSV file you want to write to. You can change "/path/to/" to wherever you store the training data.

import json
import sys
import pandas as pd

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print(f"usage: {sys.argv[0]} <captions_json> <output_csv>")
        exit()
    # Load the COCO caption annotations and sort them by image id
    data = json.load(open(sys.argv[1], 'r'))
    captions = data['annotations']
    captions.sort(key=lambda y: y['image_id'])
    tabular_data = []
    # COCO train2014 file names are the image id left-padded with zeros to 12 digits
    img_path = '/path/to/train2014/COCO_train2014_000000000000'
    for x in captions:
        image_id = str(x['image_id'])
        cap = x['caption']
        # Overwrite the trailing zeros of the template with the image id
        path = img_path[:-len(image_id)] + image_id + '.jpg'
        tabular_data.append((path, cap))
    # Write one tab-separated (image path, caption) pair per line
    tabular_data = pd.DataFrame(tabular_data, columns=['url', 'caption'])
    tabular_data.to_csv(sys.argv[2], sep='\t', index=False)
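
For reference, a typical invocation looks like this (the script name is just a placeholder for wherever you saved the snippet, and the annotation path points at the standard COCO 2014 captions file):

python make_coco_csv.py /path/to/annotations/captions_train2014.json coco.csv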

If you find issues with this, please reply back!

merlinarer commented 1 year ago

Thanks for sharing. Using the same settings as exp/pretrain.sh, the total number of steps is 0.6M / 8 / 4 * 150 ~= 2.8M. How much time did it take? And could you please share the training log?
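
(For clarity, that estimate assumes the 8 and 4 in exp/pretrain.sh are the per-GPU batch size and the number of GPUs; a quick check of the arithmetic:)

num_captions = 600_000   # ~0.6M COCO training captions
batch_size   = 8         # per-GPU batch size (assumed)
num_gpus     = 4         # number of GPUs (assumed)
epochs       = 150

steps_per_epoch = num_captions // (batch_size * num_gpus)   # 18,750
total_steps     = steps_per_epoch * epochs                   # 2,812,500 ~= 2.8M steps
print(total_steps)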