RitaRamo / smallcap

SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation


We now have a demo. Check it out: https://huggingface.co/spaces/RitaParadaRamos/SmallCapDemo


The code was developed in Python 3.9.

conda create -n smallcap python=3.9
conda activate smallcap
pip install -r requirements.txt

Evaluation package

Download Stanford models for computing SPICE (a slightly modified version of this repo):


Interacting with SmallCap

Our pretrained model is available on HuggingFace at Yova/SmallCap7M.

To use it, you also need the retrieval datastore:

mkdir datastore

Download the COCO index and associated captions and place them in datastore/.

See SmallCap_demo.ipynb for a demo of our pretrained model.

Training SmallCap

### Data

Download the COCO Karpathy splits file `dataset_coco.json` from [here](https://www.kaggle.com/datasets/shtvkumar/karpathy-splits) and place it in `data/`.

Download all COCO images (train, val and test, 2017 version) from [here](https://cocodataset.org/#download) and place them in `data/images`. The expected naming format is twelve digits followed by a `.jpg` extension, e.g. `data/images/000000000001.jpg` for image with COCO id `1`.

### Preprocessing

At the moment CLIP models based on ResNet are not available through HuggingFace so it is necessary to also install the original CLIP implementation from [here](https://github.com/openai/CLIP):

```
pip install git+https://github.com/openai/CLIP.git
```

Extract train and val features:

```
mkdir features
python src/extract_features.py
```

Retrieve captions

```
python src/retrieve_captions.py
```

### Model training

```
python train.py
```

Models are saved under name _M, e.g. `rag_7M` for a model trained with retrieval augmentation and 7M trainable parameters.

### Inference

```
python infer.py --model_path
```

If you also specify `--checkpoint_path` inference runs with only that checkpoint. Else, all checkpoints in `--model_path` are used.

If you specify `--infer_test` inference uses test data, else val data is used.

E.g. to run inference on the test split with model `rag_7M`, checkpoint `17712`, run

```
python infer.py --model_path experiments/rag_7M --checkpoint_path checkpoint-17712 --infer_test
```

The model predictions are stored as `_preds.json` in each respective checkpoint subdirectory.

Note: You can safely ignore the warning `Some weights of ThisGPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized...` It occurs because a new model is first built and then the pre-trained parameters are loaded into it.

### Evaluate predictions

```
python coco-caption/run_eval.py
```
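The image naming convention described under Data (twelve zero-padded digits followed by `.jpg`) can be checked before running preprocessing. The following is a minimal sketch, not part of the repository: the helper names `coco_image_path` and `missing_images` are hypothetical, and the default directory `data/images` is taken from the instructions above.

```python
from pathlib import Path


def coco_image_path(coco_id: int, image_dir: str = "data/images") -> Path:
    """Return the expected file path for a COCO image id.

    Hypothetical helper: the file name is the id zero-padded to
    twelve digits, followed by the .jpg extension.
    """
    if coco_id < 0:
        raise ValueError("COCO ids are non-negative integers")
    return Path(image_dir) / f"{coco_id:012d}.jpg"


def missing_images(coco_ids, image_dir: str = "data/images"):
    """Return the ids whose expected image file does not exist in image_dir."""
    return [i for i in coco_ids if not coco_image_path(i, image_dir).is_file()]


if __name__ == "__main__":
    # Image with COCO id 1 is expected at data/images/000000000001.jpg
    print(coco_image_path(1).as_posix())
```

Running `missing_images` over the ids you expect to use gives a quick way to spot images that were not downloaded or that were saved under a different naming scheme.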


If you find our code/data/models or ideas useful in your research, please consider citing the paper:

  title={SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation},
  author={Ramos, Rita and Martins, Bruno and Elliott, Desmond and Kementchedjhieva, Yova},