clip-vil / CLIP-ViL

[ICLR 2022] code for "How Much Can CLIP Benefit Vision-and-Language Tasks?" https://arxiv.org/abs/2107.06383

Captioning model training script fails #2

Closed j-min closed 3 years ago

j-min commented 3 years ago

Hi, I followed the data preparation steps and ran the training script for the default CLIP-RN50 model from the README. However, the training job crashes with the log below. Could you please check whether the current example training script is runnable?

$ cd /scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption
$ python tools/train.py --cfg configs/phrase1/clip_rn50_transformer_scl.yml
Warning: key N_enc not in args
Warning: key N_dec not in args
Warning: key d_model not in args
Warning: key d_ff not in args
Warning: key num_att_heads not in args
Warning: key dropout not in args
Warning: key REFORWARD not in args
DataLoader loading json file:  data/cocotalk.json
vocab size is  9487
DataLoader loading h5 file:  data/cocotalk_clip_RN50_fc data/cocotalk_clip_RN50_att data/cocotalk_box data/cocotalk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
Read data: 0.0007147789001464844
/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
iter 0 (epoch 0), train_loss = 9.158, time/batch = 25.627
Read data: 0.0002460479736328125
iter 1000 (epoch 0), train_loss = 4.920, time/batch = 0.183
Read data: 0.00023293495178222656
iter 2000 (epoch 0), train_loss = 3.784, time/batch = 0.194
Traceback (most recent call last):
  File "tools/train.py", line 293, in <module>
    train(opt)
  File "tools/train.py", line 246, in train
    val_loss, predictions, lang_stats = eval_utils.eval_split(
  File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/utils/eval_utils.py", line 171, in eval_split
    seq, seq_logprobs = model(fc_feats, att_feats, att_masks, opt=tmp_eval_kwargs, mode='sample')
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 5 on device 5.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/models/CaptionModel.py", line 33, in forward
    return getattr(self, '_'+mode)(*args, **kwargs)
TypeError: _sample() missing 2 required positional arguments: 'fc_feats' and 'att_feats'
sIncerass commented 3 years ago

Could you please launch the experiment with just one GPU and see whether it gives you the same issue? I always train with a single GPU, and that has been working fine.
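
For reference, one way to force a single-GPU run without touching the config (a minimal sketch; it relies only on the standard CUDA_VISIBLE_DEVICES environment variable, nothing specific to this repo) is to restrict the visible devices before torch is imported, e.g. at the very top of tools/train.py:

import os

# Expose only GPU 0 to PyTorch; this must run before the first `import torch`.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch
# With GPUs available this prints 1, so nn.DataParallel keeps a single replica.
print(torch.cuda.device_count())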

j-min commented 3 years ago

I succeeded in training the CLIP-RN50 model with a single GPU. Below is the evaluation result on the Karpathy test split. Could you please confirm that you saw similar results? I'd like to make sure I ran the script correctly, since the paper only reports SCST results, without any scores for MLE-based training.

{'Bleu_1': 0.7468949383241016,
'Bleu_2': 0.5810452442634662,
'Bleu_3': 0.44466729659825666,
'Bleu_4': 0.34033720376175897,
'METEOR': 0.27332929031601105,
'ROUGE_L': 0.5538997707453996,
'CIDEr': 1.1062112636066934,
'SPICE': 0.20551119536241158,
'WMD': 0.5629711371394192,
'perplexity': 0.5224986911565065,
'entropy': 1.3382156711161137,
'SPICE_Relation': 0.05422546691107958,
'SPICE_Cardinality': 0.07816728167281671,
'SPICE_Attribute': 0.11556938213280309,
'SPICE_Size': 0.05356541268950027,
'SPICE_Color': 0.14668237505316156,
'SPICE_Object': 0.36858569451532297,
'bad_count_rate': 0.001}

Btw, there are a couple of bugs in tools/eval.py. I needed to 1) comment out the line from captioning.data.dataloaderraw import *, since there is no dataloaderraw.py in captioning.data, and 2) create a vis/ directory to avoid an error in the last lines (a sketch of both workarounds follows the snippet below):

if opt.dump_json == 1:
    # dump the json
    json.dump(split_predictions, open('vis/vis.json', 'w'))
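
A minimal sketch of both workarounds (a hypothetical patch, not code that is already in the repo; it reuses the names opt and split_predictions from eval.py itself):

import json
import os

# 1) captioning.data has no dataloaderraw module, so guard the import instead
#    of letting eval.py crash at load time.
try:
    from captioning.data.dataloaderraw import *  # noqa: F401,F403
except ImportError:
    pass  # only needed when evaluating on a raw image folder

# 2) create the output directory before dumping the predictions.
if opt.dump_json == 1:
    os.makedirs('vis', exist_ok=True)
    json.dump(split_predictions, open('vis/vis.json', 'w'))
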
j-min commented 3 years ago

I also tried to run SCST with CIDEr using the command in the README, but hit the error below.

Traceback (most recent call last):
  File "tools/train.py", line 293, in <module>
    train(opt)
  File "tools/train.py", line 154, in train
    init_scorer(opt.cached_tokens)
  File "/scratch-space/CLIP-ViL/CLIP-ViL-Direct/caption/captioning/utils/rewards.py", line 27, in init_scorer
    CiderD_scorer = CiderD_scorer or CiderD(df=cached_tokens)
  File "cider/pyciderevalcap/ciderD/ciderD.py", line 28, in __init__
    self.cider_scorer = CiderScorer(n=self._n, df_mode=self._df)
  File "cider/pyciderevalcap/ciderD/ciderD_scorer.py", line 80, in __init__
    pkl_file = cPickle.load(open(os.path.join('data', df_mode + '.p'),'rb'), **(dict(encoding='latin1') if six.PY3 else {}))
FileNotFoundError: [Errno 2] No such file or directory: 'data/coco-train-idxs.p'

It seems that coco-train-idxs.p corresponds to the cached_tokens argument in opts.py. Do we need to prepare this file before starting SCST?

parser.add_argument('--cached_tokens', type=str, default='coco-train-idxs',
                    help='Cached token file for calculating cider score during self critical training.')
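
For reference, a quick pre-flight check before launching SCST (a minimal sketch based on the traceback above, which shows the scorer unpickling data/<cached_tokens>.p at init time):

import os

cached_tokens = 'coco-train-idxs'  # default value of --cached_tokens
pkl_path = os.path.join('data', cached_tokens + '.p')

# Fail early with a clear message instead of deep inside ciderD_scorer.py.
if not os.path.isfile(pkl_path):
    raise FileNotFoundError(
        f"{pkl_path} is missing; download or pre-compute the cached CIDEr "
        "n-gram statistics before starting self-critical training.")
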
j-min commented 3 years ago

I found coco-train-idxs.p in the ImageCaptioning.pytorch author's Google Drive. Could you please confirm whether I can use this file?

sIncerass commented 3 years ago

Sure, that file should work. Sorry, I should have put coco-train-idxs.p in the original repo. The MLE results seem to be close to what we have.

sIncerass commented 3 years ago

Thanks for the feedback; I've added more clarification to the README.

liujiaheng commented 2 years ago

@j-min Have you been able to reproduce the image captioning results with CLIP RN50 features?

[screenshot of the results table reported in the paper]

In the paper, the reported B@4 is 38.6, but I only get 32.0, as shown below.

[screenshot of the reproduced evaluation results]

sIncerass commented 2 years ago

Hi @liujiaheng, is that result on the 5000-image Karpathy split, and did you use the two-phase training?

liujiaheng commented 2 years ago

@sIncerass Can you provide the results on the 5000-image Karpathy split for phase-1 training with CLIP RN50 features?

liujiaheng commented 2 years ago

{'Bleu_1': 0.7514197530864043,
'Bleu_2': 0.5850870427589987,
'Bleu_3': 0.43979903834664746,
'Bleu_4': 0.3263438183989057,
'METEOR': 0.27168378207273514,
'ROUGE_L': 0.553913453445984,
'CIDEr': 1.092061834034407,
'SPICE': 0.20618343143439724,
'WMD': 0.5600404048324764,
'perplexity': 0.6155193274050951,
'entropy': 1.6811169483423234,
'SPICE_Relation': 0.06140779121218033,
'SPICE_Cardinality': 0.13671586715867157,
'SPICE_Attribute': 0.09548424840112993,
'SPICE_Size': 0.045554931686318544,
'SPICE_Color': 0.10765492310436131,
'SPICE_Object': 0.378444525407513,
'bad_count_rate': 0.0036}

These are the results I reproduced.

sIncerass commented 2 years ago

Hi @liujiaheng, here are the detailed evaluation results I got:

{'Bleu_1': 0.8007510865437192,
'Bleu_2': 0.6449325885322664,
'Bleu_3': 0.5008726056265775,
'Bleu_4': 0.381525622275518,
'METEOR': 0.2866526250312671,
'ROUGE_L': 0.583230803123437,
'CIDEr': 1.2582568357814914,
'SPICE': 0.2246589023419364,
'WMD': 0.26410401436038744,
'perplexity': 0.07665397078062525,
'entropy': 0.17617339862971568,
'SPICE_Relation': 0.06592518994248969,
'SPICE_Cardinality': 0.19065190651906516,
'SPICE_Attribute': 0.11274409396317843,
'SPICE_Size': 0.037413438143365146,
'SPICE_Color': 0.13960399775006516,
'SPICE_Object': 0.40403633034320335,
'bad_count_rate': 0.0006}