hugddygff opened this issue 5 years ago
The parameters are shown in the code or training commands. If you are training without initialization, you should train at least 20000 iterations.
By the way, the checkpoints are released on Google Drive.
@fengyang0317 Thanks for your quick reply. I used the model-10000 you provided and obtained the results in the paper. Is this model using the object2sent initialization? I tried a model-20000 (trained by myself) without initialization and got a CIDEr of 5, which is really bad. Can you help me?
Are you training using the following command?
python im_caption_full.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
--multi_gpu --batch_size 512 --save_checkpoint_steps 1000\
--gen_lr 0.001 --dis_lr 0.001
On Google Drive, the saving folder contains the checkpoints trained with initialization, and the no_init folder contains the checkpoints trained without initialization.
Yes, I just downloaded the code and ran it without modification. But I used batch_size 256 without multi-GPU; does this matter?
I think the batch_size 256 is ok. You may need to reduce the learning rate too. Can you try 5e-4?
Thanks, I will try 5e-4.
Also, I found a small problem: the size of my trained model/graph file is a little different from yours, as below (yours vs. ours; I directly ran the command).
Are you using tensorflow==1.13.1?
I use TF 1.12, but training and evaluation run without problems.
1.12 should be fine. I used some earlier versions in 2018.
@fengyang0317 Thanks for your patient help, I solved this problem successfully.
How did you solve the problem? Someone else is facing similar issues.
@fengyang0317 When I trained for a second time, everything was OK. My analysis is as follows (it may be wrong): the test metrics are closely related to the training loss. Even when the number of training steps is large enough, the loss is occasionally noisy, i.e. negative (the ordinary loss is about 2.5). The first time I tested, the checkpoint I picked happened to be one where the loss was negative.
What does "trained for a second time" mean, training from scratch? That only increases the training time; how does it overcome the impact of the negative loss?
I inspect the rewards in TensorBoard to see whether the training is healthy. The event files are also provided on Google Drive.
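If anyone wants to check those curves programmatically instead of through the TensorBoard UI, here is a minimal sketch using TensorBoard's EventAccumulator; the event directory path is a placeholder, and the scalar tag is assumed to be 'loss' (or 'loss1' in the provided event files), as noted further down in this thread.
from tensorboard.backend.event_processing import event_accumulator

# Point this at the directory containing the events.out.tfevents.* file (path is a placeholder).
ea = event_accumulator.EventAccumulator('saving/', size_guidance={event_accumulator.SCALARS: 0})
ea.Reload()
print(ea.Tags()['scalars'])            # list the scalar tags actually present
for e in ea.Scalars('loss'):           # the reward curve discussed in this thread
    print(e.step, e.value)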
@songpipi I mean that I deleted the models and reran the code without any changes. I advise you to test more checkpoints, e.g. 20000, 21000, 22000, ...; only a single one may have the problem.
@songpipi Only a single checkpoint may deviate from the stable value. Another suggestion is to delete the models and rerun im_caption_full.py.
@songpipi The stable value is the policy reward. Its name in TensorBoard is loss, and the corresponding name in the event files the author provided is loss1.
Sorry, I still haven't solved this problem. I evaluated all the checkpoints up to 30000 steps, and all of them are bad. The only thing I changed was using unpaired captions, and I commented out the code related to the GloVe embedding in process_descriptions.py. I don't know how to troubleshoot this; can you give me some advice, please?
@songpipi You could first retrain the model for 20000 steps and observe the performance (without masking the GloVe embedding code). Also, what do you mean by "The only thing I changed was using unpaired captions"? The author's default setting already uses unpaired crawled captions.
I use the unpaired COCO captions because I want to reproduce the results of Table 2 in the paper, so my word_counts and sentence.tfrec are different from the ones provided by the author. Now I'm trying to locate the problem from the loss values; I find the loss can be negative in the terminal output.
Could you first try to reproduce the results in table 1?
@fengyang0317 Yes, now I can roughly reproduce the results in table 1. The unpaired setting hasn't worked yet. Apart from the input data, are there any differences between the two experiments?
The difference lies in the preprocessing. You need to generate a new word_counts file and regenerate the subsequent files.
How can I get the new glove_vocab.pkl?
You don't need glove_vocab.pkl in the unpaired setting. Just use all the words appearing more than four times in the MSCOCO training set.
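For anyone rebuilding the vocabulary for the unpaired MSCOCO setting, a rough sketch of that counting step is below. The annotation path, tokenization, and output format are assumptions rather than the repo's exact preprocessing, and the category names mentioned later in this thread would still need to be appended.
import json
from collections import Counter

# Path to the MSCOCO training captions (placeholder; adjust to your setup).
with open('annotations/captions_train2014.json') as f:
    annotations = json.load(f)['annotations']

counter = Counter()
for ann in annotations:
    # Naive lowercase/whitespace tokenization; the repo's own preprocessing may differ.
    counter.update(ann['caption'].lower().replace('.', ' ').split())

# Keep words appearing more than four times, writing one "word count" pair per line.
with open('word_counts.txt', 'w') as f:
    for word, count in counter.most_common():
        if count > 4:
            f.write('%s %d\n' % (word, count))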
Hi Yang, I want to ask about the performance without the object reward: are the metrics much worse than those of the method with the object reward? If possible, can you share some results on the COCO test set? Thanks.
@hugddygff I have not tried to train the model you mentioned. I will train one a few months later and share the results.
@hugddygff Hi, have you tried to reproduce the results of Table 2? I had some trouble reproducing them.
@fengyang0317 Hi Yang, I am trying to reproduce the results in Table 2. I replaced the crawled sentence corpus with the COCO captions and got a new vocabulary of 11321 words (frequency >= 4, plus the category names). Then I used the new word_counts.txt, sentence.pkl and sentence.tfrec, and also changed vocab_size, start_id, and end_id in config.py; everything else stayed the same. Without initialization, I trained for 40000 steps and randomly tested five checkpoints between 20000 and 40000 steps; the CIDEr scores are all around 6, which is very bad. Maybe the hyperparameters need to be tuned? Can you give me some advice? Thank you.
In fact, the hyperparameters reported in the paper are the same for both datasets. I will provide the input files and checkpoints for MSCOCO later.
@fengyang0317 Sorry, it's still unresolved. I tried setting w_mse=0.2 for the COCO captions, but that doesn't work either. If it's convenient, I'd like to try your input files. Thanks!
Hi @fengyang0317 , thanks for your contribution! How many GPUs do you use for your default setting? To train with initialization, I use your suggested command as follows:
python im_caption_full.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
--imcap_ckpt saving_imcap/model.ckpt-18000\
--sae_ckpt sen_gan/model.ckpt-30000 --multi_gpu --batch_size 512\
--save_checkpoint_steps 1000 --gen_lr 0.001 --dis_lr 0.001
I tried 1 and 4 GPU(s). With 1 GTX Titan X GPU, I got performance similar to (but still worse than) yours within 4000 steps (~6 h per 1k steps).
Test results at 4000 iter
CIDEr: 0.280
Bleu_4: 0.051
Bleu_3: 0.104
Bleu_2: 0.212
Bleu_1: 0.398
ROUGE_L: 0.275
METEOR: 0.122
WMD: 0.085
SPICE: 0.080
But with 4 P100 GPUs, I only got performance similar to (but still worse than) yours after more than 15000 steps (~1 h per 1k steps).
Test results at 15000 iter
ratio: 0.855401562006
Bleu_1: 0.390
Bleu_2: 0.218
Bleu_3: 0.111
Bleu_4: 0.057
METEOR: 0.121
ROUGE_L: 0.276
CIDEr: 0.278
Maybe I should tune the learning rate for 4-GPU training, but I notice (from TensorBoard) that your training time is about 1.5 h per 1k steps, which is similar to my 4-GPU setting. Do you have any suggestions? FYI, I got similar performance under the without-initialization setting with 4 P100s at 34k steps:
CIDEr: 0.228
Bleu_4: 0.043
Bleu_3: 0.090
Bleu_2: 0.196
Bleu_1: 0.373
ROUGE_L: 0.264
METEOR: 0.114
WMD: 0.078
SPICE: 0.065
BTW, do you support multi-GPU evaluation? The evaluation/testing is very time-consuming (1.5-2.5 h). Which phase do you think costs the most time? Thanks!
@HYPJUDY I used 4 GPUs. I think using batch_size=512 and lr=1e-3 should produce similar results. The second set of results you posted looks good enough to me.
The current eval_all.py supports multi-GPU evaluation. To use 4 GPUs, just set CUDA_VISIBLE_DEVICES='0,1,2,3' and --threads=12, for example as shown below.
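For reference, that invocation would look something like the following (only the settings mentioned here are shown; any other eval_all.py flags are left at their defaults):
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_all.py --threads 12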
Hi @fengyang0317 , I tried the multi-GPU, multi-thread setting for evaluation. It seems the time cost is the same as manually evaluating each model on a different GPU in parallel.
I mean, the time cost of evaluating four models with CUDA_VISIBLE_DEVICES='0,1,2,3' and --threads=12
is the same as evaluating models A/B/C/D on GPUs 0/1/2/3 at the same time, so a single evaluation/test run still cannot be sped up.
The evaluation/testing time (1.5 h for 5k images) is longer than the training time (1 h per 1k steps with batch size 512), right? I am confused why.
BTW, why do you say "The last element in the b34.json file is the best checkpoint", i.e. the one with the largest sum of Bleu_3 and Bleu_4? Why not CIDEr?
Thanks!
@HYPJUDY You are correct, it doesn't speed up a single eval. I may provide the eval script you want in the future.
Sometimes, the CIDEr increases and the captioning quality drops. So I use BLEU3+BLEU4 as the metric to choose the best checkpoint.
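As a small illustration of that selection rule (the dict structure and the numbers here are hypothetical, not the actual b34.json layout):
# Hypothetical per-checkpoint scores; in practice these come from the eval outputs.
scores = {
    18000: {'Bleu_3': 0.104, 'Bleu_4': 0.051},
    19000: {'Bleu_3': 0.111, 'Bleu_4': 0.057},
}
best = max(scores, key=lambda step: scores[step]['Bleu_3'] + scores[step]['Bleu_4'])
print('best checkpoint:', best)   # 19000 in this toy example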
Thanks, and I'm looking forward to your files for the unpaired setting!
Hi, I am also trying to reproduce the results in Table 2. Could you tell me what changes you made to start_id and end_id in config.py?
start_id is the line number of <S> and end_id is the line number of </S> in the dictionary file.
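A small sketch of how those ids could be looked up, assuming word_counts.txt stores one "word count" pair per line; whether config.py expects 0-based or 1-based indices should be double-checked against the original settings:
# Read the dictionary file (one "word count" pair per line is assumed).
with open('word_counts.txt') as f:
    words = [line.split()[0] for line in f]

start_id = words.index('<S>')    # 0-based position; add 1 if config.py uses 1-based line numbers
end_id = words.index('</S>')
print('vocab_size:', len(words), 'start_id:', start_id, 'end_id:', end_id)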
Hi Yang, I am reproducing the experiments in the paper. Can you give some hints on the best combination of hyperparameters, such as batch_size, learning_rate, and the most suitable number of steps? I am currently using the default parameters from the paper and training for 7000 steps, obtaining a CIDEr of 6.1, which is far from the 20+ in the paper. Can you give me some help?
Thanks very much!