hugddygff opened this issue 5 years ago
The parameters are shown in the code or training commands. If you are training without initialization, you should train at least 20000 iterations.
By the way, the checkpoints are released on Google Drive.
@fengyang0317 Thanks for your quick reply. I used the model-10000 you provided and obtained the results in the paper. Is this model using the object2sent initialization? I tried a model-20000 (trained by myself) without initialization and got a CIDEr of 5, which is really bad. Can you help me?
Are you training using the following command?
python im_caption_full.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
--multi_gpu --batch_size 512 --save_checkpoint_steps 1000\
--gen_lr 0.001 --dis_lr 0.001
On Google Drive, the saving folder contains the checkpoints trained with initialization, and the no_init folder contains the checkpoints trained without initialization.
Yes, I just downloaded the code and ran it without modification. But I used batch_size 256 without multi-GPU; does this matter?
I think the batch_size 256 is ok. You may need to reduce the learning rate too. Can you try 5e-4?
Thanks, I will try 5e-4.
Also, I found a small problem: the size of my trained model/graph file is a little different from yours, as below (yours vs. ours; I directly ran the command).
Are you using tensorflow==1.13.1?
I use TF 1.12, but training and evaluation run without problems.
1.12 should be fine. I used some earlier versions in 2018.
@fengyang0317 Thanks for your patient help, I solved this problem successfully.
How did you solve the problem? Someone else is facing similar issues.
@fengyang0317 When I trained for a second time, everything was OK. My analysis is as follows (it may be wrong): the test metrics are closely related to the training loss. Even when the number of training steps is large enough, the loss is occasionally noisy, i.e. negative (the ordinary loss is about 2.5). The first time I tested, the checkpoint I picked happened to be one where the loss was negative.
What does "trained for a second time" mean, training from scratch? That only increases the training time; how does it overcome the impact of the negative loss?
I inspect the rewards in TensorBoard to see whether the training is healthy. The event files are also provided on Google Drive.
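If anyone wants to check those curves programmatically instead of through the TensorBoard UI, here is a minimal sketch using TensorBoard's EventAccumulator; the event directory path is a placeholder, and the scalar tag is assumed to be 'loss' (or 'loss1' in the provided event files), as noted further down in this thread.
from tensorboard.backend.event_processing import event_accumulator

# Point this at the directory containing the events.out.tfevents.* file (path is a placeholder).
ea = event_accumulator.EventAccumulator('saving/', size_guidance={event_accumulator.SCALARS: 0})
ea.Reload()
print(ea.Tags()['scalars'])            # list the scalar tags actually present
for e in ea.Scalars('loss'):           # the reward curve discussed in this thread
    print(e.step, e.value)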
@songpipi I mean that I deleted the models and reran the code without any changes. I advise you to test more checkpoints, e.g. 20000, 21000, 22000, ...; only a single one may have the problem.
@songpipi Only a single checkpoint may deviate from the stable value. Another suggestion is to delete the models and rerun im_caption_full.py.
@songpipi The stable value is the policy reward. Its name in TensorBoard is loss, and the corresponding name in the event files the author provided is loss1.
Sorry, I still haven't solved this problem. I evaluated all the checkpoints up to 30000 steps, and all of them are bad. The only thing I changed was using unpaired captions, and I commented out the code related to the GloVe embedding in process_descriptions.py. I don't know how to troubleshoot this; can you give me some advice, please?
@songpipi You could first retrain the model for 20000 steps and observe the performance (without masking the GloVe embedding code). Also, what do you mean by "The only thing I changed was using unpaired captions"? The author's default setting already uses unpaired crawled captions.
I use the unpaired COCO captions because I want to reproduce the results of Table 2 in the paper, so my word_counts and sentence.tfrec are different from the ones provided by the author. Now I'm trying to locate the problem from the loss values; I find the loss can be negative in the terminal output.
Could you first try to reproduce the results in table 1?
@fengyang0317 Yes, now I can roughly reproduce the results in table 1. The unpaired setting hasn't worked yet. Apart from the input data, are there any differences between the two experiments?
The difference lies in the preprocessing. You need to generate a new word_counts file and regenerate the subsequent files.
How can I get the new glove_vocab.pkl?
You don't need glove_vocab.pkl in the unpaired setting. Just use all the words appearing more than four times in the MSCOCO training set.
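For anyone rebuilding the vocabulary for the unpaired MSCOCO setting, a rough sketch of that counting step is below. The annotation path, tokenization, and output format are assumptions rather than the repo's exact preprocessing, and the category names mentioned later in this thread would still need to be appended.
import json
from collections import Counter

# Path to the MSCOCO training captions (placeholder; adjust to your setup).
with open('annotations/captions_train2014.json') as f:
    annotations = json.load(f)['annotations']

counter = Counter()
for ann in annotations:
    # Naive lowercase/whitespace tokenization; the repo's own preprocessing may differ.
    counter.update(ann['caption'].lower().replace('.', ' ').split())

# Keep words appearing more than four times, writing one "word count" pair per line.
with open('word_counts.txt', 'w') as f:
    for word, count in counter.most_common():
        if count > 4:
            f.write('%s %d\n' % (word, count))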
Hi Yang, I want to ask about the performance without the object reward: are the metrics much worse than those of the method with the object reward? If possible, can you share some results on the COCO test set? Thanks.
@hugddygff I have not tried to train the model you mentioned. I will train one a few months later and share the results.
@hugddygff Hi, have you tried to reproduce the results of Table 2? I had some trouble reproducing them.
@fengyang0317 Hi Yang, I am trying to reproduce the results in Table 2. I replaced the crawled sentence corpus with the COCO captions and got a new vocabulary of 11321 words (frequency >= 4, plus the category names). Then I used the new word_counts.txt, sentence.pkl and sentence.tfrec, and also changed vocab_size, start_id, and end_id in config.py; everything else stayed the same. Without initialization, I trained for 40000 steps and randomly tested five checkpoints between 20000 and 40000 steps; the CIDEr scores are all around 6, which is very bad. Maybe the hyperparameters need to be tuned? Can you give me some advice? Thank you.
In fact, the hyperparameters reported in the paper are the same for both datasets. I will provide the input files and checkpoints for MSCOCO later.
@fengyang0317 Sorry, it's still unresolved. I tried setting w_mse=0.2 for the COCO captions, but that doesn't work either. If it's convenient, I'd like to try your input files. Thanks!
Hi @fengyang0317 , thanks for your contribution! How many GPUs do you use for your default setting? To train with initialization, I use your suggested command as follows:
python im_caption_full.py --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
--imcap_ckpt saving_imcap/model.ckpt-18000\
--sae_ckpt sen_gan/model.ckpt-30000 --multi_gpu --batch_size 512\
--save_checkpoint_steps 1000 --gen_lr 0.001 --dis_lr 0.001
I tried 1 and 4 GPU(s). With 1 GTX Titan X GPU, I got performance similar to (but still worse than) yours within 4000 steps (~6 h per 1k steps).
Test results at 4000 iter
CIDEr: 0.280
Bleu_4: 0.051
Bleu_3: 0.104
Bleu_2: 0.212
Bleu_1: 0.398
ROUGE_L: 0.275
METEOR: 0.122
WMD: 0.085
SPICE: 0.080
But with 4 P100 GPUs, I only got performance similar to (but still worse than) yours after more than 15000 steps (~1 h per 1k steps).
Test results at 15000 iter
ratio: 0.855401562006
Bleu_1: 0.390
Bleu_2: 0.218
Bleu_3: 0.111
Bleu_4: 0.057
METEOR: 0.121
ROUGE_L: 0.276
CIDEr: 0.278
Maybe I should tune the learning rate for 4-GPU training, but I notice (from TensorBoard) that your training time is about 1.5 h per 1k steps, which is similar to my 4-GPU setting. Do you have any suggestions? FYI, I got similar performance under the without-initialization setting with 4 P100s at 34k steps:
CIDEr: 0.228
Bleu_4: 0.043
Bleu_3: 0.090
Bleu_2: 0.196
Bleu_1: 0.373
ROUGE_L: 0.264
METEOR: 0.114
WMD: 0.078
SPICE: 0.065
BTW, do you support multi-GPU evaluation? The evaluation/testing is very time-consuming (1.5-2.5 h). Which phase do you think costs the most time? Thanks!
@HYPJUDY I used 4 GPUs. I think using batch_size=512 and lr=1e-3 should produce similar results. The second set of results you posted looks good enough to me.
The current eval_all.py supports multi-GPU evaluation. To use 4 GPUs, just set CUDA_VISIBLE_DEVICES='0,1,2,3' and --threads=12, for example as shown below.
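For reference, that invocation would look something like the following (only the settings mentioned here are shown; any other eval_all.py flags are left at their defaults):
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_all.py --threads 12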
Hi @fengyang0317 , I tried the multi-GPU, multi-thread setting for evaluation. It seems the time cost is the same as manually evaluating each model on a different GPU in parallel.
I mean, the time cost of evaluating four models with CUDA_VISIBLE_DEVICES='0,1,2,3' and --threads=12
is the same as evaluating models A/B/C/D on GPUs 0/1/2/3 at the same time, so a single evaluation/test run still cannot be sped up.
The evaluation/testing time (1.5 h for 5k images) is longer than the training time (1 h per 1k steps with batch size 512), right? I am confused why.
BTW, why do you say "The last element in the b34.json file is the best checkpoint", i.e. the one with the largest sum of Bleu_3 and Bleu_4? Why not CIDEr?
Thanks!
@HYPJUDY You are correct, it doesn't speed up a single eval. I may provide the eval script you want in the future.
Sometimes, the CIDEr increases and the captioning quality drops. So I use BLEU3+BLEU4 as the metric to choose the best checkpoint.
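As a small illustration of that selection rule (the dict structure and the numbers here are hypothetical, not the actual b34.json layout):
# Hypothetical per-checkpoint scores; in practice these come from the eval outputs.
scores = {
    18000: {'Bleu_3': 0.104, 'Bleu_4': 0.051},
    19000: {'Bleu_3': 0.111, 'Bleu_4': 0.057},
}
best = max(scores, key=lambda step: scores[step]['Bleu_3'] + scores[step]['Bleu_4'])
print('best checkpoint:', best)   # 19000 in this toy example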
Thanks, and I'm looking forward to your files for the unpaired setting!
Hi, I am also trying to reproduce the results in Table 2. Could you tell me what changes you made to start_id and end_id in config.py?
start_id is the line number of <S> and end_id is the line number of </S> in the dictionary file.
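A small sketch of how those ids could be looked up, assuming word_counts.txt stores one "word count" pair per line; whether config.py expects 0-based or 1-based indices should be double-checked against the original settings:
# Read the dictionary file (one "word count" pair per line is assumed).
with open('word_counts.txt') as f:
    words = [line.split()[0] for line in f]

start_id = words.index('<S>')    # 0-based position; add 1 if config.py uses 1-based line numbers
end_id = words.index('</S>')
print('vocab_size:', len(words), 'start_id:', start_id, 'end_id:', end_id)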
Hi Yang, I am reproducing the experiments in the paper. Can you give some hints on the best combination of hyperparameters, such as batch_size, learning_rate, and the most suitable number of steps? I am currently using the default parameters from the paper and training for 7000 steps, obtaining a CIDEr of 6.1, which is far from the 20+ in the paper. Can you give me some help?
Thanks very much!