cszmli / Dp-without-Adv

Guided dialogue policy learning without adversarial learning in the loop

How to reproduce the performance match the paper? #7

Closed ruleGreen closed 3 years ago

ruleGreen commented 3 years ago

Hello, I am confused about the different settings for the different methods. I find that the convergence speed of the various methods differs from your paper. Can you share more details about the experiments, e.g., which methods need a warm-up, and how do you set the maximum number of warm-up epochs?

ruleGreen commented 3 years ago

[screenshot attached]

I checked the code and changed almost nothing except setting the reward type to DISC. I also checked the provided log; most of it is still under 0.2 at the point where my run reaches a 0.6~0.8 success rate.

cszmli commented 3 years ago
  1. All PPO methods need warm-up, as I mentioned in the paper.
  2. When to stop the warm-up? That is up to you. Slightly pre-training your policy net helps the RL training; it is almost impossible to train PPO from scratch in ConvLab. But if you warm up your model too much, you may not be able to distinguish the performance of the different RL methods (a rough sketch of this kind of warm-up is shown below the list).
  3. To make GAIL and AIRL work, you need teacher forcing during RL training. Also, GAIL and AIRL are not my contributions; they are just baselines.
  4. By the way, the results have already been reproduced by others.
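
For anyone trying to reproduce this: below is a minimal sketch of the kind of policy-net warm-up described in point 2, assuming a multi-label dialogue-act policy as in ConvLab. All names here (`warmup_policy`, `expert_loader`, `warmup_epochs`) are placeholders for illustration, not the repo's actual code.

```python
# Illustrative only: supervised warm-up (behaviour cloning) of the policy net
# before PPO. `policy`, `expert_loader`, and `warmup_epochs` are placeholder
# names, not the actual Dp-without-Adv code.
import torch
import torch.nn as nn

def warmup_policy(policy: nn.Module, expert_loader, warmup_epochs=5, lr=1e-4):
    """Pre-train the policy on (state, expert_action) pairs from the corpus."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()  # dialogue acts are multi-label, so BCE per act
    for _ in range(warmup_epochs):  # keep this small: too much warm-up hides RL differences
        for state, expert_action in expert_loader:
            loss = bce(policy(state), expert_action.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```
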
ruleGreen commented 3 years ago

Thank you for your quick response. As I posted above, I find that PPO + GAIL works better than in the paper, not worse, and I did not change anything except setting the reward type to DISC, so I am a little confused and wondering whether something is wrong. Lastly, thank you for your great work.

cszmli commented 3 years ago

The performance of GAIL and AIRL is quite tricky. They both rely on TF to achieve relatively high success rates. But as I mentioned in the paper, TF without RL can achieve higher scores than GAIL and AIRL, which is also why we want to get rid of adversarial RL.
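
As a side note for reproducers, here is one way the teacher forcing mentioned above could be mixed into RL training: occasionally take a supervised (MLE) step on human (state, action) pairs between PPO updates. This is only a sketch under that assumption; `policy`, `ppo_update`, `expert_loader`, and `tf_ratio` are placeholder names, not the actual GAIL/AIRL training code in this repo.

```python
# Illustrative only: interleave occasional supervised steps on expert data
# (teacher forcing) with the usual PPO updates that use the adversarial reward.
import random
import torch
import torch.nn as nn

def teacher_forcing_step(policy, expert_batch, optimizer):
    """One supervised step on expert data to keep the policy close to the corpus."""
    state, expert_action = expert_batch
    loss = nn.BCEWithLogitsLoss()(policy(state), expert_action.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def rl_epoch_with_tf(policy, ppo_update, expert_loader, optimizer, tf_ratio=0.25):
    ppo_update(policy)                      # the usual PPO update with the adversarial reward
    if random.random() < tf_ratio:          # teacher-force with some probability
        teacher_forcing_step(policy, next(iter(expert_loader)), optimizer)
```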

ruleGreen commented 3 years ago

Yes, you are right. But I assumed that even with TF, PPO + GAIL could not reach such a high success rate (0.6-0.7) within 10k frames.

ruleGreen commented 3 years ago

For PPO(human), I set the reward type to OFFGAN, so self.pretrain_finished will be False (line 176 of ppo.py) and the message "warm up value net" should appear in the log. Why does it not appear in your log?

cszmli commented 3 years ago

Sorry, I didn't really get your question. Do you mean why I didn't record the values of the value net in the shared logs? The experiments for PPO(human) and PPO(offgan) were finished early, and we didn't have a value-net warm-up for these two agents. This was also the setup for GAIL and AIRL at the beginning, but we found it was almost impossible to get stable training for GAIL and AIRL, so we then incorporated the value-net warm-up for those two agents during the first 2k frames. Since the only difference is whether we warm up the value net for the first 2k frames, this should not affect the overall performance. Of course, you can warm up the value net for PPO(human) and PPO(offgan) as well, just like the current code does.
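
For readers following along, a minimal sketch of a value-net warm-up along the lines described above: for the first ~2k frames the (pre-trained) policy stays fixed and only the value net is regressed onto empirical returns. The names `warmup_value_net`, `collect_frames`, and `compute_returns` are placeholders for illustration, not the actual ppo.py code.

```python
# Illustrative only: fit the value net to returns from a frozen policy before
# letting PPO update the policy itself.
import torch
import torch.nn as nn

def warmup_value_net(value_net, collect_frames, compute_returns,
                     warmup_frames=2000, lr=1e-4):
    optimizer = torch.optim.Adam(value_net.parameters(), lr=lr)
    mse = nn.MSELoss()
    seen = 0
    while seen < warmup_frames:
        states, rewards, masks = collect_frames()   # rollouts from the frozen policy
        returns = compute_returns(rewards, masks)   # e.g. discounted Monte-Carlo returns
        loss = mse(value_net(states).squeeze(-1), returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        seen += rewards.size(0)
    return value_net
```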