Closed. sherlock1987 closed this issue 3 years ago.
For WarmupDQN, you should run "rule_wdqn_bootstrap_memory" instead of "rule_wdqn". The commented code is used to load expert dialogue tuples to the expert buffer.
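For reference, a minimal sketch of what that commented block is likely doing, i.e. loading pickled expert dialogue tuples into a buffer. The function name, the `add(*transition)` interface, and the tuple layout are assumptions for illustration, not the repo's actual identifiers:

```python
import os
import pickle

def load_expert_tuples(expert_buffer_path, expert_buffer):
    """Load pickled expert dialogue tuples and push them into the expert buffer.

    `expert_buffer` is assumed to expose an `add(*transition)` method; adjust
    this to whatever interface the real buffer class uses.
    """
    if not os.path.exists(expert_buffer_path):
        return 0
    with open(expert_buffer_path, 'rb') as f:
        tuples = pickle.load(f)  # e.g. a list of (state, action, reward, next_state, done)
    for transition in tuples:
        expert_buffer.add(*transition)
    return len(tuples)
```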
Hey, thanks for your reply. Did you run the command rule_ppo when you tested your idea on PPO?
Hey, when I ran rule_ppo, I found another bug. It is caused by a mismatch between the model built by `reward_agent.AIRL` and the saved model `airl_pretrain.mdl`. The code is in `ppo.py`:
```python
self.discriminator = reward_agent.AIRL(use_gpu, train_feed, 64)
disc_mdl = './reward_model/airl_pretrain.mdl'
if os.path.exists(disc_mdl):
    self.discriminator.load_state_dict(torch.load(disc_mdl))
    print("successfully loaded the pretrained Disc model")
```
When this code runs and the state dict is loaded into `self.discriminator`, it raises the following error:
RuntimeError: Error(s) in loading state_dict for AIRL: Missing key(s) in state_dict: "model_g.0.weight", "model_g.0.bias", "model_g.2.weight", "model_g.2.bias", "model_h.0.weight", "model_h.0.bias", "model_h.2.weight", "model_h.2.bias". Unexpected key(s) in state_dict: "model_g.3.weight", "model_g.3.bias", "model_g.1.weight", "model_g.1.bias", "model_h.3.weight", "model_h.3.bias", "model_h.1.weight", "model_h.1.bias".
You don't need to load a pre-trained disc model for AIRL. During the warmup step, the policy model and the reward model of AIRL are pre-trained consecutively. You can take a look at the flags I defined in the code, such as `self.pretrain_finished` and `self.pretrain_disc_and_valud_finished`.
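Roughly, that two-stage warmup can be gated with flags like the ones mentioned above. The sketch below is illustrative only; the method names and step counts are placeholders, not the repo's actual code:

```python
class AIRLWarmup:
    """Illustrative only: pre-train the policy first, then the discriminator/reward."""

    def __init__(self, policy_pretrain_steps=1000, disc_pretrain_steps=1000):
        self.pretrain_finished = False        # policy warmup done?
        self.disc_pretrain_finished = False   # reward/discriminator warmup done?
        self.policy_pretrain_steps = policy_pretrain_steps
        self.disc_pretrain_steps = disc_pretrain_steps
        self.step = 0

    def warmup_step(self):
        if not self.pretrain_finished:
            self.imitate_policy_step()        # supervised step on expert actions
            if self.step >= self.policy_pretrain_steps:
                self.pretrain_finished = True
                self.step = 0
        elif not self.disc_pretrain_finished:
            self.train_reward_step()          # train the reward model on expert vs. generated tuples
            if self.step >= self.disc_pretrain_steps:
                self.disc_pretrain_finished = True
        self.step += 1

    def imitate_policy_step(self):
        pass  # placeholder for a behaviour-cloning update

    def train_reward_step(self):
        pass  # placeholder for a discriminator/reward update
```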
Thanks for your response, I get it!!
Hey, thanks for your reply. The PPO algorithm of your method works pretty well and achieves the same score as in your paper. However, the "Human", "Disc", and "AIRL" methods still perform very poorly. I have checked the code thoroughly but still have no clue. I believe this is not an issue of PPO parameters; actually, the log file you sent me has the same parameters as your current ones, which means there is nearly no change when you run AIRL.
I appreciate your help. I think this bug happens because of some small issue, like a training flag, etc. Have a good day!
Hello, I am wondering whether PPOSIL works, since I encountered a KeyError (missing some configuration) when I tried to run pporil. Or is PPO + GAIL (AIRL) also rule_ppo?
still rule_ppo.
Thank u very much.
For PPO(human), just use the original reward value in the batch and do not call the function `replace_reward_batch`. The other parts are the same as PPO(GAN-VAE). These are some log files from PPO(human), AIRL, and GAIL; you can take a look at the hyper-parameters. According to my observations, I don't think GAIL and AIRL can work without teacher forcing. So basically you need to call `self.imitate_train()` every several frames to stabilize the adversarial process. Otherwise, these two methods will quickly drop to a very low success rate (around 0.3), regardless of the success rate after warmup.
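For concreteness, a sketch of the periodic teacher-forcing idea. Only `imitate_train()` corresponds to a name from this thread; the interval and the other method names are invented for illustration:

```python
IMITATE_EVERY = 5  # hypothetical interval, in training iterations

def adversarial_training_loop(agent, num_iterations=1000):
    """Alternate PPO updates with occasional supervised (teacher-forcing) updates.

    `agent` is assumed to expose collect_batch(), update_discriminator(batch),
    ppo_update(batch) and imitate_train(); only the last one is a name
    mentioned in the thread.
    """
    for it in range(num_iterations):
        batch = agent.collect_batch()
        agent.update_discriminator(batch)   # GAIL/AIRL reward-model step
        agent.ppo_update(batch)             # policy step using the learned reward
        if it % IMITATE_EVERY == 0:
            agent.imitate_train()           # teacher forcing keeps the policy near the expert
```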
U are welcome.
Sorry, may I ask what I should do if I do not want to use a pre-trained reward model but just train PPO and GAIL alternately? The code requires the pre-trained reward model, and I want to see the performance when training GAIL and PPO at the same time.
Comment out lines 374 to 381 and you get PPO(human). You need to load a pre-trained policy net (line 183) to get reasonable performance with PPO(human). Set `self.reward_type = 'DISC'` and bring back line 420 to get GAIL with teacher forcing; comment out line 420 to get GAIL without teacher forcing. Vanilla GAIL may not work because of unstable training.
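Schematically, the reward-type switch being described amounts to something like this (a sketch only; the actual code lives at the line numbers cited above in `ppo.py`):

```python
def select_reward(batch_reward_human, disc_reward, reward_type='HUMAN'):
    """Illustrative reward switch: 'DISC' mirrors the GAIL setting from this thread."""
    if reward_type == 'DISC':
        return disc_reward        # reward from the discriminator (GAIL/AIRL)
    return batch_reward_human     # original environment/human reward -> plain PPO(human)
```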
Yep, I noticed that I need to load a pre-trained policy net (line 183), but what should I do to get it? I want to try different settings and dimensions for policy learning, and the provided pre-trained policy model is not suitable. So I am wondering, can I train this myself and set load_pretrained_policy to false?
Yes, you can pre-train a policy with your setting. You just need to call `self.algorithm.pretrain()` (line 222, `agent/__init__.py`) before you load the RL part. You can also do this by calling `self.imitate_train()` in `ppo.py`. After the pre-training, save the policy net (line 359, `ppo.py`).
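In other words, something along these lines. The attribute name `policy_net` and the save path are assumptions; only `pretrain()` is a name from this thread:

```python
import torch

def pretrain_and_save(algorithm, save_path='./saved_policy/policy_pretrain.mdl'):
    """Sketch: run the supervised warmup, then dump the policy net's weights.

    `algorithm.policy_net` and `save_path` are placeholders; adjust them to
    the actual attribute and location used in the repo.
    """
    algorithm.pretrain()                                   # behaviour-cloning / imitation warmup
    torch.save(algorithm.policy_net.state_dict(), save_path)
```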
Cool. When I try to modify some code, I find there is no data preprocessing code for `test.sa_alldomain_withnext.json`, `train.sa_alldomain_withnext.json`, and so on. I am wondering, can you provide this code? Also, can you explain more about these files, for example why we need `state_convlab`, `state_onehot`, and so on?
To extract expert tuples from the MultiWOZ data, you can take a look at `modules/policy/system/multiwoz/vanilla_mle/dataset_reader.py`. `state_convlab` is the original state representation from ConvLab. `state_onehot` is another format for the same state; I convert it to a one-hot representation based on the meaning of each position in `state_convlab`. You can ignore `state_onehot` if you just want to use the original ConvLab representations.
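As a toy illustration of that position-wise one-hot idea. The segment layout below is invented for the example and is not ConvLab's actual state layout:

```python
import numpy as np

# Hypothetical layout: each entry is (number_of_positions, number_of_possible_values).
SEGMENTS = [(1, 3), (1, 4), (2, 2)]  # invented for illustration only

def to_onehot(state_convlab):
    """Expand each categorical position of a flat state vector into a one-hot block."""
    onehot, idx = [], 0
    for length, num_values in SEGMENTS:
        for _ in range(length):
            value = int(state_convlab[idx])
            block = np.zeros(num_values)
            block[value] = 1.0
            onehot.append(block)
            idx += 1
    return np.concatenate(onehot)

# Example: a 4-position state -> one-hot vector of size 3 + 4 + 2 + 2 = 11
print(to_onehot(np.array([2, 1, 0, 1])))
```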
Is this wrong? Should it return `loss`?
You can change it to `loss`. Since all the gradients have already been updated, it will not affect the algorithm itself; I just use the return value to check whether the loss is decreasing.
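In generic PyTorch terms, what is being described looks like the following (a sketch, not the repo's actual function):

```python
def train_step(model, optimizer, loss_fn, batch):
    """Sketch: the returned value is only for logging; gradients are already applied."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch['inputs']), batch['targets'])
    loss.backward()
    optimizer.step()
    return loss.item()  # detached scalar, purely for monitoring whether the loss decreases
```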
Sorry to bother you again. When I try to run my own experiments, I find that the avg success is around 0.2, so I checked the code. May I ask what I should do if I just want to alternately train PPO and GAIL to get the performance reported in your paper, without using a pre-trained reward model?
At lines 374 and 375, the human reward in the sampled batch is replaced with the provided reward (from the pre-trained model). The easiest way is to comment out these two lines, and you will get PPO. For GAIL, you should set the reward type (line 376) to 'DISC'. But for both methods, you need to slightly warm up the policy net before loading the RL part. You can find more details in the posts above.
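A sketch of what that reward replacement amounts to; the batch field names and the reward-model call signature are assumptions, not the repo's actual code:

```python
import torch

def replace_batch_reward(batch, reward_model):
    """Overwrite the environment/human rewards in a sampled batch with
    rewards predicted by a learned reward model (field names assumed)."""
    with torch.no_grad():
        learned_r = reward_model(batch['states'], batch['actions'])  # shape: (batch_size,)
    batch['rewards'] = learned_r
    return batch
```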
Should I comment out this line, since I do not have the reward model before I run the code?
The avg success only reaches 0.8 after 20k frames, unlike in the original paper. I set the reward type to DISC and load_pretrained_policy to false.
The possible reason is that you have warmed up your policy net too many times (based on the image you shared, GAIL is not working yet because of the low Disc reward). You need to make sure all models have the same starting point. It's better to have an overall picture of ConvLab before you modify the details. Be careful with some flags, such as `self.disc_pretrain_finished` and `self.pretrain_finished`, and functions, such as `self.imitate_train()` and `self.pretrain()`. There is a log file for GAIL from my training; you can take a look at it if you have time.
When I try to run the wdqn code, I find some errors in `dqn.py`, in `WarmDQN(base).__init__()`, and it fails to train wdqn:
```python
if self.memory_spec['warmup_memory_path'] != '':
    import pickle
    self.body.warmup_memory = pickle.load(open(self.memory_spec['warmup_memory_path'], 'rb'))
```
Since there is no `warmup_memory_path` parameter in demo.json, I just commented out this code, and it is running now. Am I doing this correctly?
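For what it's worth, an alternative to commenting the block out would be a guard like the following drop-in sketch, assuming `memory_spec` behaves like a plain dict, so a missing `warmup_memory_path` key is simply skipped instead of raising a KeyError:

```python
import pickle

# Sketch: only attempt the load when the optional config key is present and non-empty.
warmup_memory_path = self.memory_spec.get('warmup_memory_path', '')
if warmup_memory_path:
    with open(warmup_memory_path, 'rb') as f:
        self.body.warmup_memory = pickle.load(f)
```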