ConvLab / ConvLab-3


Question about training and evaluation of RL System Policy #189

Closed JamesCao2048 closed 8 months ago

JamesCao2048 commented 8 months ago

Hi, I have found that the RL policy can be trained and evaluated in pipelines with different combinations of components. Here are several configurations I found:

  1. Basic setting. UserAct -> SystemRuleDST -> SystemRLPolicy -> SystemAct -> UserRLPolicy -> UserAct.
  2. +SystemNLU. UserUtterance -> SystemNLU/Joint DST -> SystemRLPolicy -> SystemAct -> UserRLPolicy -> UserUtterance
  3. +UserNLU. UserAct -> SystemRuleDST -> SystemRLPolicy -> SystemNLG -> SystemUtterance -> UserNLU -> UserRLPolicy -> UserAct
  4. Full setting. UserUtterance -> SystemNLU/Joint DST -> SystemRLPolicy -> SystemNLG -> SystemUtterance -> UserNLU -> UserRLPolicy -> UserNLG -> UserUtterance.
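
For concreteness, here is a rough sketch of how I imagine the basic and full settings would be wired up with PipelineAgent and BiSession. The module paths and constructor arguments are from memory and simplified (each component normally needs its own config/checkpoint), and I use the rule-based user policy as a stand-in for the UserRLPolicy:

```python
from convlab.dialog_agent import BiSession, PipelineAgent
from convlab.dst.rule.multiwoz import RuleDST
from convlab.nlg.template.multiwoz import TemplateNLG
from convlab.nlu.jointBERT.multiwoz import BERTNLU
from convlab.policy.ppo import PPO
from convlab.policy.rule.multiwoz import RulePolicy

# 1. Basic setting: both agents exchange dialogue acts only, so no NLU/NLG noise.
#    (RulePolicy(character='usr') stands in for the UserRLPolicy here.)
sys_agent = PipelineAgent(None, RuleDST(), PPO(), None, name='sys')
usr_agent = PipelineAgent(None, None, RulePolicy(character='usr'), None, name='user')

# 4. Full setting: both sides communicate in natural language, so NLU and NLG
#    errors enter the loop on both sides.
sys_agent_full = PipelineAgent(BERTNLU(), RuleDST(), PPO(),
                               TemplateNLG(is_user=False), name='sys')
usr_agent_full = PipelineAgent(BERTNLU(), None, RulePolicy(character='usr'),
                               TemplateNLG(is_user=True), name='user')

# Run dialogues between the two agents (an evaluator could be passed instead of None).
sess = BiSession(sys_agent=sys_agent_full, user_agent=usr_agent_full,
                 kb_query=None, evaluator=None)
```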

My questions are:

  1. For system policy training: Should I train the policy with the basic setting or the full setting? I think if the policy is trained with the basic setting, it may not adapt well to the full setting because of errors from the other models (like the DST or NLG). However, if the policy is trained with the full setting, training may converge slowly because the other models introduce a lot of noise, which makes optimization harder. Moreover, a policy trained in the full setting can only be deployed together with those same models when integrated into the pipeline. In the ConvLab-3 RL training tutorial, the policy is trained with the basic setting.
  2. For system policy evaluation: Should I evaluate the policy with the basic setting or the full setting? I think if the policy is evaluated with the basic setting, it is a component-level policy evaluation, and the results may not reflect its performance in a full pipeline. If the policy is evaluated with the full setting, it is a system-level evaluation, but the results are also affected by other components like the NLU and NLG.
  3. Are the +SystemNLU and +UserNLU settings meaningful for some training or evaluation cases?

If I have made any mistakes, please correct me. Looking forward to your reply, thanks!

ChrisGeishauser commented 8 months ago

Hi @JamesCao2048,

thanks for your comments; you observed everything very well! Let me comment on your questions:

  1. You correctly identified the advantages and disadvantages of the basic vs. the full setting. In my research on dialogue policy optimization with RL, I have always used the basic setting. As you observed, the basic setting is a "clean" setup without any noise stemming from NLU predictions. This lets you focus on the pure RL algorithm (but without checking robustness against noise). It also lets you train policies "from scratch", i.e. not necessarily pre-trained on data, where performance differences are easier to observe. Moreover, training with the basic setting first is a great sanity check to make sure the RL optimization works properly before you move on to full training, where problems can also stem from the other components. As you said, the full setting has the advantage that the other components are already taken into account during RL training. This is preferable if your goal is high performance when interacting with humans, for instance. The NLG, for example, is typically trained with supervised learning on the dataset. If you train the policy in the basic setting, it will likely learn to take semantic actions that the NLG has not seen in the data, which leads to a distribution shift for the NLG later on and consequently worse performance. Because of this, training both together should lead to better performance, with the downside of slower training progress (see the environment sketch after this list for how the two settings differ in practice).
  2. I fully agree with you on that. It is not guaranteed that component-wise superiority translates to superiority of the whole system. For instance, dialogue state trackers often beat belief state trackers (which take uncertainty in the predictions into account) in terms of state-tracking accuracy. Nevertheless, belief state trackers can provide more information to the policy through that uncertainty, potentially boosting the overall performance of the system.
  3. Let's say you have a user policy that was trained with semantic system actions as input (such as TUS or GenTUS). If you now assume that humans have almost perfect NLU performance (at least much better than the NLU models), the +SystemNLU setting makes a lot of sense, because adding a UserNLU on top would just introduce additional noise. Regarding +UserNLU, the only use case I can come up with right now is testing the noise robustness of a user simulator :)
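
To make the difference from point 1 concrete, here is a rough sketch of the two training environments, reusing the hypothetical agents from the sketch in your question and assuming the Environment wrapper still follows the ConvLab-2-style signature Environment(sys_nlg, user_agent, sys_nlu, sys_dst); the exact signature and return values may differ in the current code:

```python
from convlab.dialog_agent.env import Environment

# usr_agent / usr_agent_full and the components below refer to the sketch in the
# question above; sys_policy is the PPO system policy being trained.

# Basic setting: the policy state is built by the rule DST from user dialogue acts,
# so no NLU/NLG noise enters the RL loop.
env_basic = Environment(None, usr_agent, None, RuleDST())

# Full setting: the system NLG verbalises the policy's actions and the system NLU
# parses the simulator's utterances, so their errors reach the policy during training.
env_full = Environment(TemplateNLG(is_user=False), usr_agent_full, BERTNLU(), RuleDST())

# The RL loop itself stays the same; only the environment changes.
state = env_basic.reset()
done = False
while not done:
    action = sys_policy.predict(state)
    state, reward, done = env_basic.step(action)
```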

I hope my answers were helpful! Let me know if you have further questions.

Best, Chris

JamesCao2048 commented 8 months ago

Hi @ChrisGeishauser, thank you so much for your very helpful and fast reply! This helps me a lot!!!

I have a follow-up question about point 3. As you said,

If you now assume that humans have an almost perfect NLU performance (at least much better than the NLU models), the setting +SystemNLU makes a lot of sense because adding a UserNLU in addition would just introduce additional noise.

So if we want to remove the noise introduced by the UserNLU, is +SystemNLU a better evaluation setting than the full setting? Because we do not want errors caused by the UserNLU to make the dialogue fail.

But for the UserRulePolicy, I think a UserNLU is necessary, because it seems it cannot receive system dialogue acts directly?

ChrisGeishauser commented 8 months ago

Hi @JamesCao2048,

awesome, happy to help!

So if we want to remove the noise introduced by the UserNLU, is +SystemNLU a better evaluation setting than the full setting? Because we do not want errors caused by the UserNLU to make the dialogue fail.

Yes, I think so! But of course this only holds if the user does not take the system utterance as input.

But for the UserRulePolicy, I think a UserNLU is necessary, because it seems it cannot receive system dialogue acts directly?

The UserRulePolicy should obtain system dialogue acts directly!
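
For instance (with the same caveat as above about exact module paths), a semantic-level user agent is simply built without NLU and NLG, so the rule policy consumes the system's dialogue acts directly:

```python
from convlab.dialog_agent import PipelineAgent
from convlab.policy.rule.multiwoz import RulePolicy

# No user NLU/NLG: the rule-based user policy receives and emits dialogue acts directly.
usr_agent = PipelineAgent(None, None, RulePolicy(character='usr'), None, name='user')
```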

JamesCao2048 commented 8 months ago

Hi @ChrisGeishauser, your reply perfectly answered my questions. I will try the UserPolicy with semantic acts later. Thanks a lot!!

Best regards, James Cao