Hi @Snnzhao, sorry, I only saw this question just now. Yes, there is a gap between the two phases for the same metric, for the following reasons:
For collecting rollouts, policy predictions are not deterministic (for exploration); please check this. Since we have multiple policy networks, sampling from their distributions can change the conversation trajectories a lot.
For evaluation, policy predictions are deterministic (for exploitation); please check this. In the evaluation phase, we try to use the best policies we have so far.
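To make the difference concrete, here is a minimal sketch, not the repository's actual code. The names `select_action` and `accuracy` and the probabilities are hypothetical, and the sketch assumes a single categorical policy whose most likely action is the correct one. It compares sampling from the distribution (rollout-style) with taking the argmax (evaluation-style).

```python
import random


def select_action(probs, deterministic, rng=random):
    """Pick an action index from a categorical distribution.

    deterministic=True  -> argmax (evaluation-style, exploitation)
    deterministic=False -> sample (rollout-style, exploration)
    """
    if deterministic:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def accuracy(probs, correct_action, deterministic, episodes=10_000, seed=0):
    """Fraction of episodes in which the selected action is the correct one."""
    rng = random.Random(seed)
    hits = sum(
        select_action(probs, deterministic, rng) == correct_action
        for _ in range(episodes)
    )
    return hits / episodes


# Hypothetical policy output: action 0 is correct and gets 60% of the mass.
probs = [0.6, 0.3, 0.1]
eval_acc = accuracy(probs, correct_action=0, deterministic=True)
rollout_acc = accuracy(probs, correct_action=0, deterministic=False)
print(f"evaluation acc: {eval_acc:.2f}, rollout acc: {rollout_acc:.2f}")
```

In this toy setting the argmax always picks the correct action, while sampling picks it only about 60% of the time. If several policies sample independently, the chance that all of them pick their most likely action at a given step is the product of those probabilities (for example 0.6 × 0.6 = 0.36 for two such policies), which is one way the gap can grow.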
Hope this can address your concern, and please let me know if you have more questions. Thanks!
When training online, there is a big gap between the acc for "collecting rollout" and the acc for evaluation. This is puzzling; why does this happen?