vedantbhatia opened this issue 3 years ago

**Description**

Can I please have an example of how to use the `search_rollin` and `search_rollout` parameters, especially for LOLS? Where and how do I supply these policies?
Hi @vedantbhatia, thank you for your question.

When you set these parameters, VW chooses between two policies that already exist: the "reference" policy and the current "learned" policy backed by the model. The reference policy is determined by how the `search_task` is configured; think of it as the oracle, i.e. the policy that supplies the reference trajectories. If you are training on logged data, it is the implicit policy behind the logs; if you are writing a custom task via the hook-task mechanism, you specify the oracle action directly at each state.
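For the hook-task case, here is a minimal sketch of where that oracle action goes, based on the `pyvw` `SearchTask` API from the Python learning-to-search tutorial. The POS-tagging data, label set, and conditioning scheme are purely illustrative, and note that newer releases of the `vowpalwabbit` package rename the `pyvw.vw` entry point to `Workspace`:

```python
from vowpalwabbit import pyvw

# Illustrative POS tags; search actions in VW are 1-based.
DET, NOUN, VERB, ADJ = 1, 2, 3, 4
my_dataset = [
    [(DET, "the"), (NOUN, "monster"), (VERB, "ate"),
     (DET, "a"), (ADJ, "big"), (NOUN, "sandwich")],
    [(DET, "the"), (NOUN, "sandwich"), (VERB, "was"), (ADJ, "tasty")],
]

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        # auto-compute Hamming loss; handle conditioning features automatically
        sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

    def _run(self, sentence):
        output = []
        for n in range(len(sentence)):
            tag, word = sentence[n]
            with self.vw.example({"w": [word]}) as ex:
                # `oracle=tag` is the hook through which the reference
                # policy's action is supplied at this state.
                pred = self.sch.predict(examples=ex, my_tag=n + 1, oracle=tag,
                                        condition=[(n, "p"), (n - 1, "q")])
                output.append(pred)
        return output

vw = pyvw.vw("--search 4 --quiet --search_task hook")
task = vw.init_search_task(SequenceLabeler)
for _ in range(10):  # several passes over the tiny dataset
    task.learn(my_dataset)
print(task.predict([(DET, w) for w in "the monster was tasty".split()]))
```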
The parameters `search_rollin` and `search_rollout` specify how to choose between the reference/oracle policy and the learned policy, or some mixture of the two:
| Roll* Input | Equivalents | Description |
|---|---|---|
| `ref` | `oracle` | Use actions provided by the reference (logged or imitation-target) policy |
| `learned` | `policy` | Use actions provided by the model |
| `mix_per_roll` | `mix` | Choose a policy randomly at the beginning of each trajectory considered |
| `mix_per_state` | | Choose a policy randomly every time an action is required |
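For LOLS specifically, the algorithm rolls in with the learned policy and rolls out with a per-trajectory mixture of the reference and learned policies (Chang et al., 2015), so under the table above the configuration would look roughly like the sketch below. I use the `policy` spelling since the table lists it as equivalent to `learned`; `SequenceLabeler` and `my_dataset` refer to the hook-task sketch earlier in this thread:

```python
from vowpalwabbit import pyvw

# LOLS-style configuration: roll in with the learned policy,
# roll out with a per-rollout mixture of reference and learned.
vw = pyvw.vw("--search 4 --quiet --search_task hook "
             "--search_rollin policy --search_rollout mix_per_roll")

task = vw.init_search_task(SequenceLabeler)  # hook task from the sketch above
task.learn(my_dataset)
```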
@olgavrou I would like to work on this. Could you please guide me on how to move forward with it?