facebookresearch / Pearl

A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.
MIT License

Question: train_via_uniform_data #65

Closed cryptexis closed 8 months ago

cryptexis commented 9 months ago

Hi everyone,

https://github.com/facebookresearch/Pearl/blob/main/pearl/utils/scripts/cb_benchmark/run_cb_benchmarks.py#L72

Here it specifically states that the data has been collected by acting according to a uniform policy. Suppose that were not the case, i.e., the logging policy was not picking actions uniformly at random, and we had a log collected that way. How would this function change?

cryptexis commented 9 months ago

I think I understood how I can change this. What is suspicious there is that the agent is still fed data points one by one, while previously the batch_size was set to 128. Could you please explain?

Yonathae commented 9 months ago

Hey @cryptexis, thanks for the question. The cardinality of the label alphabet can be relatively large (> 20). In this code we balanced the data as follows:

- For 1/4 of the datapoints we use the true label (which gets a reward of 1).
- For the other 3/4 of the datapoints we draw a label uniformly at random (most likely it is not the true label, in which case the reward is 0).

To get a truly uniform dataset, simply use: action_ind = random.choice(range(action_space.n))

However, since in the UCI datasets the label cardinality can be that large, a truly uniform sample may deteriorate the performance of the learned reward model due to class imbalance.
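
For illustration only, here is a minimal standalone sketch of the two logging schemes described above (the balanced 1/4 vs 3/4 split and the truly uniform one). This is not the benchmark code itself; true_label and n_actions are placeholder names.

```python
import random

def sample_action_balanced(true_label: int, n_actions: int) -> tuple[int, float]:
    # Balanced scheme: ~1/4 of datapoints use the true label (reward 1);
    # for the other ~3/4 a label is drawn uniformly, so the reward is usually 0.
    if random.random() < 0.25:
        action_ind = true_label
    else:
        action_ind = random.choice(range(n_actions))
    reward = 1.0 if action_ind == true_label else 0.0
    return action_ind, reward

def sample_action_uniform(true_label: int, n_actions: int) -> tuple[int, float]:
    # Truly uniform logging policy, as suggested above.
    action_ind = random.choice(range(n_actions))
    reward = 1.0 if action_ind == true_label else 0.0
    return action_ind, reward
```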

Please let us know if there are any additional questions.

cryptexis commented 9 months ago

@Yonathae - thanks for the answer. So if I have collected data with click/non_click rewards, I can basically remove this random part and sample from my dataset instead, correct?

Also, could you please answer my question about the batch_size?

Yonathae commented 9 months ago

@cryptexis yes, exactly. Remove it and use action_ind = random.choice(range(action_space.n)) (if I understand correctly, in your case action_space.n=2, so the dataset should be balanced).

++ Yes, in this code we generate the dataset by interacting with the environment, so the data collection procedure happens sequentially, i.e., one datapoint at a time. When we train the reward model we use batches, and there we set batch_size to a value of our choice.
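
As a rough, self-contained illustration of this split between one-by-one collection and batched training (a sketch with synthetic placeholder data, not Pearl's actual replay buffer or reward model):

```python
import random
import torch

n_actions, context_dim = 6, 8

# Data collection happens sequentially: one (context, action, reward) tuple at a time.
buffer = []
for _ in range(10_000):
    context = torch.randn(context_dim)     # placeholder context features
    action = random.randrange(n_actions)   # logged action index
    reward = float(random.random() < 0.1)  # placeholder observed reward
    buffer.append((context, action, reward))

# Training uses batches of whatever size we choose (e.g. 128),
# independently of how the data was collected.
batch_size = 128
random.shuffle(buffer)
for start in range(0, len(buffer), batch_size):
    batch = buffer[start:start + batch_size]
    contexts = torch.stack([b[0] for b in batch])
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch])
    # ... a reward model would take one gradient step on this batch here
```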

cryptexis commented 9 months ago

@Yonathae thanks for the answers... I am still confused about certain things. Let me describe my setup first:

I have 6 different ads to show to the customer (I optimize for conversion value), and the reward for me is either CONVERSION_VALUE or 0. The task is to find the best-performing ad. Each ad has its features in the dataset, and I have set it up with action_space.n=6. The dataset I have collected has the following properties: 90% of users are shown only one ad (the default one), and only 10% of users are shown one of the ads at random.

Questions:

Yonathae commented 9 months ago

hey @cryptexis,

I believe it would be worthwhile to check out the contextual bandits tutorial:

https://github.com/facebookresearch/Pearl/blob/main/tutorials/contextual_bandits/contextual_bandits_tutorial.ipynb

The challenge in the contextual bandit setting is the subtle tradeoff between exploration and exploitation. If we only have offline data, then a possible approach is to learn a reward model r(x, a), which can be thought of as solving a regression problem. Then, the final policy we evaluate is the greedy policy with respect to the learned reward model. This is what we implemented in the code.
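
A minimal sketch of that recipe on synthetic data, with a linear least-squares model standing in for the learned reward model (Pearl uses its own neural reward model; all names and numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, context_dim, n_samples = 6, 8, 5000

# Logged offline data: (context, logged action, observed reward); synthetic here.
contexts = rng.normal(size=(n_samples, context_dim))
actions = rng.integers(0, n_actions, size=n_samples)
rewards = rng.binomial(1, 0.1, size=n_samples).astype(float)

# Regression step: learn r_hat(x, a) from one-hot action features via least squares.
features = np.concatenate([contexts, np.eye(n_actions)[actions]], axis=1)
weights, *_ = np.linalg.lstsq(features, rewards, rcond=None)

def greedy_action(context: np.ndarray) -> int:
    # Final policy: greedy with respect to the learned reward model,
    # i.e. argmax over actions of r_hat(context, a).
    candidates = np.concatenate(
        [np.tile(context, (n_actions, 1)), np.eye(n_actions)], axis=1
    )
    return int(np.argmax(candidates @ weights))

print(greedy_action(rng.normal(size=context_dim)))
```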

Regarding your questions:

cryptexis commented 9 months ago

@Yonathae - thank you! Seems clear.

One more question - in the benchmarks it seems that for offline_learning the exploration_module is set to NoExploration(). Is there a specific reason for that?

Yonathae commented 9 months ago

@cryptexis The exploration_module only affects the agent.act() procedure, namely, when the agent takes an action and needs to explore. In the offline_learning case we only train on the offline data and want to evaluate the "greedy" action relative to the learned reward model. For this reason, we set NoExploration(), which is equivalent to saying "take the greedy action with respect to the learned reward function".

cryptexis commented 9 months ago

Thanks @Yonathae,

I am confused for the following reason. Let's say I used online learning with ThompsonSampling as the exploration_module, and then I want to deploy the learned policy into production. My assumption was that since the policy was trained with ThompsonSampling, it will still not act "greedy" when it is deployed. Meaning that most of the time it will act according to the optimal arm, but if the bell curve is not sufficiently tight around that arm it will use sub-optimal arms once in a while, so it will keep exploring non-optimal arms occasionally. The big assumption here is that I will not feed data to the model once it is deployed.

What you're suggesting with NoExploration is that there is no way the deployed policy will ever explore, since it has learned from the data which arm is optimal.

Or is my understanding of deploying the policy completely wrong, and the industry standard is to always go with the greedy actions as soon as the agent has learned which arm is optimal?
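
To make the contrast in this question concrete, here is a hedged sketch of per-arm Thompson sampling at deployment (which keeps sampling from the posterior and therefore occasionally picks non-optimal arms) versus a greedy, NoExploration-style choice (which always picks the argmax). A Beta-Bernoulli posterior over 6 arms is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 6

# Per-arm Beta posteriors learned during training (successes, failures); illustrative values.
alpha = np.array([5.0, 2.0, 40.0, 3.0, 4.0, 2.0])
beta = np.array([50.0, 60.0, 60.0, 55.0, 58.0, 61.0])

def thompson_action() -> int:
    # Thompson sampling at deployment: draw a conversion rate per arm from its
    # posterior and act on the draw, so sub-optimal arms are still picked sometimes.
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

def greedy_action() -> int:
    # NoExploration-style deployment: always take the arm with the highest
    # posterior mean; it never explores.
    return int(np.argmax(alpha / (alpha + beta)))

picks = np.bincount([thompson_action() for _ in range(1000)], minlength=n_actions)
print("greedy arm:", greedy_action(), "| Thompson pick counts:", picks)
```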

Yonathae commented 9 months ago

@cryptexis I wouldn't say there's a standard around this, but in this problem, with no other assumptions, that would be the correct way to evaluate the performance of the policy.

If the agent learned a reward function \hat{r}, and it has no other information it can use, a reasonable choice is to take the greedy action. If you take exploratory actions during evaluation you may degrade the performance for no reason (recall that we don't use this exploratory data in the future).

Note that if you wanted to evaluate a policy that performs, e.g., epsilon-greedy exploration, it would be correct to use the epsilon-greedy policy.
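
As a small follow-up to that last point (evaluate the policy you actually intend to deploy), a hedged sketch comparing greedy and epsilon-greedy action selection against the same learned reward model; r_hat below is a stand-in, not Pearl's API:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, context_dim = 6, 8

def r_hat(context: np.ndarray, action: int) -> float:
    # Stand-in for the learned reward function \hat{r}(x, a).
    return float(context[action % context_dim])

def greedy_policy(context: np.ndarray) -> int:
    # Evaluate this if the deployed policy is greedy (NoExploration).
    return int(np.argmax([r_hat(context, a) for a in range(n_actions)]))

def epsilon_greedy_policy(context: np.ndarray, epsilon: float = 0.1) -> int:
    # Evaluate this instead if the deployed policy is epsilon-greedy.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return greedy_policy(context)

context = rng.normal(size=context_dim)
print(greedy_policy(context), epsilon_greedy_policy(context))
```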