Hi! Could you give a brief idea as to what should be included in those files?
Things to be included:

- `rollout_size`
- using different architectures (`"cnn"` and `"mlp"`)

Feel free to add anything you think would be helpful for a beginner to understand how to use our repo. We'll iron out the details when you put up a PR.
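(For illustration, a minimal sketch of the two architecture options could look like the snippet below. The commented-out `rollout_size` keyword and the Atari env name are assumptions based on this thread, not a verified API.)

```python
from genrl import VPG
from genrl.environments import VectorEnv

# MLP policy for low-dimensional observations (e.g. CartPole)
agent = VPG("mlp", VectorEnv("CartPole-v1"))

# CNN feature extractor for pixel observations (Atari-style envs);
# the rollout_size argument here is assumed -- check the agent's signature
# agent = VPG("cnn", VectorEnv("Pong-v0"), rollout_size=2048)
```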
I'll work on adding the docs for using a2c if no one has taken it up already?
When I tried to run A2C on 'CartPole-v0', the policy loss and policy entropy just go to 0 after a few epochs (around 15), and thus the performance gets stuck in a local optimum with approximately the same mean reward from there onwards. Is this happening due to vanishing gradients?
Is the mean reward dropping? Also run `trainer.evaluate` for a few episodes to check whether the final mean reward is 200.0 or not. Our logger rounds off the loss values. The actual `policy_loss` can go to very small values like 1e-6; that's alright (same with entropy).
Yes, the mean reward is also dropping: it starts from around 23.67 and stabilises at around 9.3. `trainer.evaluate` also shows the same.
Does it go to around 160-180 in the middle? It's a known issue that our A2C is unstable and suddenly drops in performance midway. If it's not training at all, you can raise an issue for that separately.
Oh yes, most of the time it does go to 170-180ish in the middle for 2-3 episodes and then drops sharply. What is the reason for this instability?
Yeah, so it's fine for now. Not sure why our A2C collapses all of a sudden. There isn't any problem with the logic.
I'd like to write the docs for VPG, if that's okay
For anyone working on this, please add the files to `docs/source/tutorials`, not `docs/source`.
I can take up using multi-armed bandits and contextual bandits.
What is the definition of `timestep` which is displayed on the console during training? Is it the timestep from the start of training (I think it looked like it), or is it the timestep from the start of an epoch?

Edit: OK, now I am almost certain it is the former; I still want to confirm it to be sure.
It's the timestep from the very beginning of training.
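(To illustrate the convention with a purely illustrative loop, not the actual trainer code:)

```python
# "timestep" is cumulative over the whole training run, not per epoch
epochs, rollout_size = 3, 5
timestep = 0
for epoch in range(epochs):
    for _ in range(rollout_size):
        timestep += 1              # one environment step per increment
    print(f"epoch={epoch}  timestep={timestep}")
# prints timestep=5, 10, 15 -- it keeps accumulating across epochs
```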
```python
import gym
from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v1")                      # vectorised CartPole env
agent = VPG("mlp", env)                             # VPG agent with an MLP policy
trainer = OnPolicyTrainer(agent, env, epochs=1000)
trainer.train()
```
I tried running VPG with this code. The `mean_reward` reaches a maximum of just 409.6, and it then stays there (kind of converges to 409.6). Not sure why exactly that specific number, but I ran it multiple times and it's always the same.

Is there a way to get/plot the max reward in each epoch, sort of a thing?

UPD: I ran it for 5000 epochs and it reaches a max `mean_reward` of 409.6 by ~2000 epochs; after that it starts crashing down and goes to the 160ish range.
At the end, add a `trainer.evaluate()`. That will make sure that the greedy policy is followed each time. Should give 500.
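(For reference, the only change needed in the snippet above is one extra line after training:)

```python
trainer.train()
trainer.evaluate()  # greedy rollout; per the discussion above this should reach ~500
```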
Okay, but from what I understand it'll use the learnt policy, so even if it reached 500 just once during the learning phase it'll be able to achieve 500 during a greedy eval. But when I ran it for 5000 epochs it doesn't converge anywhere close to 500 (I may be wrong, but this shouldn't happen, right?).
Attached the log for reference vpg-genrl.txt
What `trainer.evaluate()` does is make sure that whenever an action is selected, the deterministic policy is followed (see the VPG implementation). Not sure specifically about VPG, but it's likely that it doesn't always follow a deterministic policy during training unless it's explicitly set to do so, like in `evaluate`.
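(A generic sketch of the difference, not genrl's exact code: during training the action is sampled from the policy distribution, while a greedy evaluation takes the argmax.)

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([0.3, 2.5, 1.0])           # policy output for one state

# training-time behaviour: stochastic sampling
sampled_action = Categorical(logits=logits).sample()

# evaluation-time behaviour: deterministic / greedy
greedy_action = torch.argmax(logits)
```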
Oh okay, thanks! I'll look into the implementation again. My question was: even if it is following a stochastic policy, the policy should still improve over time (over the course of `trainer.train`), right?
Yes, it should. But the stochasticity maybe remains the same. That way the agent may have already learned the optimal policy but will still continue to explore in the same proportion. It's weird that it gets stuck at 409.6, but it's not a problem if the greedy policy converges.
Oh okay! I'll have to read up a bit more on this to get a better understanding, thanks!
> I can take up using multi-armed bandits and contextual bandits.
@Darshan-ko can do this after #176 is merged (which will mostly happen today). There are a lot of significant changes for bandits in this PR.
@threewisemonkeys-as Ok cool.
It's merged now, btw.
When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.
When we use CNNs as feature extractors in RL algorithms, we generally learn the CNN + MLP during training of the RL agent. So when you do `loss.backward()` for either the policy or the value function, this loss gets propagated all the way back to the CNN and optimises its representation specifically for that agent. So there is no need to train / pretrain it separately.

In fact, it would be pretty hard to train the CNN independently of the RL agent since you have no labeled data in this scenario.
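(A generic PyTorch sketch of that idea, not genrl's actual classes: the CNN feature extractor and the MLP head sit in one module, so a single `loss.backward()` / `optimizer.step()` updates both sets of parameters.)

```python
import torch
import torch.nn as nn

class CNNPolicy(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        # CNN feature extractor for 84x84 stacked frames
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # MLP head on top of the extracted features
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs):
        return self.head(self.features(obs))

policy = CNNPolicy(n_actions=6)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # covers CNN + MLP

obs = torch.randn(8, 4, 84, 84)   # batch of stacked Atari-style frames
loss = policy(obs).mean()          # stand-in for the actual RL loss
loss.backward()                    # gradients flow back into the conv layers too
optimizer.step()
```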
> When we use a CNN for Atari envs, we get a feature vector by applying that CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.
If you've noticed, we use the architecture `"cnn"` for CNN architectures for Atari envs. What that does is use a `CNNValue` from `deep/common/values.py`.
@threewisemonkeys-as But doesn't it seem counter-intuitive that optimizing the loss function of the policy and value functions can help in learning the CNN parameters? As in, is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?
@sampreet-arthi I got that, but I could not find an explicit loss function or `loss.backward()` for the CNN, so I was confused about the same.
See, your policy `pi` is supposed to be any parameterised function mapping states to actions. For deep RL it's usually some neural network, but it can even be a simple linear function which multiplies the state by a certain value (the parameter).

When we do policy optimisation (through the policy gradient) we basically look at how the policy is performing and compute gradients in the direction of better performance. These gradients are w.r.t. the parameters of the policy, and we update these parameters accordingly to make the policy perform better.

Now in the case of Atari, the policy consists of both a CNN and an MLP, so the parameters of both need to be updated when optimising the policy. Over the course of training, the CNN will eventually learn to give the optimum features required for the policy to do well.
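(To make that concrete, here is a toy REINFORCE-style update, not genrl code: the surrogate loss is differentiated w.r.t. every policy parameter, so if the policy were CNN + MLP both parts would receive gradients from the same step.)

```python
import torch
from torch.distributions import Categorical

# toy MLP policy; swap in a CNN front-end and the same update still applies
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 4)
dist = Categorical(logits=policy(state))
action = dist.sample()
advantage = torch.tensor(1.7)                         # placeholder advantage estimate

loss = -(dist.log_prob(action) * advantage).mean()    # policy gradient surrogate loss
loss.backward()                                       # every parameter of `policy` gets a gradient
optimizer.step()
```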
> is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?
No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.
> No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.
Oh ok, I get it. So if I understood correctly, this is because there is no particular objective for the CNN other than providing features on which we can train the policy and value function and find an optimal policy, right?
Yep
@threewisemonkeys-as Line 85 of `trainer.py` in `genrl.bandit` gives an `AttributeError: 'UCBMABAgent' object has no attribute 'update_db'` when training a `UCBMABAgent` agent on a `BernoulliMAB` bandit. I think this is because only contextual bandit agents have the `update_db` attribute. There is also an `IndexError: list index out of range` raised on line 91 of `trainer.py`.

I tried fixing the `AttributeError` by temporarily removing the call to `update_db` from `trainer.py` and then modified the arguments in the call to `update_params` on line 91, and the training works fine, ending with "Final Regret Moving Average: 0.324 | Final Reward Moving Average: 0.676".

What would be a better solution to this problem?
Are you transferring the database update into `update_params`? Because then this would only happen whenever we are updating the params, which can be undesirable (i.e. if we are building the db before updating params).

One way to fix this would be to check the type of the bandit using `isinstance` before calling `update_db`.
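(A rough sketch of that guard; the class names and exact call signatures are placeholders loosely based on this thread, not verified genrl code.)

```python
class MultiArmedBandit: ...      # placeholder base class
class ContextualBandit: ...      # placeholder base class

def training_step(trainer, context, action, reward, **kwargs):
    if isinstance(trainer.bandit, ContextualBandit):
        # only contextual-bandit agents maintain a transition database
        trainer.agent.update_db(context, action, reward)
        trainer.agent.update_params(action, kwargs.get("batch_size", 64))
    else:
        # MAB agents update directly from (context, action, reward)
        trainer.agent.update_params(context, action, reward)
```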
Also, if anyone is working on adding tutorials, I have done some restructuring of the docs in #217.
> Are you transferring the database update into `update_params`? Because then this would only happen whenever we are updating the params, which can be undesirable (i.e. if we are building the db before updating params)
No, I just deleted the `update_db` part from `trainer.py` because the agent I was training (UCBMAB) had no `update_db` attribute. I changed line 91 (the call to `update_params`) from

`self.agent.update_params(action, kwargs.get("batch_size", 64), train_epochs)`

to

`self.agent.update_params(context, action, reward)`

after checking the `update_params` args.
> One way to fix this would be to check the type of the bandit using `isinstance` before calling `update_db`
Oh ok got it.
Actually sorry, my bad. That trainer is specifically for CB agents. Since MAB agents have their own `learn` methods, they don't need an external trainer.

So there's no need to change the current trainer. One thing you can look at is making a trainer specifically for MAB agents. This would just require repurposing the `learn` method. This would be good for uniformity and would also ensure compatibility with the CLI.
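(Something along these lines, perhaps; the name `MABTrainer` and its interface are hypothetical, it just repurposes the agent's own `learn` method.)

```python
class MABTrainer:
    """Thin wrapper so MAB agents share a trainer-style interface."""

    def __init__(self, agent, bandit):
        self.agent = agent
        self.bandit = bandit

    def train(self, timesteps):
        # MAB agents already know how to interact with the bandit through
        # their own learn() method, so the trainer just forwards to it and
        # can add logging / CLI hooks on top.
        return self.agent.learn(timesteps)
```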
> So there's no need to change the current trainer. One thing you can look at is making a trainer specifically for MAB agents. This would just require repurposing the `learn` method. This would be good for uniformity and would also ensure compatibility with the CLI.
Yes, I will try doing that. I should add the trainer for MAB agents in the main `trainer.py` file, right?
Yeah
Also, the `context` in the multi-armed bandits has been set to `"tensor"`. Shouldn't it be `"int"`?
By default it is a tensor since they will mostly be used with the `cb_agents`, but if they need to be used with `mab_agents` then the type should be set to `int`.
@sampreet-arthi I think most of these have been completed, right?
Have saving/loading tutorials been added? And PPO?
Go to the `docs/source/usage/tutorials` directory and add separate `.md` files to explain the following:

When working on this issue, it is important to explain the algorithms as well and not just have what's present in the readme already.

Also add the entry for the tutorial in `docs/source/usage/tutorials/index.rst`.

EDIT: Changed `docs/source` to `docs/source/tutorials`.
EDIT2: Changed `docs/source/tutorials` to `docs/source/usage/tutorials`.