SforAiDl / genrl

A PyTorch reinforcement learning library for generalizable and reproducible algorithm implementations with an aim to improve accessibility in RL
https://genrl.readthedocs.io
MIT License

Usage explanatory docs #196

Open sampreet-arthi opened 4 years ago

sampreet-arthi commented 4 years ago

Go to the docs/source/usage/tutorials directory and add separate .md files to explain the following:

When working on this issue, it is important to explain the algorithms as well and not just have what's present in the readme already.

Also add an entry for the tutorial in docs/source/usage/tutorials/index.rst

EDIT: Changed docs/source to docs/source/tutorials EDIT2: Changed docs/source/tutorials to docs/source/usage/tutorials

Devanshu24 commented 4 years ago

Hi! Could you give a brief idea as to what should be included in those files?

sampreet-arthi commented 4 years ago

Things to be included:

Feel free to add anything you think would be helpful for a beginner to understand how to use our repo. We'll iron out the details when you put up a PR.

Darshan-ko commented 4 years ago

I'll work on adding the docs for using A2C if no one has taken it up already?

Darshan-ko commented 4 years ago

When I tried to run A2C on 'CartPole-v0', the policy loss and policy entropy just go to 0 after a few epochs (around 15), and the performance then gets stuck in a local optimum with approximately the same mean reward from there onwards. Is this happening due to vanishing gradients?

sampreet-arthi commented 4 years ago

Is the mean reward dropping? Also run trainer.evaluate for a few episodes to check whether the final mean reward is 200.0 or not. Our logger rounds off the loss values; the actual policy_loss can go to very small values like 1e-6, and that's alright (same with entropy).
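For reference, here is a minimal sketch of the train-then-evaluate flow being discussed, following the same API as the VPG example later in this thread (importing A2C from the top-level package is an assumption here):

from genrl import A2C
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = A2C("mlp", env)                             # advantage actor-critic with an MLP policy
trainer = OnPolicyTrainer(agent, env, epochs=100)
trainer.train()                                     # logs mean reward, policy loss, policy entropy, etc.
trainer.evaluate()                                  # greedy rollouts; CartPole-v0 caps the return at 200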

Darshan-ko commented 4 years ago

Yes, the mean reward is also dropping, starting from around 23.67 and stabilising at around 9.3. trainer.evaluate also shows the same.

sampreet-arthi commented 4 years ago

Does it go to around 160-180 in the middle? It's a known issue that our A2C is unstable and suddenly drops in performance midway. If it's not training at all, you can raise an issue for that separately.

Darshan-ko commented 4 years ago

Oh yes, most of the time it does go to 170-180ish in the middle for 2-3 episodes and then drops sharply. What is the reason for this instability?

sampreet-arthi commented 4 years ago

Yeah, so it's fine for now. Not sure why our A2C collapses all of a sudden. There isn't any problem with the logic.

Devanshu24 commented 4 years ago

I'd like to write the docs for VPG, if that's okay

sampreet-arthi commented 4 years ago

For anyone working on this, please add the files to docs/source/tutorials not docs/source

Darshan-ko commented 4 years ago

I can take up using multi armed bandits and contextual bandits.

Devanshu24 commented 4 years ago

What is the definition of timestep which is displayed on the console during training? Is it the time from the start of training (I think it looked like it), or is it the timestep from the start of an epoch?

Edit: Ok now I am almost certain it is the former, still want to confirm it to be sure

sampreet-arthi commented 4 years ago

Timestep from the very beginning

Devanshu24 commented 4 years ago
import gym

from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v1")                     # vectorised CartPole environment
agent = VPG('mlp', env)                            # vanilla policy gradient with an MLP policy
trainer = OnPolicyTrainer(agent, env, epochs=1000)
trainer.train()

I tried running VPG with this code. The mean_reward reaches a maximum of just 409.6 and then stays there (it kind of converges to 409.6). I'm not sure why that specific number, but I ran it multiple times and it's always the same.

Is there a way to get/plot the max_rewards for each epoch?

UPD: I ran it for 5000 epochs and it reaches a max mean_reward of 409.6 by ~2000 epochs; after that it starts crashing down and ends up in the 160ish range.

sampreet-arthi commented 4 years ago

At the end, add a trainer.evaluate(). That will make sure that the greedy policy is followed each time. Should give 500.

Devanshu24 commented 4 years ago

At the end, add a trainer.evaluate(). That will make sure that the greedy policy is followed each time. Should give 500.

Okay, but from what I understand it'll use the learnt policy, so even if it reached 500 just once during the learning phase it'll be able to achieve 500 during a greedy eval.

But when I ran it for 5000 epochs it doesn't converge to anywhere close to 500 (I may be wrong but this shouldn't happen right?)

Attached the log for reference vpg-genrl.txt

sampreet-arthi commented 4 years ago

What trainer.evaluate() does is make sure that whenever an action is selected, the deterministic policy is followed (see the VPG implementation). I'm not sure specifically about VPG, but it likely doesn't always follow a deterministic policy unless it's explicitly set to do so, like in evaluate.
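As a conceptual sketch (not the actual genrl source), the difference during evaluation comes down to taking the greedy action instead of sampling from the policy distribution:

import torch
from torch.distributions import Categorical

def select_action(dist: Categorical, deterministic: bool = False) -> torch.Tensor:
    # dist is the action distribution produced by the policy network
    if deterministic:
        return torch.argmax(dist.probs, dim=-1)   # greedy action, as used by evaluate()
    return dist.sample()                          # stochastic action, as used during training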

Devanshu24 commented 4 years ago

Oh okay, thanks! I'll look into the implementation again. My question was: even if it is following a stochastic policy, the policy should still improve over time (over the course of trainer.train), right?

sampreet-arthi commented 4 years ago

Yes, it should. But the stochasticity may remain the same, in which case the agent may have already learned the optimal policy but will still continue to explore in the same proportion. It's weird that it gets stuck at 409.6, but it's not a problem if the greedy policy converges.

Devanshu24 commented 4 years ago

Oh okay! I'll have to read up a bit more on this to get a better understanding, thanks!

threewisemonkeys-as commented 4 years ago

I can take up using multi armed bandits and contextual bandits.

@Darshan-ko can do this after #176 is merged (which will most likely happen today). There are a lot of significant changes for bandits in that PR.

Darshan-ko commented 4 years ago

@threewisemonkeys-as Ok cool.

Sharad24 commented 4 years ago

It's merged now, btw.

Darshan-ko commented 4 years ago

When we use a CNN for Atari envs, we get a feature vector by applying the CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.

threewisemonkeys-as commented 4 years ago

When we use CNNs as feature extractors in RL algorithms, we generally learn the CNN + MLP during training of the RL agent. So when you do loss.backward() for either the policy or the value function, the loss gets propagated all the way back to the CNN and optimises its representation specifically for that agent. So there is no need to train / pretrain it separately.

In fact, it would be pretty hard to train the CNN independently of the RL agent since you have no labelled data in this scenario.
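To make this concrete, here is an illustrative PyTorch sketch (not genrl's actual code) of a policy built from a CNN feature extractor and an MLP head; a single loss.backward() produces gradients for the parameters of both:

import torch
import torch.nn as nn

class CNNPolicy(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(                       # CNN feature extractor
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(                           # MLP head on top of the features
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs):
        return self.head(self.features(obs))

policy = CNNPolicy(n_actions=6)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)   # covers CNN and MLP parameters alike

obs = torch.randn(8, 4, 84, 84)                              # a batch of stacked Atari-style frames
actions = torch.randint(0, 6, (8,))                          # dummy actions, for illustration only
logits = policy(obs)
loss = -torch.distributions.Categorical(logits=logits).log_prob(actions).mean()
loss.backward()                                              # gradients flow all the way back into the CNN
optimizer.step()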

sampreet-arthi commented 4 years ago

When we use a CNN for Atari envs, we get a feature vector by applying the CNN to the state representation and then use an MLP on that feature vector, right? Do we load a pre-trained model for that CNN? I could not find the code for loading those parameters or even for training the CNN from scratch.

If you've noticed, we use the architecture string "cnn" for CNN architectures for Atari envs. What that does is use a CNNValue from deep/common/values.py.
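For example, something along these lines should pick the CNN architecture (the exact Atari env id and any wrapper arguments VectorEnv needs for Atari are assumptions here; check the docs for the supported setup):

from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("Pong-v0")                        # assumed Atari env id; frame preprocessing assumed to be handled by the wrapper
agent = VPG("cnn", env)                           # "cnn" selects the CNN feature extractor + MLP head
trainer = OnPolicyTrainer(agent, env, epochs=100)
trainer.train()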

Darshan-ko commented 4 years ago

@threewisemonkeys-as But doesn't it seem counter-intuitive that optimizing the loss function of the policy and value functions can help in learning the CNN parameters? As in, is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?

@sampreet-arthi I got that, but I could not find an explicit loss function or loss.backward() for the CNN, so I was confused about that.

threewisemonkeys-as commented 4 years ago

See, your policy pi is supposed to be any parameterised function mapping states to actions. For deep RL it's usually some neural network, but it can even be a simple linear function which multiplies the state by a certain value (the parameter).

When we do policy optimisation (through the policy gradient) we basically look at how the policy is performing and compute gradients in the direction of better performance. These gradients are w.r.t. the parameters of the policy, and we update those parameters accordingly to make the policy perform better.

Now in the case of Atari, the policy consists of both a CNN and an MLP, so the parameters of both need to be updated when optimising the policy. Over the course of training, the CNN will eventually learn to give the optimum features required for the policy to do well.
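For reference, the standard policy gradient being described, where \theta collects all the parameters of the policy network (CNN and MLP alike):

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a) \right]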

is it not possible that the parameters for the features we wish to extract from the state do not correspond to the optima of the policy or value functions, if that makes sense?

No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.

Darshan-ko commented 4 years ago

No, the features we want should be those that allow the policy to gain the most reward. These features don't really correspond to anything else apart from this.

Oh ok, I get it. So if I understood correctly, this is because there is no particular objective for the CNN other than providing features on which we can train the policy and value function and find an optimal policy, right?

threewisemonkeys-as commented 4 years ago

Yep

Darshan-ko commented 4 years ago

@threewisemonkeys-as Line 85 of trainer.py in genrl.bandit gives an AttributeError: 'UCBMABAgent' object has no attribute 'update_db' when training a UCBMABAgent on a BernoulliMAB bandit. I think this is because only contextual bandit agents have the update_db attribute. There is also an IndexError: list index out of range raised on line 91 of trainer.py. I tried fixing the AttributeError by temporarily removing the update_db call from trainer.py, and after modifying the arguments in the update_params call on line 91 the training works fine, ending with "Final Regret Moving Average: 0.324 | Final Reward Moving Average: 0.676". What would be a better solution to this problem?

threewisemonkeys-as commented 4 years ago

Are you transferring the database update into update_params? Because then it would only happen when we are updating the params, which can be undesirable (i.e. if we are building the db before updating params).

One way to fix this would be to check the type of the bandit agent using isinstance before calling update_db.
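A hypothetical sketch of that guard (ContextualBanditAgent is a placeholder name, not the actual genrl.bandit base class, and the update_db arguments are also assumed; the two update_params signatures follow the ones quoted below):

class ContextualBanditAgent: ...                  # placeholder for the real CB agent base class

def training_step(agent, context, action, reward, batch_size=64, train_epochs=20):
    if isinstance(agent, ContextualBanditAgent):  # CB agents maintain a transition db
        agent.update_db(context, action, reward)
        agent.update_params(action, batch_size, train_epochs)
    else:                                         # MAB agents update directly from the transition
        agent.update_params(context, action, reward)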

threewisemonkeys-as commented 4 years ago

Also, if anyone is working on adding tutorials, I have done some restructuring of the docs in #217.

Darshan-ko commented 4 years ago

Are you transferring the database update into update_params? Because then it would only happen when we are updating the params, which can be undesirable (i.e. if we are building the db before updating params)

No, I just deleted the update_db part from trainer.py because the agent I was training (UCBMAB) had no update_db attribute. I changed line 91 (the update_params call) from self.agent.update_params(action, kwargs.get("batch_size", 64), train_epochs) to self.agent.update_params(context, action, reward) after checking the update_params args.

One way to fix this would be to check the type of the bandit agent using isinstance before calling update_db

Oh ok got it.

threewisemonkeys-as commented 4 years ago

Actually sorry, my bad. The trainer is specifically for CB agents. Since MAB agents have their own learn methods, they don't need an external trainer.

threewisemonkeys-as commented 4 years ago

So there's no need to change the current trainer. One thing you could look at is making a trainer specifically for MAB agents. This would just require repurposing the learn method. It would be good for uniformity and would also ensure compatibility with the CLI.
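A rough sketch of what such a trainer could look like (the select_action/step method names and the interaction loop are placeholder assumptions, not the actual genrl interface; only the update_params signature follows this thread):

class MABTrainer:
    def __init__(self, agent, bandit):
        self.agent = agent
        self.bandit = bandit

    def train(self, timesteps=1000):
        # placeholder interaction loop: pick an arm, observe the reward,
        # then update the agent from the (context, action, reward) transition
        for t in range(timesteps):
            context = 0                                   # plain MABs effectively use a single int context
            action = self.agent.select_action(context)    # placeholder method name
            reward = self.bandit.step(action)             # placeholder method name
            self.agent.update_params(context, action, reward)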

Darshan-ko commented 4 years ago

So there's no need to change the current trainer. One thing you could look at is making a trainer specifically for MAB agents. This would just require repurposing the learn method. It would be good for uniformity and would also ensure compatibility with the CLI.

Yes, I will try doing that. I should add the trainer for MAB agents in the main trainer.py file, right?

threewisemonkeys-as commented 4 years ago

Yeah

Darshan-ko commented 4 years ago

Also, the context type in the multi-armed bandits has been set to "tensor". Shouldn't it be "int"?

threewisemonkeys-as commented 4 years ago

By default it is tensor since they will mostly be used with the cb_agents, but if they need to be used with mab_agents then the type should be set to int.

github-actions[bot] commented 4 years ago

Stale issue message

threewisemonkeys-as commented 4 years ago

@sampreet-arthi I think most of these have been completed right?

sampreet-arthi commented 4 years ago

Have saving/loading tutorials been added? And PPO?