Hey no worries at all. Yeah the rl examples are currently not set up with environment interaction, they are really only focusing on the core training loop.
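To make that concrete, the missing piece is roughly the loop below. This is a generic, gym-style sketch in Python; env, agent, and buffer (and their methods) are placeholder names I'm making up for illustration, not dfdx or minimalRL API:

rollout_length = 128                                   # how many steps to collect before each update
obs = env.reset()                                      # current state of the environment
for step in range(rollout_length):
    action, log_prob = agent.sample_action(obs)        # policy picks an action for this state
    next_obs, reward, done = env.step(action)          # environment reacts and hands back a reward
    buffer.store(obs, action, reward, log_prob, done)  # remember the transition for training
    obs = env.reset() if done else next_obs
agent.update(buffer)                                   # the PPO loss/optimizer step

Everything before the last line is the environment-interaction part you'd be writing yourself; the last line is roughly what the dfdx example covers.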
I'd recommend checking out some Python implementations of the RL training setup. A good one I've used before is minimalRL; you can find their PPO implementation at https://github.com/seungeunrho/minimalRL/blob/master/ppo.py. The dfdx example code really only implements a small portion of that logic, roughly lines 75-85 (https://github.com/seungeunrho/minimalRL/blob/master/ppo.py#L75).
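For context, the heart of that section is the standard clipped-surrogate PPO update. From memory it boils down to something like the snippet below (dummy tensors stand in for the rollout data, and the variable names are approximate rather than an exact quote of the file):

import torch
import torch.nn.functional as F

# Dummy tensors standing in for quantities computed from a rollout.
log_prob_new = torch.tensor([0.2, -0.1])
log_prob_old = torch.tensor([0.1, -0.3])
advantage    = torch.tensor([1.0, -0.5])
v_s          = torch.tensor([0.4, 0.6])    # value estimates for the visited states
td_target    = torch.tensor([0.5, 0.2])
eps_clip     = 0.1

ratio = torch.exp(log_prob_new - log_prob_old)                      # pi_new(a|s) / pi_old(a|s)
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantage  # limit how far the policy may move
loss  = -torch.min(surr1, surr2) + F.smooth_l1_loss(v_s, td_target) # policy loss + value loss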
Another great resource is OpenAI's Spinning Up site: https://spinningup.openai.com/en/latest/. It is like minimalRL but includes a ton of useful background information on RL that will be helpful for you.
I'd also recommend checking out this blog post https://monadmonkey.com/bevy-dfdx-and-the-classic-cart-pole, which does a lot of what you're asking, but for a DQN.
One complication for you is that it sounds like you want real-valued actions (how much power to send to the heater is an f32). This is a disconnect from the current rl-ppo example, which expects discrete actions (basically an enumeration). minimalRL also has a continuous-action PPO example here: https://github.com/seungeunrho/minimalRL/blob/master/ppo-continuous.py. The main difference is how they calculate the log probability of the action:
from torch.distributions import Normal      # needed for the continuous case

mu, std = self.pi(s, softmax_dim=1)         # policy now outputs a mean and std instead of logits
dist = Normal(mu, std)
log_prob = dist.log_prob(a)
ratio = torch.exp(log_prob - old_log_prob)  # a/b == exp(log(a) - log(b))
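If dfdx doesn't give you a Normal distribution type out of the box, the log-probability is also easy to write by hand, since it's just the Gaussian log-density. A quick Python sanity check (the numbers are arbitrary):

import math
import torch
from torch.distributions import Normal

mu, std, a = torch.tensor(0.3), torch.tensor(0.5), torch.tensor(0.1)
# log N(a; mu, std) = -0.5 * ((a - mu) / std)^2 - log(std) - 0.5 * log(2 * pi)
manual = -0.5 * ((a - mu) / std) ** 2 - torch.log(std) - 0.5 * math.log(2 * math.pi)
assert torch.allclose(manual, Normal(mu, std).log_prob(a))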
Honestly, all the questions you're asking are good questions. One of the reasons RL is so hard is that there are so many decision points (as you've noticed), where each choice can make a big difference in the final agent.
Thanks a lot! I will certainly give it a try :)
Sure thing! Will close for now - either re-open or start a new issue if you have more questions!
Would it be ok if I started a Draft PR for the beginning of my attempt at an examples/rl-ppo-continuous.rs?
Sure!
Hi!
Sorry if this is the wrong place for this question.
I am new to machine learning; I have played around a bit with supervised learning, following along with an MNIST tutorial a couple of years ago. With that in mind, sorry for all the dumb questions. This time around I thought I'd try reinforcement learning.
To get started experimenting with reinforcement learning, I imagine setting up some very simple environment, like a simulated thermostat, to train the network on. If I understand things correctly, I am supposed to provide some kind of reward every once in a while depending on how we are doing.
For example a system like this:
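Something along these lines, with completely made-up numbers and dynamics, just to illustrate what I mean:

class Thermostat:
    def __init__(self, target=21.0):
        self.target = target           # temperature we want to hold
        self.temp = 15.0               # current room temperature
    def step(self, heater_power):      # heater_power would be the f32 the network outputs
        self.temp += 0.5 * heater_power - 0.1 * (self.temp - 10.0)  # invented physics
        reward = -abs(self.temp - self.target)                      # closer to target => higher reward
        return self.temp, reward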
With examples/rl-ppo.rs as base, where would I put in my logic?
I assume the state is the state of the environment, so the input. Which one is the output? In my case, which variable would represent the heater power? Could it be logits, or is it action? However, action seems to get its value before everything else, so probably not? Where do I put in my reward calculation? Also, for other situations where the reward might not be known until the end of the run, is there anything specific to keep in mind? Again, sorry for all the questions.