coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

How to use rl-ppo example #387

Closed. usbalbin closed this issue 1 year ago.

usbalbin commented 1 year ago

Hi!

Sorry if this is the wrong place for this question.

I am new to machine learning; I played around a bit with supervised learning a couple of years ago, following along with an MNIST tutorial. With that in mind, sorry for all the dumb questions. This time around I thought I'd try reinforcement learning.

To get started experimenting with reinforcement learning, I imagine setting up some very simple environment, like a simulated thermostat, to train the network on. If I understand things correctly, I am supposed to provide some kind of reward every once in a while depending on how well we are doing.

For example, a system like this:

struct State {
    heater_power: f32,
    temp: f32,
}

// Step the simulation one tick
fn simulate(last_state: State, action: f32) -> State {
    State {
        // The next temperature depends on the previous heater
        // power, to simulate some delay and make the problem
        // slightly more challenging.
        temp: last_state.temp + last_state.heater_power,

        // `action` represents how much power to send to the heater
        heater_power: action,
    }
}

fn reward(temperature: f32) -> f32 {
    // Highest reward at 20 degrees, lower the further away we are
    20.0 - (temperature - 20.0).abs()
}
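
Roughly, I picture running the simulation like this (just a sketch of my mental model, with a placeholder instead of a real policy):

// Sketch: step the simulation for a number of ticks and sum up the reward.
fn run_episode(mut state: State, steps: usize) -> f32 {
    let mut total_reward = 0.0;
    for _ in 0..steps {
        // Placeholder action; this is the value I want the network to choose.
        let action = 0.5;
        state = simulate(state, action);
        total_reward += reward(state.temp);
    }
    total_reward
}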

With examples/rl-ppo.rs as a base, where would I put my logic?

I assume the state is the state of the environment, so that's the input. Which one is the output? In my case, which variable would represent the heater power? Could it be logits, or is it action? However, action seems to get its value before everything else, so probably not? Where do I put my reward calculation? Also, for situations where the reward is not known until the end of the run, is there anything specific to keep in mind?

Again sorry for all the questions

coreylowman commented 1 year ago

Hey, no worries at all. Yeah, the RL examples are currently not set up with environment interaction; they really only focus on the core training loop.

I'd recommend checking out some Python implementations of the RL training setup. A good one I've used before is minimalRL; you can find their PPO implementation at https://github.com/seungeunrho/minimalRL/blob/master/ppo.py. The dfdx example code really only implements a small portion of that logic, roughly lines 75-85 (https://github.com/seungeunrho/minimalRL/blob/master/ppo.py#L75).
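
To make that concrete for your thermostat, the environment-interaction part the example leaves out is basically a rollout loop: observe the state, ask the policy for an action, step the simulation, record the reward. Here's a rough sketch reusing your State/simulate/reward from above, with policy as a stand-in for whatever dfdx model you end up building (not the real API):

// One recorded step of interaction with the environment.
struct Transition {
    obs: [f32; 2], // (heater_power, temp) -> the network input
    action: f32,   // heater power chosen by the policy -> the network output
    reward: f32,   // reward observed after taking the action
}

// Collect a trajectory by alternating policy decisions and simulation steps.
fn collect_rollout(
    mut state: State,
    policy: impl Fn(&State) -> f32,
    steps: usize,
) -> Vec<Transition> {
    let mut trajectory = Vec::with_capacity(steps);
    for _ in 0..steps {
        let obs = [state.heater_power, state.temp]; // what the network sees
        let action = policy(&state);                // what the network decides
        let next_state = simulate(state, action);   // environment step
        let r = reward(next_state.temp);            // score where we ended up
        trajectory.push(Transition { obs, action, reward: r });
        state = next_state;
    }
    trajectory
}

The (obs, action, reward) tuples you collect this way are what you'd batch up and feed into the PPO update step that the dfdx example does implement.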

Another great resource is openai's spinningup site: https://spinningup.openai.com/en/latest/. It is like minimalRL, but includes a ton of useful background information on RL that will be helpful for you.

I'd also recommend checking out this blog post https://monadmonkey.com/bevy-dfdx-and-the-classic-cart-pole, which does a lot of what you're asking, but for a dqn.

One complication for you is that it seems like you want real-valued actions (how much power to send to the heater is an f32). This is a disconnect from the current rl-ppo example, as it expects discrete actions (basically an enumeration). minimalRL also has a continuous-action PPO example here: https://github.com/seungeunrho/minimalRL/blob/master/ppo-continuous.py. The main difference is how they calculate the log probability of the action:

mu, std = self.pi(s, softmax_dim=1)
dist = Normal(mu, std)
log_prob = dist.log_prob(a)
ratio = torch.exp(log_prob - old_log_prob)  # a/b == exp(log(a)-log(b))
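
That log_prob has a simple closed form, so if you want to prototype it before wiring up tensors, you can write it out by hand (a plain-Rust sketch, not the dfdx API):

use std::f32::consts::PI;

// Log-density of Normal(mu, std_dev) at `action`; this is what
// `dist.log_prob(a)` computes in the snippet above.
fn normal_log_prob(action: f32, mu: f32, std_dev: f32) -> f32 {
    let z = (action - mu) / std_dev;
    -0.5 * z * z - std_dev.ln() - 0.5 * (2.0 * PI).ln()
}

// The PPO ratio a/b == exp(log(a) - log(b)), as in the comment above.
fn ppo_ratio(log_prob: f32, old_log_prob: f32) -> f32 {
    (log_prob - old_log_prob).exp()
}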

Honestly, all the questions you are asking are good questions. One of the reasons RL is so hard is that there are so many decision points (as you've noticed), where each choice can make a big difference in the final agent.

usbalbin commented 1 year ago

Thanks a lot! I will certainly give it a try :)

coreylowman commented 1 year ago

Sure thing! Will close for now - either re-open or start a new issue if you have more questions!

usbalbin commented 1 year ago

Would it be ok if I started a Draft PR for the beginning of my attempt at an examples/rl-ppo-continuous.rs?

coreylowman commented 1 year ago

Sure!