DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io

[Feature Request] Refactor `predict` method in `BasePolicy` class #385

Closed tfurmston closed 3 years ago

tfurmston commented 3 years ago

🚀 Feature


At present, the predict method in the BasePolicy class contains quite a lot of logic that could be reused to provide similar functionality. In particular, the current logic of this method is as follows:

  1. Pre-process the NumPy observation and convert it into a PyTorch tensor.
  2. Generate action(s) from the child policy class through the _predict method, with these actions in the form of a PyTorch tensor.
  3. Post-process the actions, including converting the PyTorch tensor back into a NumPy array.

My suggestion is that steps (1) and (3) are refactored into individual functions on the BasePolicy class, which are then called in the predict method.
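
To make the proposal concrete, here is a minimal sketch of the split. The names are only illustrative (preprocess_observation is not an existing SB3 method, and the real predict signature takes more arguments):

import numpy as np
import torch as th

from stable_baselines3.common.policies import BasePolicy


class RefactoredPolicy(BasePolicy):

    def preprocess_observation(self, observation: np.ndarray) -> th.Tensor:
        # step (1): reshape the NumPy observation and move it to the policy device
        observation = np.array(observation).reshape((-1,) + self.observation_space.shape)
        return th.as_tensor(observation, device=self.device)

    def predict(self, observation: np.ndarray, deterministic: bool = False) -> np.ndarray:
        obs_tensor = self.preprocess_observation(observation)                 # step (1)
        with th.no_grad():
            actions = self._predict(obs_tensor, deterministic=deterministic)  # step (2)
        # step (3): post-process the actions and convert them back to NumPy
        return actions.cpu().numpy()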

Motivation

I would like to introduce some policy classes for which I can calculate the action probabilities and not the actions themselves. (This is for some work on off-policy estimation that I am doing.)

Let's call this functionality predict_probabilities. At present, its initial logic is identical to step (1) of the predict method, so if the code is refactored as suggested, both methods can share the same pre-processing functionality.
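
Continuing the sketch above, a custom policy could then reuse step (1) directly. This is only a rough illustration, assuming the custom policy exposes its action distribution through a (hypothetical) _get_action_distribution helper and uses a discrete action space:

    def predict_probabilities(self, observation: np.ndarray) -> np.ndarray:
        obs_tensor = self.preprocess_observation(observation)  # same step (1) as predict
        with th.no_grad():
            distribution = self._get_action_distribution(obs_tensor)  # hypothetical accessor
        # for a discrete action space the underlying object is a torch Categorical
        return distribution.distribution.probs.cpu().numpy()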

Additionally, I think the refactor would generally make the code more readable and make it easier to extend parts of the functionality to other, similar uses.

Pitch

I am happy to do a PR for the proposed refactor, so I would like to know whether or not you would be happy with the proposal.

Alternatives

None

Additional context

None


Miffyli commented 3 years ago

In summary, this would essentially be the same as the action_probability function in stable-baselines2 (not exactly the same, but the idea is in the same ballpark)?

This would be very much welcome! People have already asked for a similar function. It needs a bit of work, as currently the _predict function only returns the selected actions. I think a simple workaround is to do the steps you recommended, plus modify the _predict functions to also return the distribution along with the sampled actions, and pass that distribution to predict_probabilities.

As for splitting (1) and (3) into separate functions: I would keep (3) inside the predict function and split out only (1) into a separate function, since (3) is only needed by predict.
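
For what it's worth, a compact sketch of that workaround (purely illustrative; this variant was never merged, and _get_action_distribution is a made-up helper name):

def _predict(self, observation: th.Tensor, deterministic: bool = False):
    distribution = self._get_action_distribution(observation)
    actions = distribution.get_actions(deterministic=deterministic)
    # also hand back the distribution so that predict_probabilities can reuse it
    return actions, distribution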

tfurmston commented 3 years ago

Thanks for the response.

As for splitting (1) and (3) into separate functions: I would keep (3) inside the predict function and split out only (1) into a separate function, since (3) is only needed by predict.

Yes, this makes sense.

In summary, this would essentially be the same as the action_probability function in stable-baselines2 (not exactly the same, but the idea is in the same ballpark)?

This would be very much welcome! People have already asked for a similar function. It needs a bit of work, as currently the _predict function only returns the selected actions. I think a simple workaround is to do the steps you recommended, plus modify the _predict functions to also return the distribution along with the sampled actions, and pass that distribution to predict_probabilities.

Yes, it would be like this function. This would be a bigger change than I was originally anticipating, as it would not be backward compatible and would require changing quite a lot of classes. (At least, at first glance this seems to be the case, but maybe I am wrong?)

Originally I was hoping to split out predict as described above and then use the split-out functionality in my subclasses. Putting the entire functionality in stable-baselines3 does make sense, but I wonder if we could split it into two issues/PRs, with the original splitting as an initial PR. Would that be OK?

araffin commented 3 years ago

I would like to introduce some policy classes for which I can calculate the action probabilities and not the actions themselves. (This is for some work on off-policy estimation that I am doing.)

Do you want that feature for any possible input, or do you just need it to work for PyTorch tensors?

It's true that action_probability was part of SB2; however, in retrospect, it doesn't seem to be of much use... most of the time, users need direct access to the probability distribution, not just a NumPy array of probabilities, and for that, you can easily define a custom policy.

tfurmston commented 3 years ago

I do have a custom policy, yes. I am happy for it to be a custom class and not generic to all policies.

However, the initial part of my predict_probabilities function is identical to the initial part of the predict function, hence my desire to do the initial suggested refactor.

After the initial responses, it seems that this would just be a matter of moving the code prior to the call to _predict into a separate function in the base policy class. Would you be OK with such a change?

Miffyli commented 3 years ago

It's true that action_probability was part of SB2; however, in retrospect, it doesn't seem to be of much use... most of the time, users need direct access to the probability distribution, not just a NumPy array of probabilities, and for that, you can easily define a custom policy.

I think we should allow easy access to this, much like in SB2 (we could just pass along the distribution object that is already created as part of predict). This would make it easier to debug/study how the agent behaves. Maybe this should be part of evaluate_actions, to support setups where you do not provide actions (e.g. if no actions are provided, sample actions and return the sampled actions and the distribution)?

araffin commented 3 years ago

After the initial responses, it seems that this would just be a matter of moving the code prior to the call to _predict into a separate function in the base policy class. Would you be OK with such a change?

I think we should allow easy access to this, much like in SB2

I would disagree with providing the same feature as in SB2, as in my opinion it would add complexity for little to no usage (and additionally, this feature is not valid for DDPG/TD3). I would agree with separating out the code prior to _predict() if it were used somewhere else.

Note: I'm willing to change my mind if we get more requests by users for this feature ;)

This would make it easier to debug/study how agent behaves.

The user can normally access policy.distribution easily for debugging, or policy.evaluate_actions in the case of PPO, for instance.

EloyAnguiano commented 3 years ago

I think it could be useful for this kind of model in production. You could use the action probability to ask for help in a human-controlled system, for example.

I think it could be very useful, much as it is for classification problems with a scikit-learn model (via the predict_proba method). Also, I think it would be helpful to name it predict_proba rather than predict_probabilities, to make it easier for people used to that library.

Miffyli commented 3 years ago

I tend to agree with Eloy here. The usefulness of the probability distribution remains a question for the user, but it could be useful for debugging how the agent learns over time. As it stands, getting these probabilities out requires more in-depth knowledge of the internals.

@araffin how do you feel about this change now? I realize it cannot be done for all algorithms, but that's just how they work. Another good function would be to get the value estimates out (but that's for another time).

araffin commented 3 years ago

@araffin how do you feel about this change now? I realize it cannot be done for all algorithms, but that's just how they work.

Looking back at what was said, I still think we should not directly provide a predict_proba method, as it is highly algorithm-specific (see DQN vs PPO vs TD3 vs SAC), requires a good understanding of what type of object is being used, and can be easily implemented if one understands how an algorithm works.

Additionally, I think the refactor would generally make the code more readable and make it easier to extend parts of the functionality to other, similar uses.

Note: I'm willing to change my mind if we get more requests by users for this feature ;)

On the other hand, I do agree with @tfurmston that we should provide some function to pre-process the observation (and maybe post-process the result) and allow it to be re-used by the user when implementing a custom predict_proba, plus add some pointers in the documentation.

And for that, we would welcome a PR ;)

ziegenbalg commented 3 years ago

Please please make this happen, it would be so nice to see how 'certain' the neural net is of its action at each step. I'm working on a wind/tides program for myself to classify environmental conditions into 'go kiting' or 'don't go kiting' :) I know, that's not how the gym env is supposed to be used, but stable-baselines just makes it so easy to code that even I can do it. I don't understand the underlying mathematics well enough to make a PR for you guys. I tried hacking around with some print statements to no avail :( (maybe someone has a quick and dirty one for PPO).

EloyAnguiano commented 3 years ago

Please please make this happen, it would be so nice to see how 'certain' the neural net is of its action at each step. I'm working on a wind/tides program for myself to classify environmental conditions into 'go kiting' or 'don't go kiting' :) I know, that's not how the gym env is supposed to be used, but stable-baselines just makes it so easy to code that even I can do it. I don't understand the underlying mathematics well enough to make a PR for you guys. I tried hacking around with some print statements to no avail :( (maybe someone has a quick and dirty one for PPO).

@ziegenbalg Actually, there is a trick for this with PPO. As @Miffyli said above, you can use the evaluate_actions method of the policy object. This example worked for me (I think; maybe @Miffyli sees some error):

import numpy as np
import torch as th

# Convert the lists of observations and actions (from the surrounding code) to tensors
states_tensor = th.from_numpy(np.asmatrix(states))
actions_tensor = th.from_numpy(np.ones_like(np.asmatrix(actions)))  # probability of performing the action

values, log_prob, entropy = best_model.policy.evaluate_actions(states_tensor, actions_tensor)

# Convert log-probabilities back to probabilities
probs = np.exp(log_prob.detach().numpy())

Note that I'm trying to get the probability of performing an action (my action space is binary in this case), which is why I use the np.ones_like function.

Miffyli commented 3 years ago

@EloyAnguiano's snippet looks correct! Just bear in mind that actions_tensor should correctly reflect what your actions would look like (e.g. if you have a discrete action space, then it should be a one-hot vector with only one "1" and the rest zeros). It might not throw errors on wrong inputs (it was not designed to be used from the outside like this).

ziegenbalg commented 3 years ago

Thanks guys, looking into it. My action space is Discrete(2), and I have a Dict() observation space. Any tips on how to convert the Dict observation space to the tensors needed here?

Miffyli commented 3 years ago

I do not have code off the top of my head here, so I think your best bet is to print out what the inputs to that function look like and go from there. It might be as simple as a dictionary of torch tensors.
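
For what it's worth, a Dict observation can usually be converted entry by entry, which is roughly what stable_baselines3.common.utils.obs_as_tensor does for dict observations. A minimal sketch, assuming obs is the dict returned by the environment and model is the loaded PPO model (for a single observation you may also need to add a batch dimension, e.g. value[None, ...]):

import torch as th

# convert every entry of the dict observation to a tensor on the policy device
obs_tensor = {key: th.as_tensor(value, device=model.policy.device) for key, value in obs.items()}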

araffin commented 3 years ago

I finally had the time to make a PR here: https://github.com/DLR-RM/stable-baselines3/pull/559

I decided to decouple only (1) and (2), as the post-processing of the action is very simple and is used only by predict().
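
For readers arriving here later, the pre-processing can then be reused directly from the policy. A small example, assuming a recent SB3 version where BasePolicy.obs_to_tensor and the actor-critic get_distribution method are available, and a discrete action space:

import numpy as np
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1")
obs = model.env.reset()
# reuse the policy's own pre-processing; the second return value indicates whether the input was vectorized
obs_tensor, _ = model.policy.obs_to_tensor(obs)
distribution = model.policy.get_distribution(obs_tensor)
probs = distribution.distribution.probs.detach().cpu().numpy()
print("Action probabilities:", probs)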

ziegenbalg commented 3 years ago

Right on! I've sort of come up with a different way by looking at the policy code... I noticed the obs_as_tensor function too.

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.utils import obs_as_tensor

model = PPO.load(filename)
obs = env.reset()  # env is my (vectorized) environment
action, _states = model.predict(obs, deterministic=True)
category = env.action_map[int(action)]  # action_map is specific to my custom env
# redo the forward pass by hand to get at the action distribution
obs = obs_as_tensor(obs, model.policy.device)
latent_pi, _, latent_sde = model.policy._get_latent(obs)
distribution = model.policy._get_action_dist_from_latent(latent_pi, latent_sde)
actions = distribution.get_actions(deterministic=True)
values, log_prob, dis = model.policy.evaluate_actions(obs, actions)
log_prob = log_prob.cpu()
prob = np.exp(log_prob.detach().numpy())[0]
print(values, log_prob, dis)
print("Probability:", prob)
print("Category:", category)

It seems to work, but it's only giving me one probability, which is fine since I only have two actions so I subtract from 1 to get the other. Does that look right?

Btw, do you guys have a Patreon page? Stable-baselines is such an excellent project! It's taught me a lot about coding/machine learning and it's so straightforward. Love it!

Miffyli commented 3 years ago

(Sorry, the code formatting was messing up, so I removed it...)

Try putting your code ``` like this ```, that should look nice :)

It seems to work, but it's only giving me one probability, which is fine since I only have two actions so I subtract from 1 to get the other. Does that look right?

Kind of. Your code is answering the question "what is the log-probability of the action it chose?". You need to inspect the distribution variable if you want to know the probability of picking any one of the actions. The exact code depends on your action space, but for a Discrete space this would be distribution.distribution.probs (the distribution.distribution object is a PyTorch distribution object).
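
Concretely, continuing the snippet above (Discrete action space assumed):

# probabilities of every action, not just the sampled one
all_probs = distribution.distribution.probs.detach().cpu().numpy()
print("All action probabilities:", all_probs)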

Btw, do you guys have a Patreon page? Stable-baselines is such an excellent project! It's taught me a lot about coding/machine learning and it's so straightforward. Love it!

Nope, the maintainers are doing this in their free time and partially as part of their work :). The best way to contribute back is by giving comments, spotting errors, and, best of all, doing PRs to update things!

ziegenbalg commented 3 years ago

Well, I only have two actions, so if it answers the question of "the log-probability of the action it chose", then I can deduce the other action's probability easily.

And alrighty, though I think you guys should consider a patreon page, no reason to give up some free cake on the internet. Though I guess it would add a layer of democracy that's a little more work.

SimoMaestri commented 2 years ago

Is there a way to extract the output scores for DQN? The method described above only works for PPO.

araffin commented 2 years ago

Is there a way to extract the output scores for DQN? The method described above only works for PPO.

Duplicate of https://github.com/DLR-RM/stable-baselines3/issues/568

Karlheinzniebuhr commented 1 year ago

Any update on this? I've tried to derive the action like this, but it doesn't take the random seed into account and its predictions differ slightly from those of the predict() method.

import numpy as np
import torch

def predict_proba(self, features):
    # Reset the environment to its initial state and get the initial observation
    self.RLEnv = MyForexEnv(df=self.convert_ohlc_camelcase(features), window_size=10, frame_bound=(10, len(features)))
    obs = self.RLEnv.reset()
    obs = obs[np.newaxis, ...]  # add a batch dimension
    # Get the distribution object from the policy
    dis = self.model.policy.get_distribution(torch.from_numpy(obs))
    # Get the probabilities tensor from the distribution object
    probs = dis.distribution.probs
    # Convert the probabilities tensor to a NumPy array and detach it from the computation graph
    probs_np = probs.detach().numpy()[0]
    # Take the most likely action
    action = np.argmax(probs_np)
    # Return the action and the probabilities array
    return action, probs_np