In summary, this would essentially be the same as the `action_probability` function in stable-baselines2 (not exactly the same, but the idea is in the same ballpark)?
This would be very much welcome! People have already asked for a similar function. It needs a bit of work, as currently the `_predict` function only returns the selected actions. I think a simple work-around for this is to do the steps you recommended + modify the `_predict` functions to also return the distribution along with the sampled actions, and pass that distribution to `predict_probabilities`.
As for splitting (1) and (3) into separate functions: I would keep (3) inside the `predict` function and separate only (1) into its own function. (3) is only needed by `predict`.
Thanks for the response.
> As for splitting (1) and (3) into separate functions: I would keep (3) inside the `predict` function and separate only (1) into its own function. (3) is only needed by `predict`.
Yes, this makes sense.
> In summary, this would essentially be the same as the `action_probability` function in stable-baselines2 (not exactly the same, but the idea is in the same ballpark)?
> This would be very much welcome! People have already asked for a similar function. It needs a bit of work, as currently the `_predict` function only returns the selected actions. I think a simple work-around for this is to do the steps you recommended + modify the `_predict` functions to also return the distribution along with the sampled actions, and pass that distribution to `predict_probabilities`.
Yes, it would be like this function. This would be a bigger change than I was originally anticipating, as it would not be backward compatible and would require changing quite a lot of classes. (At least, at first glance this seems to be the case, but maybe I am wrong?)
Originally I was hoping to split out `predict` as originally described and then use the split-out functionality in my sub-classes. Putting the entire functionality in stable-baselines3 does make sense, but I wonder if we could split it into two issues/PRs, with the original splitting as an initial PR. Would that be ok?
> I would like to introduce some policy classes for which I can calculate the action probabilities and not the actions themselves. (This is for some work on off-policy estimation that I am doing.)
Do you want that feature for any possible input, or do you just need it to work for PyTorch tensors?
It's true that `action_probability` was part of SB2; however, in retrospect, it doesn't seem to be of much use... most of the time, users need direct access to the probability distribution, not just a numpy output of probabilities, and for that, you can easily define a custom policy.
I do have a custom policy, yes. I am happy for it to be a custom class and not generic to all policies.
However, the initial part of my `predict_probabilities` function is identical to the initial part of the `predict` function, hence my desire to do the initial suggested refactor.
After the initial responses it seems that this would just be a matter of moving the code prior to the call of `_predict` to a separate function in the base policy class. Would you be ok with such a change?
> It's true that `action_probability` was part of SB2; however, in retrospect, it doesn't seem to be of much use... most of the time, users need direct access to the probability distribution, not just a numpy output of probabilities, and for that, you can easily define a custom policy.
I think we should allow easy access for this, much like in SB2 (we could just pass along the distribution object that is already created as part of `predict`). This would make it easier to debug/study how the agent behaves. Maybe this should be part of `evaluate_actions`, to support setups where you do not provide actions (e.g. if no actions are provided, sample actions and return the sampled actions and the distribution)?
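For illustration, a rough sketch of that idea as a standalone helper rather than a change to `evaluate_actions` itself (the name `evaluate_or_sample` is hypothetical; this assumes an actor-critic policy such as PPO/A2C, a recent SB3 version exposing `get_distribution`, and an observation already converted to a tensor):

```python
import torch as th

def evaluate_or_sample(policy, obs_tensor, actions=None, deterministic=False):
    # Hypothetical helper: if no actions are provided, sample them from the
    # current policy distribution and return them along with the usual
    # evaluate_actions outputs (values, log-probabilities, entropy).
    if actions is None:
        with th.no_grad():
            distribution = policy.get_distribution(obs_tensor)
            actions = distribution.get_actions(deterministic=deterministic)
    values, log_prob, entropy = policy.evaluate_actions(obs_tensor, actions)
    return actions, values, log_prob, entropy
```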
> After the initial responses it seems that this would just be a matter of moving the code prior to the call of `_predict` to a separate function in the base policy class. Would you be ok with such a change?
> I think we should allow easy access for this, much like in SB2
I would disagree with providing the same feature as in SB2, as in my opinion it would add complexity for little to no usage (and additionally, this feature is not valid for DDPG/TD3).
I would agree with separating out the code prior to `_predict()` if it were used somewhere else.
Note: I'm willing to change my mind if we get more requests by users for this feature ;)
> This would make it easier to debug/study how the agent behaves.
The user can normally easily access `policy.distribution` for debugging, or `policy.evaluate_actions` in the case of PPO, for instance.
I think it could be useful for this kind of model in production. You could use that action probability to ask for help in a human-controlled system, for example.
I think it could be very useful, as in a classification problem with a scikit-learn model (using the `predict_proba` method). Also, I think it could be very helpful to name it like that (`predict_proba`) to make it easier for people used to that library, instead of `predict_probabilities`.
I tend to agree with Eloy here. The usefulness of the probability distribution remains a question for the user, but it could be useful for debugging how the agent learns over time. As it stands, getting these probabilities out requires more in-depth knowledge of the internals.
@araffin how do you feel about this change now? I realize it cannot be done for all algorithms, but that's just how they work. Another good function would be one to get value estimates out (but that's for another time).
> @araffin how do you feel about this change now? I realize it cannot be done for all algorithms, but that's just how they work.
Looking back at what was said, I still think we should not directly provide a `predict_proba` method, as it is highly algorithm specific (see DQN vs PPO vs TD3 vs SAC), requires a good understanding of what type of object is being used, and can be easily implemented if one understands how an algorithm works.
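To make the "algorithm specific" point concrete, here is a rough sketch of what extracting something probability-like looks like for PPO vs. DQN (assuming a recent SB3 version, a `Discrete` action space, trained models `ppo_model` and `dqn_model`, and a batched observation `obs` from a vectorized environment):

```python
import torch as th
from stable_baselines3.common.utils import obs_as_tensor

# PPO: the policy really is a distribution over actions, so probabilities exist.
obs_tensor = obs_as_tensor(obs, ppo_model.policy.device)
with th.no_grad():
    dist = ppo_model.policy.get_distribution(obs_tensor)
    ppo_probs = dist.distribution.probs.cpu().numpy()

# DQN: the policy is greedy over Q-values, so there is no action distribution;
# the closest thing you can extract are the raw Q-values themselves.
with th.no_grad():
    q_values = dqn_model.q_net(obs_as_tensor(obs, dqn_model.device)).cpu().numpy()
```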
> Additionally, I think the refactor would generally make the code more readable and easier to extend parts of the functionality to other similar uses.

Note: I'm willing to change my mind if we get more requests by users for this feature ;)
On the other hand, I do agree with @tfurmston that we should provide some function to pre-process the observation (and maybe post-process the result) and allow it to be re-used by the user when implementing a custom `predict_proba`, plus add some pointers in the documentation.
And for that, we would welcome a PR ;)
Please please make this happen, it would be so nice to see how 'certain' the neural net is of its action on each step. I'm working on a wind/tides program for myself to classify environmental conditions into 'go kiting' or 'don't go kiting' :) I know, not how the gym env is supposed to be used, but stable-baselines just makes it so easy to code that even I can do it. I don't understand the mathematics underneath it enough to make a PR for you guys. I tried hacking around some print statements to no avail :( (maybe someone has a quick and dirty one for PPO).
> Please please make this happen, it would be so nice to see how 'certain' the neural net is of its action on each step. I'm working on a wind/tides program for myself to classify environmental conditions into 'go kiting' or 'don't go kiting' :) I know, not how the gym env is supposed to be used, but stable-baselines just makes it so easy to code that even I can do it. I don't understand the mathematics underneath it enough to make a PR for you guys. I tried hacking around some print statements to no avail :( (maybe someone has a quick and dirty one for PPO).
@ziegenbalg Actually you can use a trick for this with PPO. As @Miffyli said above, you can use the `evaluate_actions` method of the policy object. This example worked for me (I think; maybe @Miffyli will spot an error):
```python
import numpy as np
import torch as th

# Convert lists of observations/actions to tensors
states_tensor = th.from_numpy(np.asmatrix(states))
# Probability of performing action "1" (binary action space), hence the ones
actions_tensor = th.from_numpy(np.ones_like(np.asmatrix(actions)))
values, log_prob, entropy = best_model.policy.evaluate_actions(states_tensor, actions_tensor)
probs = np.exp(log_prob.detach().numpy())  # exponentiate the log-probabilities
```
Note that I'm trying to get the probability of performing an action (my action space is binary in this case), therefore I use the `np.ones_like` function.
@EloyAnguiano's snippet looks correct! Just bear in mind that `actions_tensor` should correctly reflect what your actions would look like (e.g. if you have a discrete action space, then it should be a one-hot vector with only one "1" and the rest zeros). It might not throw errors on wrong inputs (it was not designed to be used outside like this).
Thanks guys, looking into it. My action space is a Discrete(2), and I have a Dict() observation space. Any tips on how to convert the Dict Observation space to the tensor needed here?
I do not have code off the top of my head here, so I think your best bet is to print out what the inputs to that function look like and follow that. It might be as simple as a dictionary of torch tensors.
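For what it's worth, `obs_as_tensor` from `stable_baselines3.common.utils` already handles dict observations, so one way (untested sketch, assuming a recent SB3 version with a `MultiInputPolicy` model, `obs` coming from the vectorized environment, and `model`/`actions_tensor` as in the earlier snippets) would be:

```python
from stable_baselines3.common.utils import obs_as_tensor

# obs is the dict of batched numpy arrays returned by the VecEnv
obs_tensor = obs_as_tensor(obs, model.policy.device)  # dict of torch tensors
values, log_prob, entropy = model.policy.evaluate_actions(obs_tensor, actions_tensor)
```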
I finally had the time to make a PR here: https://github.com/DLR-RM/stable-baselines3/pull/559
I decided to decouple only (1) and (2), as the post-processing is very simple for the action and is used only by `predict()`.
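With that change in place, a custom `predict_proba` can reuse the split-out pre-processing. A minimal sketch (the function name is illustrative; it assumes a `Discrete` action space and an SB3 version where the policy exposes `obs_to_tensor` and `get_distribution`):

```python
import torch as th

def predict_proba(model, observation):
    # Reuse the policy's observation pre-processing, then query the
    # action distribution directly instead of sampling an action.
    obs_tensor, _ = model.policy.obs_to_tensor(observation)
    with th.no_grad():
        distribution = model.policy.get_distribution(obs_tensor)
        probs = distribution.distribution.probs
    return probs.cpu().numpy()
```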
Right on! I've sort of come up with a different way by looking at the policy code... I noticed the `obs_as_tensor` function too.
```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.utils import obs_as_tensor

model = PPO.load(filename)
obs = env.reset()  # env is the (vectorized) environment
action, _states = model.predict(obs, deterministic=True)
category = env.action_map[int(action)]  # custom mapping from action index to label

# Rebuild the action distribution from the observation
obs = obs_as_tensor(obs, model.policy.device)
latent_pi, _, latent_sde = model.policy._get_latent(obs)
distribution = model.policy._get_action_dist_from_latent(latent_pi, latent_sde)
actions = distribution.get_actions(deterministic=True)
# evaluate_actions returns values, log-probabilities and entropy
values, log_prob, entropy = model.policy.evaluate_actions(obs, actions)
prob = np.exp(log_prob.detach().cpu().numpy())[0]

print(values, log_prob, entropy)
print("Probability: ", prob)
print("Category: ", category)
```
It seems to work, but it's only giving me one probability, which is fine since I only have two actions so I subtract from 1 to get the other. Does that look right?
Btw, do you guys have a Patreon page? Stable-baselines is such an excellent project! It's taught me a lot about coding/machine learning and it's so straightforward. Love it!
(Sorry, the code formatting was messing up, so removed it...)
Try putting your code ``` like this ```, that should look nice :)
> It seems to work, but it's only giving me one probability, which is fine since I only have two actions so I subtract from 1 to get the other. Does that look right?
Kind of. Your code is answering the question "what is the log-probability of the action it chose?". You need to inspect the `distribution` variable if you want to know the probability of picking any one of the actions. The exact code depends on your action space, but for a Discrete space this would be `distribution.distribution.probs` (the `distribution.distribution` object is a PyTorch distribution object).
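Concretely, continuing the snippet above (this assumes a `Discrete` action space, where `distribution.distribution` is a `torch.distributions.Categorical`):

```python
# Reusing `distribution` from the snippet above
all_probs = distribution.distribution.probs.detach().cpu().numpy()[0]
print(all_probs)           # one probability per action, summing to 1
print(all_probs.argmax())  # index of the most likely action
```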
> Btw, do you guys have a Patreon page? Stable-baselines is such an excellent project! It's taught me a lot about coding/machine learning and it's so straightforward. Love it!
Nope, maintainers are doing this in their free time and partially for their work :). The best way to contribute back is by giving comments, spotting errors and, best of all, doing PRs to update things!
Well, I only have two actions, so if it answers the question of "the log-probability of the action it chose", then I can deduce the other action's probability easily.
And alrighty, though I think you guys should consider a patreon page, no reason to give up some free cake on the internet. Though I guess it would add a layer of democracy that's a little more work.
Is there a way to extract the output score for DQN? The method described above only works for PPO.
> Is there a way to extract the output score for DQN? The method described above only works for PPO.
Duplicate of https://github.com/DLR-RM/stable-baselines3/issues/568
Any update on this? I've tried to derive the action like this, but it doesn't take the random seed into account and its predictions differ slightly from the `predict()` method.
```python
import numpy as np
import torch

def predict_proba(self, features):
    # Reset environment to initial state and get the initial observation
    self.RLEnv = MyForexEnv(df=self.convert_ohlc_camelcase(features), window_size=10, frame_bound=(10, len(features)))
    obs = self.RLEnv.reset()
    obs = obs[np.newaxis, ...]  # add a batch dimension
    # Get the distribution object from the policy
    dist = self.model.policy.get_distribution(torch.from_numpy(obs))
    # Get the probabilities tensor from the distribution object
    probs = dist.distribution.probs
    # Convert the probabilities tensor to a NumPy array, detached from the computation graph
    probs_np = probs.detach().numpy()[0]
    # Take the most likely action
    action = np.argmax(probs_np)
    # Return the action and the probabilities array
    return action, probs_np
```
🚀 Feature
At present the `predict` method in the `BasePolicy` class contains quite a lot of logic that could be reused to provide similar functionality. In particular, the current logic of this method is as follows:

1. Pre-process the given observation, converting it into a PyTorch tensor suitable for the policy.
2. Obtain the actions from the `_predict` method, with these actions in the form of a PyTorch tensor.
3. Post-process the actions into the expected output format.

My suggestion is that steps (1) and (3) are refactored into individual functions on the `BasePolicy` class, which are then called in the `predict` method.

Motivation
I would like to introduce some policy classes for which I can calculate the action probabilities and not the actions themselves. (This is for some work on off-policy estimation that I am doing.)
Let's call this functionality `predict_probabilities`; at present the initial logic of this functionality is identical to step (1) of the `predict` method. If the code is refactored as suggested, then both approaches can use the same pre-processing functionality.

Additionally, I think the refactor would generally make the code more readable and easier to extend parts of the functionality to other similar uses.
Pitch
I am happy to do a PR for the proposed refactor, so I would like to know whether or not you would be happy with the proposal.
Alternatives
None
Additional context
None