Hello @speedhawk! To be honest, I can't be sure why your agent drives around in circles. It is a very common issue and depends on a lot of things, e.g. the action scheme, continuous vs. discrete action space, the reward for turning, etc., and combinations of those. I'd be happy to take a quick look at your code to maybe provide some more insights, if you want to share it.
As for the action mask, feel free to read the description I've written in the relevant PR about the newest version of find and avoid. In the branch where the example currently lives, I have uploaded a trained agent and the tensorboard logs. You can check out that branch locally and run it yourself using the trained agent, as long as you change this flag to True. Then, with the other flags, you can play around with deterministic or non-deterministic actions, use masking or not, etc., and observe how the agent behaves differently.
I can tell you for sure that masking is what makes the agent I trained solve the problem on almost all the random maps during evaluation (near 100% accuracy). If I remember correctly, without the mask (unmasked/regular PPO) the best accuracy I achieved was around 60%-70%. By accuracy I mean the ratio of maps on which the robot reached the target within the episode time limit. It would be quite an achievement to match the almost-perfect results without the mask, but I think that would take larger networks, maybe recurrent ones, and a different reward function. The mask I ended up using is quite complicated and basically hard-codes some "expert" knowledge about avoiding obstacles and moving to a target, and it works really well in concert with the PPO agent. Nevertheless, I think there is room for improvement in the mask itself. With some colleagues, we have submitted a research paper on this work to a conference and expect it to be published in the coming months.
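For reference, here is a minimal sketch of how discrete action masking can be wired into PPO with sb3-contrib's MaskablePPO and the ActionMasker wrapper (this is not the actual code from the example; the environment and the mask function are placeholders, assuming a recent stable-baselines3/gymnasium setup):

```python
import numpy as np
import gymnasium as gym
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def mask_fn(env: gym.Env) -> np.ndarray:
    # One boolean per discrete action; actions marked False are never sampled.
    # A real mask would encode "expert" knowledge, e.g. forbid actions that
    # steer the robot towards a detected obstacle or away from the target.
    return np.ones(env.action_space.n, dtype=bool)


env = gym.make("CartPole-v1")      # placeholder env with a Discrete action space
env = ActionMasker(env, mask_fn)   # exposes the mask to the agent at every step
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```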
Hi! First, I am very grateful for your detailed explanation, covering both the problem itself and your views on the action mask. Actually, the thought of this technique came to my mind at the very beginning of my project, which has a large, continuous state space but, fortunately, a discrete action space. So I found your "find and avoid" project while thinking about how to prevent the agent from irrational behaviors in some special situations. In addition, I have read two papers in which 'dynamic action rationalization', as I call it, is extended from discrete to continuous action spaces. Because of the horrifying complexity, I finally dropped the idea of implementing this method in my project >_< As for the current stage of my project, I should ask my supervisor whether I can share it or not. And how about this: would you mind if I asked you for an online meeting via Zoom, Teams, or other software you like, after my supervisor consents? That way I could both share my code and ask you some questions about the action mask. Thank you!
Actually, the thought of this technique came to my mind at the very beginning of my project, which has a large, continuous state space but, fortunately, a discrete action space.
I am not sure I understand exactly what you mean here. In general, the problem is probably better suited either to a continuous action space where the agent directly controls the motor speeds, or to a higher-level discrete action space where the agent decides on the next "move", which might be driving forward a set number of centimeters or turning a set number of degrees, out of a discrete collection of possible actions. In find and avoid v2 on deepworlds, I used a discretized direct control of the motor speeds, to be able to use the action-masking methodology in a straightforward manner, but many other action schemes can be used, and I also tried others (like the collection of moves I described earlier) that worked quite well.
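For illustration only (the actual example differs in its details), a discretized direct-control action set for a differential drive robot could be built like this, with each discrete action index mapping to a pair of wheel speeds; apply_action, SPEED_LEVELS and max_speed are hypothetical names:

```python
import numpy as np

# Each wheel can take one of a few speed levels, expressed as a fraction of
# the maximum motor speed; every (left, right) combination is one discrete action.
SPEED_LEVELS = np.linspace(-1.0, 1.0, 5)                             # 5 levels per wheel
ACTION_TABLE = [(l, r) for l in SPEED_LEVELS for r in SPEED_LEVELS]  # 25 discrete actions


def apply_action(action_index, max_speed=6.28):
    # Hypothetical helper: map a discrete action index to the motor commands
    # you would send through your robot/simulator API (e.g. Webots motors).
    left_frac, right_frac = ACTION_TABLE[action_index]
    return left_frac * max_speed, right_frac * max_speed
```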
I actually have some simple ideas about implementing some kind of action masking on a continuous action space, but I don't have the time to work on it at the moment.
As for the current stage of my project, of course I can share it with you, but I don't know how to send it.
You can create a repository, push your code there, and share the link afterwards; I will be happy to check it out when I get a chance.
And how about this: would you mind if I asked you for an online meeting via Zoom, Teams, or other software you like? That way I could both share my code and ask you some questions about the action mask.
Unfortunately, I am not available right now for an online meeting, but we can exchange some ideas asynchronously here! :)
From personal experience, I generally suggest reading papers that are specific to obstacle avoidance for differential drive robots or similar, using generic techniques or reinforcement learning techniques, to try to get some ideas, instead of delving into papers that present modifications or extensions of RL algorithms applied to other problems. Those papers can surely give you ideas, but I think sticking to a well-established RL algorithm like PPO, implemented in frameworks like stable-baselines3, and working on the problem itself can be more productive when starting out.
Hi. Apologies for replying so late! I am sorry to tell you that, after seeking advice from my supervisor, I cannot share my code for now :( He said the results of this project may be deployed in another research project yet to be decided. I hope you can understand my difficulty and forgive my going back on my word :( The code may be exhibited here in the near future, but for now it is not allowed to be shared without my supervisor's permission. Anyway, I am still very grateful for your discussing this with me and for your willingness to help me improve my code. For now I have decided to use a continuous action space instead of a discrete one, and to try to incorporate "behavior rationalization" into this continuous action space with the PPO algorithm >_<. Hope to discuss with you another time if possible! Many thanks!
Oh, BTW, may I ask whether it is correct to assign a Box class instead of a Discrete class to the action_space argument when I want a continuous action output? Many thanks!
Oh, BTW, may I ask whether it is correct to assign a Box class instead of a Discrete class to the action_space argument when I want a continuous action output? Many thanks!
You can check how we set the continuous action space here, as an example for our continuous action space cartpole. This is in contrast with this, which is for the discrete action space cartpole.
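In short, a continuous action output uses a Box space and a discrete one uses a Discrete space. As a rough sketch (illustrative shapes and bounds, written against the gymnasium API):

```python
import numpy as np
from gymnasium import spaces

# Continuous actions: e.g. two motor speeds, each in [-1.0, 1.0]
continuous_action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

# Discrete actions: e.g. 4 high-level moves (forward, backward, turn left, turn right)
discrete_action_space = spaces.Discrete(4)
```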
Hi. Apologies for replying so late! I am sorry to tell you that I cannot ...
No problem, speedhawk! It's fully understandable. I would be very curious to see the published results you achieve in the future! I wish you the best! May I ask for some paper sources on "behavior rationalization" from which you draw inspiration, just out of curiosity to read up on the subject?
Feel free to open another discussion thread anytime! :smile:
Wow! I thought I could not reply to you after closing the issue >_<
May I ask for some paper sources on "behavior rationalization" from which you draw inspiration, just out of curiosity to read up on the subject?
Actually, "behavior rationalization" is my self-defined conception of the analogous thoughts of "action mask" in your code. I hastily defined it because I don't know how to describe it better😄. This is not a rigorious concept. And I think I can share you the link of a paper resource, which is one of the good papers I read before. However, this method is based on continuous action space instead of discrete: https://ojs.aaai.org/index.php/AAAI/article/view/5739 And hope this can help😄
Thank you @speedhawk!
Hi, sorry to trouble you again. In recent months I have implemented a 'find-and-avoid' project by myself, following your ideas for the algorithm parts and other papers I read for the reward parts. However, the agent ends up going around in circles. After checking repeatedly, I think the most likely reason is reward sparsity, which commonly leads to poor exploration, so the behavior falls into a local optimum. Based on this, I hope to discuss two questions with you; I hope this will not take too much of your time!
Q1. Based on your experience, how much improvement could I expect if I deploy curiosity-driven learning to address the poor exploration?
Q2. I found that another factor differing from your setup is the action mask. I have not deployed this method yet, but may I ask how much improvement it would bring if implemented?
Thank you very much! I'll wait for your reply.