Did you get this to work?
Has anyone gotten this to work yet? This feature is really important for my studies.
I just ran into this myself while working on my YouTube series. It would be amazing to do the "discovery" part of the ML as imitation learning, and then let PPO perform optimization.
@awjuliani @vincentpierre @unityjeffrey You guys have any comment on how one can start with imitation learning and use other forms of learning to improve?
This would be really useful for me too, thanks!
Yep, this feature makes a lot of sense! Would be very interested in it.
I'm trying to continue training the agent with PPO after training with Behavioral Cloning, but I lack the knowledge to do it.
EDIT:
I was able to change the Behavioral Cloning code so that the model it creates is closer to the one created by the PPO trainer. I also changed how the trained model is restored, so now I can load a BC-trained model and continue training it with PPO. At first, however, what was learned through BC gets forgotten once PPO training starts.
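Roughly, the idea is to restore only the variables that exist in both the BC and PPO graphs. A minimal TensorFlow 1.x sketch of that kind of partial restore (the checkpoint path is a placeholder and this is not the exact code I used, just the general shape of it):

```python
import tensorflow as tf

# Placeholder path -- point this at your own BC training run.
bc_checkpoint = tf.train.latest_checkpoint("./models/bc_run")

# Variable names stored in the BC checkpoint (checkpoint keys carry no ":0" suffix).
bc_var_names = {name for name, _ in tf.train.list_variables(bc_checkpoint)}

# Assuming the PPO graph has already been built in the default graph,
# keep only the variables whose names also exist in the BC checkpoint.
shared_vars = [v for v in tf.global_variables()
               if v.name.split(":")[0] in bc_var_names]

saver = tf.train.Saver(var_list=shared_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize everything first
    saver.restore(sess, bc_checkpoint)           # then overwrite the shared weights
    # ...continue PPO training from here...
```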
I also ran into this problem during my training. Is there a way to solve it?
@Ina299, there is another post where I share my way of continuing the training with PPO. But I have not verified whether continuing to learn with PPO actually improves things, or whether the previous BC training only disrupts the PPO learning.
Hi everyone, just wanted to give an update that this feature is on our roadmap. Though we don't have a specific timeline, we understand this would be a valuable addition.
Currently this won't be possible without changing the existing trainers.
Hi, I used PPO to "imitate" by giving a reward proportional to the absolute difference between the teacher's and the student's actions. In addition, I set the gamma hyperparameter to a very low number (10e-8) so that the learning process only takes the current step's reward into account.
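For the continuous-action case, a minimal sketch of that kind of reward (my interpretation: the reward should shrink as the gap between teacher and student grows; the function name and scale are just placeholders):

```python
import numpy as np

def imitation_reward(teacher_action, student_action, scale=1.0):
    """Per-step reward that is highest (zero) when the student's continuous
    action matches the teacher's, and more negative as the gap grows."""
    gap = np.abs(np.asarray(teacher_action) - np.asarray(student_action)).sum()
    return -scale * gap
```

With gamma set to something tiny like 10e-8, the return at each step is essentially just this one-step reward, so PPO ends up optimizing per-step agreement with the teacher rather than long-horizon return.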
@ishaybe, I use a Bomberman-like game as my research platform. My agent's actions are discrete, so in my case taking the absolute difference of the actions did not work very well. I simply gave a positive reward when the student did the same thing as the teacher and a negative reward when it chose a different action, but that did not work very well either. Later, when I tried to continue training with normal PPO, without mimicking the expert, my agent did not benefit from this previous training.
How would you make that difference with discrete actions?
You should lower the discount factor to nearly zero.
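For discrete actions there is no meaningful distance between action indices, so the usual substitute is a simple match/mismatch reward combined with that near-zero discount factor. A minimal sketch (the reward magnitudes are hypothetical placeholders, and gamma itself lives in the trainer hyperparameters, not in this function):

```python
def discrete_imitation_reward(teacher_action, student_action,
                              match_reward=0.1, mismatch_penalty=-0.1):
    """Match/mismatch reward for discrete action indices: reward the student
    for picking the same action index as the teacher, penalize it otherwise."""
    return match_reward if teacher_action == student_action else mismatch_penalty
```

With the discount factor near zero, each step's return reduces to this single reward, which keeps the comparison per-step rather than over the whole episode.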
Hi all. This is a feature we have in progress for an upcoming release. I will close this issue for now due to inactivity.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I tried to train my AI, but at some point it stopped improving; maybe it got stuck at some kind of saddle point.
So I used imitation learning to create a default model.
But when I use --load at the command line and train with PPO, it doesn't work well because some keys are missing from the imitated model.
How can I use the imitated model?
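One way to see exactly which keys are missing is to compare what the imitation checkpoint contains with the variables the PPO graph expects. A rough TensorFlow 1.x sketch (the checkpoint path is a placeholder):

```python
import tensorflow as tf

# Placeholder path -- point this at the imitation-learning run you want to reuse.
bc_checkpoint = tf.train.latest_checkpoint("./models/imitation_run")

# Variable names stored in the checkpoint.
ckpt_names = {name for name, _ in tf.train.list_variables(bc_checkpoint)}

# Variable names the PPO graph expects (the graph must already be built).
graph_names = {v.name.split(":")[0] for v in tf.global_variables()}

print("Expected by PPO but missing from checkpoint:", sorted(graph_names - ckpt_names))
print("In checkpoint but unused by the PPO graph:", sorted(ckpt_names - graph_names))
```

Once the missing variables are known, one option is to restore only the shared ones and let PPO initialize the rest, along the lines of the partial-restore sketch earlier in the thread.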