Did you get this to work?
Has anyone gotten this to work yet? This feature is really important for my studies.
I just ran into this myself while working on my YouTube series. It would be amazing to do the "discovery" part of the ML as imitation learning, and then let PPO perform optimization.
@awjuliani @vincentpierre @unityjeffrey You guys have any comment on how one can start with imitation learning and use other forms of learning to improve?
This would be really useful for me too, thanks!
Yep, this feature makes a lot of sense! Would be very interested in it.
I'm trying to continue training the agent with PPO after training with Behavioral Cloning, but I lack the knowledge to do it.
EDIT:
I was able to change the Behavioral Cloning code so that the model it creates is closer to the one created by the PPO trainer. I also changed how the trained model is restored, so now I can load a BC-trained model and continue training it with PPO. At first, however, what was learned through BC gets forgotten once PPO training starts.
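Roughly, the idea is to restore only the variables that exist in both the BC and PPO graphs. A minimal TensorFlow 1.x sketch of that kind of partial restore (the checkpoint path is a placeholder and this is not the exact code I used, just the general shape of it):

```python
import tensorflow as tf

# Placeholder path -- point this at your own BC training run.
bc_checkpoint = tf.train.latest_checkpoint("./models/bc_run")

# Variable names stored in the BC checkpoint (checkpoint keys carry no ":0" suffix).
bc_var_names = {name for name, _ in tf.train.list_variables(bc_checkpoint)}

# Assuming the PPO graph has already been built in the default graph,
# keep only the variables whose names also exist in the BC checkpoint.
shared_vars = [v for v in tf.global_variables()
               if v.name.split(":")[0] in bc_var_names]

saver = tf.train.Saver(var_list=shared_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize everything first
    saver.restore(sess, bc_checkpoint)           # then overwrite the shared weights
    # ...continue PPO training from here...
```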
I also ran into this problem during my training. Is there a way to solve it?
@Ina299, there is another post where I share my way of continuing the training with PPO. But I have not verified whether continuing to learn with PPO actually improves things, or whether the previous BC training only disrupts the PPO learning.
Hi everyone, just wanted to give an update that this feature is on our roadmap. Though we don't have a specific timeline, we understand this would be a valuable addition.
Currently this won't be possible without changing the existing trainers.
Hi, I used PPO to "imitate" by giving a reward proportional to the absolute difference between the teacher's and the student's actions. In addition, I set the gamma hyperparameter to a very low number (10e-8) so that the learning process only takes the current step's reward into account.
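For the continuous-action case, a minimal sketch of that kind of reward (my interpretation: the reward should shrink as the gap between teacher and student grows; the function name and scale are just placeholders):

```python
import numpy as np

def imitation_reward(teacher_action, student_action, scale=1.0):
    """Per-step reward that is highest (zero) when the student's continuous
    action matches the teacher's, and more negative as the gap grows."""
    gap = np.abs(np.asarray(teacher_action) - np.asarray(student_action)).sum()
    return -scale * gap
```

With gamma set to something tiny like 10e-8, the return at each step is essentially just this one-step reward, so PPO ends up optimizing per-step agreement with the teacher rather than long-horizon return.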
@ishaybe, I use a Bomberman-like game as my research platform. My agent's actions are discrete, so in my case taking the absolute difference of the actions did not work very well. I simply gave a positive reward when the student did the same thing as the teacher and a negative reward when it chose a different action, but that did not work very well either. Later, when I tried to continue training with normal PPO, without mimicking the expert, my agent did not benefit from this previous training.
How would you make that difference with discrete actions?
You should lower the discount factor to nearly zero.
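For discrete actions there is no meaningful distance between action indices, so the usual substitute is a simple match/mismatch reward combined with that near-zero discount factor. A minimal sketch (the reward magnitudes are hypothetical placeholders, and gamma itself lives in the trainer hyperparameters, not in this function):

```python
def discrete_imitation_reward(teacher_action, student_action,
                              match_reward=0.1, mismatch_penalty=-0.1):
    """Match/mismatch reward for discrete action indices: reward the student
    for picking the same action index as the teacher, penalize it otherwise."""
    return match_reward if teacher_action == student_action else mismatch_penalty
```

With the discount factor near zero, each step's return reduces to this single reward, which keeps the comparison per-step rather than over the whole episode.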
Hi all. This is a feature we have in progress for an upcoming release. I will close this issue for now due to inactivity.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I tried to train my AI, but at some point it stopped improving; maybe it got stuck at some kind of saddle point.
So I used imitation learning to create a default model.
But when I use --load at the command line and train with PPO, it doesn't work well because some keys are missing from the imitated model.
How can I use the imitated model?
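One way to see exactly which keys are missing is to compare what the imitation checkpoint contains with the variables the PPO graph expects. A rough TensorFlow 1.x sketch (the checkpoint path is a placeholder):

```python
import tensorflow as tf

# Placeholder path -- point this at the imitation-learning run you want to reuse.
bc_checkpoint = tf.train.latest_checkpoint("./models/imitation_run")

# Variable names stored in the checkpoint.
ckpt_names = {name for name, _ in tf.train.list_variables(bc_checkpoint)}

# Variable names the PPO graph expects (the graph must already be built).
graph_names = {v.name.split(":")[0] for v in tf.global_variables()}

print("Expected by PPO but missing from checkpoint:", sorted(graph_names - ckpt_names))
print("In checkpoint but unused by the PPO graph:", sorted(ckpt_names - graph_names))
```

Once the missing variables are known, one option is to restore only the shared ones and let PPO initialize the rest, along the lines of the partial-restore sketch earlier in the thread.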