@ManifoldFR may I ask how you obtained the values for the model/policy hyperparams? Did you perform tuning using Optuna as in the RL zoo?
I started from the parameters of Jason Peng's code, but for things like the maximum grad norm, target KL or vf coef I had to make guesses because these were not parameters in his PPO implementation (also he had two separate optimizers for the policy and value functions).
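For reference, those knobs sit in stable-baselines3's PPO constructor roughly like this (the numbers below are placeholder guesses, not values from any of the runs discussed here, and the env is just a stand-in):
# Placeholder sketch only: the values are guesses, not tuned settings, and
# Pendulum-v1 stands in for the DeepMimic gym env.
import gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")  # stand-in for the DeepMimic humanoid env

model = PPO(
    "MlpPolicy",
    env,
    max_grad_norm=0.5,  # not a parameter in Jason Peng's PPO, so guessed
    target_kl=None,     # likewise guessed; None disables KL-based early stopping
    vf_coef=0.5,        # one optimizer for both heads, so the value loss is weighted in
    verbose=1,
)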
Training sometimes gets stuck in such behavior. Did you try a couple of training runs?
What about the discount factor and lambda parameter for TD(lambda)? Also, are you using my branch with the modifications to the Gym env? Here's a Dropbox link with a policy trained with this (I think for that run I set a slightly higher learning rate).
@erwincoumans I tried a couple of runs using my script and another one using the training script from the stable-baselines3 zoo.
@ManifoldFR I used the default values for the discount factor and lambda parameter. Did you use custom values? I figured you also used the default ones, given that you didn't list them with the other params. I used the version with action/observation scaling, so I guess it's the same.
Sorry about that, I use a strategy where I have a default set of PPO params on top of SB3's defaults, and the values I gave you were the overrides for both of them. Check the hyperparams.yml in the Dropbox link I sent; I use the same discount and lambda (0.95) as Jason Peng. I think one of the important things was that I use 4096 timesteps per env per rollout.
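Roughly, the overrides described above correspond to the following (a reconstruction for illustration; the hyperparams.yml in the Dropbox link is the authoritative version):
# Reconstruction of the overrides described above; check hyperparams.yml in the
# Dropbox link for the actual values used.
ppo_overrides = dict(
    gamma=0.95,       # discount factor, same as Jason Peng's DeepMimic
    gae_lambda=0.95,  # lambda for TD(lambda) / GAE
    n_steps=4096,     # timesteps collected per env per rollout
)
# Everything not listed falls back to SB3's PPO defaults, e.g.:
# model = PPO("MlpPolicy", env, **ppo_overrides)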
Ah I see, no worries!
I was wondering whether I was doing something wrong in the training setup or when loading the model, but I figured there might be something wrong with the parameters, given that the training would get stuck.
Yes, the method is quite brittle, I'm afraid; some hyperparameters can send you to very bad local minima. Have you looked at other papers like Facebook's ScaDiver? The approach is the same, but the subreward aggregation/early termination strategies are different. Maybe it's more robust, but I haven't tested it yet.
I haven't read the paper, but I saw their repo and video; it seems very promising. I am trying to stick with DeepMimic because I don't want to change everything halfway :)
Also, if I recall correctly, they use a different format for clips (3d joints instead of quaternions maybe?), so I would have to adapt the tracking algorithm to that as well.
They use the more standard BVH format instead of the custom format used in DeepMimic, and they have code to convert it to character poses in reduced coordinates to supply to pybullet.
Btw, I couldn't help but notice that in deep_mimic_env.py you calculate the reward before applying the new action. Is that intentional?
reward = self._internal_env.calc_reward(agent_id)
# Apply control action
self._internal_env.set_action(agent_id, action)
I don't think it would actually make a huge difference, but it seemed a bit odd.
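For comparison, the "reward after action" ordering would look roughly like this (a sketch only: set_action and calc_reward are the calls from the snippet above, while update, record_state, is_episode_end and their arguments are assumptions about the internal env's API):
# Sketch of computing the reward after the action has been applied and the
# simulation advanced; method names other than set_action/calc_reward are
# assumptions about the wrapped env's interface.
def step_with_post_action_reward(internal_env, agent_id, action, timestep=1.0 / 240.0):
    internal_env.set_action(agent_id, action)    # apply the control action first
    internal_env.update(timestep)                # advance the physics simulation (assumed API)
    reward = internal_env.calc_reward(agent_id)  # reward now reflects the post-action state
    state = internal_env.record_state(agent_id)  # assumed API
    done = internal_env.is_episode_end()         # assumed API
    return state, reward, done, {}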
That's something I'm not 100% sure about. DeepMimic's interaction loop is pretty non-standard and it's hard to tell when the rewards are calculated: I think it's with respect to the current state s_t, before applying the action a_t (and getting to state s_{t+1}), rather than afterwards.
IMO either one works as long as you make sure the reference pose you're comparing the state to is the right one (same time step).
ScaDiver computes rewards w.r.t. the state at time t (using state data from before applying the action):
https://github.com/facebookresearch/ScaDiver/blob/96001537f9ab2eddfe871b78807923a30f7d012f/env_humanoid_base.py#L368-L385
I tried to train the character using the hyperparams given by @ManifoldFR in #3076.
However, after 60 million steps the character averages a reward of ~300-350, and when I test it, the character walks by always moving the same foot and then dragging the other one.
Here are my training and enjoy scripts:
train
enjoy
In deep_mimic_env.py I modified the action space by using a FakeBox class that inherits from gym.spaces.Box.
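Since the FakeBox itself isn't shown above, here is a purely illustrative sketch of subclassing gym.spaces.Box (the relaxed contains() override and the example shape are assumptions, not necessarily what the real class does):
# Illustrative guess only: demonstrates subclassing gym.spaces.Box with a
# contains() that accepts any correctly-shaped action regardless of the bounds.
import numpy as np
from gym import spaces

class FakeBox(spaces.Box):
    def contains(self, x):
        x = np.asarray(x, dtype=self.dtype)
        return x.shape == self.shape  # ignore the [low, high] bounds

# e.g. swap it into the env (the shape is just an example):
# self.action_space = FakeBox(low=-1.0, high=1.0, shape=(36,), dtype=np.float32)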