Hello!

Thank you for the excellent library. I may have found a bug in how `frame` is tracked across training. It comes from where the `frame = self.frame // self.num_agents` update is inserted, which differs between `ContinuousA2CBase.train()` and `DiscreteA2CBase.train()`.

In `ContinuousA2CBase.train()`, the update is inserted before `self.frame += curr_frames`, which I believe is incorrect. In `DiscreteA2CBase.train()`, the update is inserted after `self.frame += curr_frames`, which I believe is correct.

After one iteration of PPO training with `num_envs=512` and `horizon_length=16`, `ContinuousA2CBase.train()` prints out:

After modifying the update to match `DiscreteA2CBase.train()`, the print out is:
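To make the off-by-one-iteration effect concrete, here is a minimal sketch (assumed variable names and values; not the library's actual code) of the two update orders, assuming `num_agents=1` so each iteration collects `curr_frames = num_envs * horizon_length = 512 * 16 = 8192` frames:

```python
num_agents = 1
curr_frames = 512 * 16  # num_envs * horizon_length = 8192 frames per iteration

# Order as in ContinuousA2CBase.train(): read the counter before incrementing.
self_frame = 0
frame_before = self_frame // num_agents  # reported frame count for iteration 1
self_frame += curr_frames

# Order as in DiscreteA2CBase.train(): increment first, then read.
self_frame = 0
self_frame += curr_frames
frame_after = self_frame // num_agents   # reported frame count for iteration 1

print(frame_before, frame_after)  # 0 8192
```

With the "read before increment" order, the first iteration reports 0 frames even though 8192 were collected, and every subsequent report lags one iteration behind; the "increment first" order reports the frames actually consumed so far.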