Reference Code : gym-ddpg-keras (DDPG)
Keras implementation of TD3 (Twin Delayed Deep Deterministic Policy Gradient) with a PER (Prioritized Experience Replay) option on the OpenAI Gym framework
STATUS : IN PROGRESS
This branch is for debugging only; switch to the main branch instead.
Test on Simulation
Network Model & Hyperparameter
Differences from DDPG
Target policy smoothing is implemented by adding noise ε ~ N(0, 0.2), clipped to (−0.5, 0.5), to the actions chosen by the target actor network (see the sketch after this list).
Delayed policy updates consist of only updating the actor and target critic network every d iterations, with d = 2.
(While a larger d would result in a larger benefit with respect to accumulating errors, for fair comparison, the critics are only trained once per time step, and training the actor for too few iterations would cripple learning.)
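Below is a minimal, illustrative sketch of these two tricks (target policy smoothing and delayed policy updates) in a TD3-style update step with Keras/TensorFlow. The network sizes, optimizers, and names (actor, critic_1, target_actor, train_step, POLICY_DELAY, ...) are assumptions made for the example and do not necessarily match this repository's code.

# Hypothetical sketch only: tiny networks and a single update function,
# not this repository's actual implementation.
import numpy as np
import tensorflow as tf
from tensorflow import keras

STATE_DIM, ACTION_DIM = 3, 1
GAMMA, TAU, POLICY_DELAY = 0.99, 0.005, 2     # d = 2 policy delay
NOISE_STD, NOISE_CLIP = 0.2, 0.5              # smoothing noise N(0, 0.2), clipped to (-0.5, 0.5)

def make_actor():
    return keras.Sequential([keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
                             keras.layers.Dense(ACTION_DIM, activation="tanh")])

def make_critic():
    return keras.Sequential([keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM + ACTION_DIM,)),
                             keras.layers.Dense(1)])

actor, target_actor = make_actor(), make_actor()
critic_1, critic_2 = make_critic(), make_critic()
target_critic_1, target_critic_2 = make_critic(), make_critic()
for net, target in [(actor, target_actor), (critic_1, target_critic_1), (critic_2, target_critic_2)]:
    target.set_weights(net.get_weights())
actor_opt, critic_opt = keras.optimizers.Adam(1e-3), keras.optimizers.Adam(1e-3)

def soft_update(target, source):
    # Polyak averaging of the target network weights
    target.set_weights([TAU * w + (1.0 - TAU) * tw
                        for w, tw in zip(source.get_weights(), target.get_weights())])

def train_step(step, states, actions, rewards, next_states, dones):
    # target policy smoothing: add clipped Gaussian noise to the target action
    noise = tf.clip_by_value(tf.random.normal(actions.shape, stddev=NOISE_STD), -NOISE_CLIP, NOISE_CLIP)
    next_actions = tf.clip_by_value(target_actor(next_states) + noise, -1.0, 1.0)

    # clipped double-Q target: take the smaller of the two target critic estimates
    target_in = tf.concat([next_states, next_actions], axis=1)
    target_q = tf.minimum(target_critic_1(target_in), target_critic_2(target_in))
    y = rewards + GAMMA * (1.0 - dones) * target_q

    # both critics are trained at every time step
    with tf.GradientTape() as tape:
        critic_in = tf.concat([states, actions], axis=1)
        critic_loss = (tf.reduce_mean(tf.square(y - critic_1(critic_in))) +
                       tf.reduce_mean(tf.square(y - critic_2(critic_in))))
    critic_vars = critic_1.trainable_variables + critic_2.trainable_variables
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic_vars), critic_vars))

    # delayed policy updates: actor and target networks only every POLICY_DELAY steps
    if step % POLICY_DELAY == 0:
        with tf.GradientTape() as tape:
            actor_loss = -tf.reduce_mean(critic_1(tf.concat([states, actor(states)], axis=1)))
        actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                      actor.trainable_variables))
        soft_update(target_actor, actor)
        soft_update(target_critic_1, critic_1)
        soft_update(target_critic_2, critic_2)

# example call with a random float32 mini-batch of size 4
s = np.random.rand(4, STATE_DIM).astype(np.float32)
a = np.random.rand(4, ACTION_DIM).astype(np.float32)
r = np.random.rand(4, 1).astype(np.float32)
d = np.zeros((4, 1), dtype=np.float32)
for step in range(2):
    train_step(step, s, a, r, np.random.rand(4, STATE_DIM).astype(np.float32), d)

Note that both critics are updated at every step; only the actor and the target networks are held back to every POLICY_DELAY-th step, as described above.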
Exploration
To remove the dependency on the initial parameters of the policy, we use a purely exploratory policy for the first 10,000 time steps of stable-length environments.
Afterwards, we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action, as sketched below.
(We found that noise drawn from the Ornstein-Uhlenbeck process offered no performance benefit.)
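A minimal sketch of this exploration scheme, assuming a hypothetical Keras actor and a Gym environment; the environment name, network, and helper names are illustrative only and are not taken from this repository.

import gym
import numpy as np
from tensorflow import keras

START_TIMESTEPS = 10000        # purely exploratory phase length
EXPL_NOISE_STD = 0.1           # Gaussian exploration noise N(0, 0.1)

env = gym.make("Pendulum-v1")  # example environment only (id may differ by gym version)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

# placeholder actor; the real network is defined elsewhere
actor = keras.Sequential([keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
                          keras.layers.Dense(action_dim, activation="tanh")])

def select_action(state, total_steps):
    if total_steps < START_TIMESTEPS:
        # ignore the (randomly initialised) policy entirely at the start
        return env.action_space.sample()
    action = max_action * actor(state[None, :].astype(np.float32)).numpy()[0]
    action += np.random.normal(0.0, EXPL_NOISE_STD * max_action, size=action_dim)
    return np.clip(action, -max_action, max_action)

# example: a random action early in training, a noisy policy action later on
a_early = select_action(np.zeros(state_dim, dtype=np.float32), total_steps=500)
a_later = select_action(np.zeros(state_dim, dtype=np.float32), total_steps=20000)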
Evaluation
virtualenv
# install virtualenv module
sudo apt-get install python3-pip
sudo pip3 install virtualenv
# create a virtual environment named venv
virtualenv venv
# activate the environment
source venv/bin/activate
To exit the environment, run deactivate.
pip install -r requirements.txt
# training
python train.py
[1] Addressing Function Approximation Error in Actor-Critic Methods
@misc{fujimoto2018addressing,
  title={Addressing Function Approximation Error in Actor-Critic Methods},
  author={Scott Fujimoto and Herke van Hoof and David Meger},
  year={2018},
  eprint={1802.09477},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
[3] sfujim/TD3