Open CUN-bjy opened 3 years ago
[TEST 1]
[TEST 2]
[TEST 3]
[TEST 4]
[TEST 5]
[TEST 6]
The double clipped Q update has some problems. [TEST 7]
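For reference, a minimal sketch of what the clipped double-Q target is supposed to compute (function and variable names here are my assumptions, not taken from the repo): the TD target uses the minimum of the two target critics' estimates.

```python
import numpy as np

# Hypothetical sketch of TD3's clipped double-Q target:
# y = r + gamma * (1 - done) * min(Q1'(s', a'), Q2'(s', a'))
def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    # take the element-wise minimum of the two target-critic estimates
    q_min = np.minimum(q1_next, q2_next)
    # mask out the bootstrap term on terminal transitions
    return reward + gamma * (1.0 - done) * q_min

# e.g. reward 1.0, non-terminal, next-state estimates 10.0 and 8.0
y = td3_target(np.array([1.0]), np.array([0.0]),
               np.array([10.0]), np.array([8.0]))
# y[0] == 1.0 + 0.99 * 8.0 == 8.92
```

If the implementation accidentally uses the maximum, or bootstraps through terminal states, the critic target is inflated, which would match the symptoms described here.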
[TEST 8]
[TEST 9]
[TEST 10]
before

```python
a = agent.make_action(obs, t)
action = np.argmax(a) if is_discrete else a
# do a step on gym at timestep t
new_obs, reward, done, info = env.step(action)
# store the results in the buffer
agent.memorize(obs, a, reward, done, new_obs)
# should've memorized the action w/ noise!!
```
after

```python
a = agent.make_action(obs, t)
action = np.argmax(a) if is_discrete else a
# do a step on gym at timestep t
new_obs, reward, done, info = env.step(action)
# store the results in the buffer
agent.memorize(obs, action, reward, done, new_obs)
```
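The point of the fix above, as I read it, is that the replay buffer should store the action that `env.step` actually received, not the raw (pre-argmax) policy output. A minimal sketch of that idea, with hypothetical names not taken from the repo:

```python
import numpy as np

# Hypothetical sketch: the stored transition must contain the action
# that was actually executed, so the critic learns Q(s, a) for
# executed actions.
def select_action(raw_output, is_discrete):
    # raw_output is the agent's (possibly noisy) policy output;
    # for discrete envs, the executed action is its argmax index
    return int(np.argmax(raw_output)) if is_discrete else raw_output

raw = np.array([0.1, 0.7, 0.2])           # noisy policy output
action = select_action(raw, is_discrete=True)
# env.step(action); agent.memorize(obs, action, ...)  <- store `action`, not `raw`
```

Storing `raw` instead of `action` makes the buffered transition inconsistent with the reward and next state the environment returned for `action`.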
But, even with this change, it still doesn't work.
[TEST 11]
This also doesn't work.
[TEST 12]
The first TD3 implementation does not work well, so I have to analyze each part of the differences from DDPG.
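One TD3-vs-DDPG difference worth checking separately from the clipped double-Q update is target policy smoothing (the third difference being delayed policy updates). A minimal sketch under assumed parameter names:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch of TD3's target policy smoothing: add clipped
# Gaussian noise to the target policy's action before evaluating the
# target critics, then clip to the action bounds.
def smoothed_target_action(mu_target, noise_std=0.2, noise_clip=0.5,
                           act_limit=1.0):
    # clipped noise keeps the perturbed action close to mu_target
    noise = np.clip(rng.normal(0.0, noise_std, size=mu_target.shape),
                    -noise_clip, noise_clip)
    # final clip keeps the action inside the environment's bounds
    return np.clip(mu_target + noise, -act_limit, act_limit)

a_next = smoothed_target_action(np.array([0.9, -0.4]))
```

DDPG evaluates the target critic at the raw target-policy action, so dropping this noise (or forgetting either clip) is a common place for a TD3 port of DDPG code to diverge.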