alexis-jacq / Pytorch-DPPO

Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286
MIT License

Failed in more complex environment #6

Closed. kkjh0723 closed this issue 6 years ago.

kkjh0723 commented 6 years ago

Thanks for sharing the code. I confirmed that it works on InvertedPendulum-v1.

But when I changed the environment to Ant-v1 without changing any other parameters, the agent fails to learn, as shown in the log below. Do I need to change some parameters?

Time 00h 01m 01s, episode reward -3032.25671304008, episode length 1000
Time 00h 02m 01s, episode reward -99.15254692012928, episode length 25
Time 00h 03m 01s, episode reward -41.27665909454931, episode length 14
Time 00h 04m 01s, episode reward -39.077425184658665, episode length 17
Time 00h 05m 02s, episode reward -136.60746428384076, episode length 45
Time 00h 06m 02s, episode reward -111.40062667574634, episode length 40
Time 00h 07m 02s, episode reward -516.1070385678166, episode length 169
Time 00h 08m 02s, episode reward -129.64627338344073, episode length 42
Time 00h 09m 02s, episode reward -146.55425861577797, episode length 45
Time 00h 10m 03s, episode reward -253.41361049200614, episode length 86
Time 00h 11m 03s, episode reward -108.6953450777496, episode length 38
Time 00h 12m 03s, episode reward -64.66194807902957, episode length 16
Time 00h 13m 03s, episode reward -33.51695185844647, episode length 11
Time 00h 14m 03s, episode reward -86.88904449639067, episode length 35
Time 00h 15m 03s, episode reward -78.48049851223362, episode length 23
Time 00h 16m 03s, episode reward -165.73681903021165, episode length 61
Time 00h 17m 04s, episode reward -155.3555664457943, episode length 60
Time 00h 18m 04s, episode reward -57.65249942070945, episode length 20
Time 00h 19m 04s, episode reward -392.10161323743887, episode length 109
Time 00h 20m 04s, episode reward -55.63287075930159, episode length 12
Time 00h 21m 04s, episode reward -81.0448173961397, episode length 29
Time 00h 22m 04s, episode reward -149.84827826419726, episode length 52
Time 00h 23m 04s, episode reward -398.0365800924663, episode length 22
Time 00h 24m 05s, episode reward -1948.6136580594682, episode length 17
Time 00h 25m 05s, episode reward -18719.08471382285, episode length 51
Time 00h 26m 06s, episode reward -805145.8854457787, episode length 1000
Time 00h 27m 06s, episode reward -17008.04843510176, episode length 17
Time 00h 28m 07s, episode reward -168769.34038655, episode length 129
Time 00h 29m 07s, episode reward -104933.08883886453, episode length 79
Time 00h 30m 07s, episode reward -22809.687035617088, episode length 17
Time 00h 31m 07s, episode reward -46398.71530676861, episode length 37
Time 00h 32m 07s, episode reward -18513.064083079746, episode length 15
Time 00h 33m 07s, episode reward -21329.411481710402, episode length 15
Time 00h 34m 09s, episode reward -1393903.341478124, episode length 1000
Time 00h 35m 10s, episode reward -1374988.6133415946, episode length 1000
Time 00h 36m 10s, episode reward -33792.40522011441, episode length 28
Time 00h 37m 10s, episode reward -20629.94697013807, episode length 16
Time 00h 38m 10s, episode reward -39780.93399623488, episode length 29
Time 00h 39m 10s, episode reward -61722.81635309537, episode length 47
Time 00h 40m 10s, episode reward -46780.12455378964, episode length 36
Time 00h 41m 10s, episode reward -91640.36757206521, episode length 73
Time 00h 42m 11s, episode reward -77137.71004513587, episode length 63
Time 00h 43m 11s, episode reward -15184.611248485926, episode length 10
Time 00h 44m 11s, episode reward -26995.023495691694, episode length 20
Time 00h 45m 11s, episode reward -110371.66228435331, episode length 81
Time 00h 46m 11s, episode reward -55639.738879114084, episode length 41
Time 00h 47m 11s, episode reward -53735.2616539847, episode length 39
Time 00h 48m 11s, episode reward -60755.49631228513, episode length 43
Time 00h 49m 11s, episode reward -29466.664499076247, episode length 23
Time 00h 50m 12s, episode reward -48580.31395829051, episode length 37
Time 00h 51m 12s, episode reward -128957.8903571858, episode length 99
Time 00h 52m 12s, episode reward -70144.76359014906, episode length 51
Time 00h 53m 12s, episode reward -29271.097255889938, episode length 21
Time 00h 54m 12s, episode reward -21737.6644599086, episode length 17
Time 00h 55m 12s, episode reward -27549.40889570978, episode length 20
Time 00h 56m 12s, episode reward -97097.66966694668, episode length 77
Time 00h 57m 13s, episode reward -18384.51761876518, episode length 14
Time 00h 58m 13s, episode reward -28424.585660954337, episode length 22
Time 00h 59m 13s, episode reward -96267.24448946006, episode length 72
Time 01h 00m 13s, episode reward -79794.54738721657, episode length 60
Time 01h 01m 13s, episode reward -88486.88046448736, episode length 64
Time 01h 02m 13s, episode reward -31071.50782185118, episode length 24
Time 01h 03m 13s, episode reward -53608.97197643964, episode length 38
Time 01h 04m 14s, episode reward -38451.031800392186, episode length 27
Time 01h 05m 14s, episode reward -27645.787896926682, episode length 20
alexis-jacq commented 6 years ago

Indeed, I was doing something wrong by using the action probabilities given by the 'old' models. I corrected this error in my PPO implementation, and now it seems to learn in all environments. I'm going to update everything.
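
For reference, here is a minimal sketch of how the clipped PPO surrogate is usually computed in PyTorch (function name and arguments are illustrative, not the repository's actual code): the probability ratio should use the current policy's log-probabilities in the numerator and the detached log-probabilities of the 'old' policy that collected the data in the denominator. Mixing these up, as described above, gives an incorrect update.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (illustrative sketch).

    new_log_probs: log pi_theta(a|s), re-evaluated by the policy being optimized
    old_log_probs: log pi_theta_old(a|s), from the policy that collected the data
                   (treated as constants via detach)
    advantages:    estimated advantages for the sampled actions
    """
    # Ratio pi_theta(a|s) / pi_theta_old(a|s); old log-probs must not carry gradients.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because we minimize the loss but want to maximize the surrogate.
    return -torch.min(surr1, surr2).mean()
```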