StepNeverStop / RLs

Reinforcement Learning Algorithms Based on PyTorch
https://stepneverstop.github.io
Apache License 2.0

Change algorithms AC, A2C #9

Closed: kmakeev closed this 5 years ago

kmakeev commented 5 years ago

Windows 10 64bit

```
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import tensorflow_probability as tfp
>>> tf.__version__
'2.0.0'
>>> tfp.__version__
'0.8.0'
```

I think the problem is that the `self.log_std` variable is used even when `self.action_type == 'continuous'` is not True. I made corrections to the AC and A2C algorithms without major code changes; a sketch of the idea follows.
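For reference, a minimal sketch of that guard (the class name, layer sizes, and method names here are illustrative, not the repository's exact code): `log_std` is created and used only on the continuous branch, so a discrete policy never touches it.

```python
import tensorflow as tf
import tensorflow_probability as tfp

class Actor(tf.keras.Model):
    # Illustrative actor network: Gaussian policy for continuous action
    # spaces, categorical policy for discrete ones.
    def __init__(self, action_dim, action_type, hidden_units=(64, 64)):
        super().__init__()
        self.action_type = action_type
        self.hidden = [tf.keras.layers.Dense(u, activation='tanh') for u in hidden_units]
        self.out = tf.keras.layers.Dense(action_dim)
        if self.action_type == 'continuous':
            # state-independent learned log standard deviation; only
            # defined when the action space is continuous
            self.log_std = tf.Variable(tf.zeros(action_dim), trainable=True)

    def policy(self, s):
        x = s
        for layer in self.hidden:
            x = layer(x)
        out = self.out(x)
        if self.action_type == 'continuous':
            mu = tf.tanh(out)
            return tfp.distributions.Normal(mu, tf.exp(self.log_std))
        # discrete branch never references self.log_std
        return tfp.distributions.Categorical(logits=out)
```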

If accepted, I can fix it in all the algorithms.

The variables `buffer_size = 10000`, `n_step = False`, and `use_priority = False` were added because they are passed during initialization but were previously unused; see the sketch below.
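A hedged sketch of what accepting those arguments could look like (this constructor signature is hypothetical, not the repository's exact one): the shared initialization interface passes these parameters to every algorithm, so an on-policy method like A2C has to accept them even though it ignores them.

```python
class A2C:
    # Hypothetical constructor: buffer_size / n_step / use_priority are
    # accepted because the common training loop passes them to every
    # algorithm, even though on-policy A2C has no replay buffer.
    def __init__(self, s_dim, a_dim, action_type,
                 buffer_size=10000, n_step=False, use_priority=False, **kwargs):
        self.s_dim = s_dim
        self.a_dim = a_dim
        self.action_type = action_type
        # stored but unused by A2C itself
        self.buffer_size = buffer_size
        self.n_step = n_step
        self.use_priority = use_priority
```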

Checked:

```
run.py --gym -a a2c -n train_using_gym -g --gym-env Pendulum-v0 -c pendium.yaml --render-episode 10 --gym-agents 10
```


```
Episode: 0 | step: 2000 | last_done_step 200 | rewards: [-1468.50164152 -1811.11490323 -1806.3436553 -1574.70802813 -1742.46361236 -1415.58934203 -1799.47012725 -1696.38476904 -1192.32592293 -1721.64378872]
Save checkpoint success. Episode: 0
Episode: 1 | step: 2000 | last_done_step 200 | rewards: [-1434.99519383 -1448.44176266 -1393.00592602 -1354.65056504 -1427.48830281 -1443.8693746 -1189.99389219 -1313.95160696 -1382.81534188 -1449.26688975]
Episode: 2 | step: 2000 | last_done_step 200 | rewards: [-1491.32792793 -1489.42480822 -1492.75892733 -1536.39978863 -1501.01724587 -1457.09673399 -1500.85537643 -1446.80477618 -1448.72687266 -1479.68659308]
Episode: 3 | step: 2000 | last_done_step 200 | rewards: [-1504.2449832 -1465.19519976 -1523.04100326 -1478.21385386 -1482.52474363 -1501.57386045 -1514.2597484 -1520.82455499 -1509.38399946 -1484.84079824]
Episode: 4 | step: 2000 | last_done_step 200 | rewards: [-1451.32146037 -1533.18187006 -1411.46303724 -1503.77129072 -1497.27813524 -1526.85406816 -1448.59529423 -1302.19286231 -1422.33973501 -1516.66485667]
Episode: 5 | step: 2000 | last_done_step 200 | rewards: [-1347.96713201 -1520.10352601 -1466.30637285 -1524.85140392 -1481.17077353 -1510.99593459 -1465.25419643 -1441.60844851 -1535.04824137 -1506.24829903]
Episode: 6 | step: 2000 | last_done_step 200 | rewards: [-1512.49585936 -1446.00358618 -1501.95955225 -1419.98276059 -1492.19823679 -1494.31697722 -1509.63944632 -1463.83733884 -1497.31158038 -1495.92954845]
Episode: 7 | step: 2000 | last_done_step 200 | rewards: [-1365.00672581 -1407.10549583 -1433.48233532 -1489.06268359 -1494.56518857 -1375.42711696 -1464.86018874 -1313.95134751 -1451.31757617 -1444.46398607]
Episode: 8 | step: 2000 | last_done_step 200 | rewards: [-1456.69334411 -1376.85040123 -1506.97755202 -1378.27739043 -1341.31655843 -1487.01453735 -1498.4909909 -1399.48425679 -1473.79647552 -1359.5997454 ]
Episode: 9 | step: 2000 | last_done_step 200 | rewards: [-1472.41701912 -1403.38661446 -1333.11979492 -1326.89821018 -1339.76717001 -1435.50663919 -1427.75070538 -1342.8987578 -1452.51951221 -1334.61260788]
Episode: 10 | step: 2000 | last_done_step 200 | rewards: [-1440.81017002 -1414.55493642 -1469.58791484 -1364.42337724 -1480.74832791 -1333.45897376 -1371.33415953 -1331.4386031 -1376.06845363 -1340.78222536]
Episode: 11 | step: 2000 | last_done_step 200 | rewards: [-1242.44722826 -1251.16276341 -1296.33423968 -1327.6753765 -1329.44337323 -1177.05095447 -1354.3813774 -1315.46196407 -1370.89072081 -1352.95899707]
Episode: 12 | step: 2000 | last_done_step 200 | rewards: [-1364.85903723 -1185.04604059 -1187.48685057 -1236.54148324 -1319.00133781 -1272.39473315 -1331.45757119 -1349.07392349 -1250.12191444 -1363.90312902]
Episode: 13 | step: 2000 | last_done_step 200 | rewards: [-1264.12283138 -1470.52539401 -1186.26185611 -1267.3347617 -1282.269064 -1328.80208855 -1103.50631982 -1200.78932116 -1329.30791345 -1321.31298649]
Episode: 14 | step: 2000 | last_done_step 200 | rewards: [-1163.37060511 -1170.54238032 -1187.86674456 -1217.44113964 -1393.64141862 -1251.80047167 -1149.25162074 -1268.18146292 -1293.30130757 -1189.30892555]
```

Best result!

pendium.yaml:

```yaml
actor_lr: 0.0005
batch_size: 32
beta: 0.001
critic_lr: 0.001
epoch: 4
epsilon: 0.2
gamma: 0.99
hidden_units:
  actor_continuous:
```

```
run.py --gym -a a2c -n train_using_gym -g --gym-env CartPole-v1 --render-episode 10 --gym-agents 10
```


```
Episode: 0 | step: 2000 | last_done_step 54 | rewards: [17. 12. 12. 16. 17. 13. 25. 22. 54. 13.]
Save checkpoint success. Episode: 0
Episode: 1 | step: 2000 | last_done_step 68 | rewards: [20. 30. 26. 62. 24. 68. 35. 18. 40. 61.]
Episode: 2 | step: 2000 | last_done_step 102 | rewards: [ 84. 102. 19. 42. 27. 28. 23. 41. 91. 67.]
Episode: 3 | step: 2000 | last_done_step 79 | rewards: [31. 30. 47. 32. 79. 57. 45. 23. 44. 45.]
Episode: 4 | step: 2000 | last_done_step 132 | rewards: [ 62. 76. 59. 64. 35. 94. 75. 59. 132. 28.]
Episode: 5 | step: 2000 | last_done_step 297 | rewards: [ 77. 102. 75. 63. 187. 297. 138. 33. 132. 30.]
Episode: 6 | step: 2000 | last_done_step 278 | rewards: [135. 140. 194. 278. 185. 83. 219. 179. 22. 117.]
Episode: 7 | step: 2000 | last_done_step 262 | rewards: [127. 133. 262. 236. 213. 252. 126. 107. 238. 208.]
Episode: 8 | step: 2000 | last_done_step 315 | rewards: [181. 77. 259. 38. 69. 240. 285. 194. 182. 315.]
Episode: 9 | step: 2000 | last_done_step 429 | rewards: [429. 179. 382. 246. 242. 275. 212. 242. 253. 239.]
Episode: 10 | step: 2000 | last_done_step 324 | rewards: [241. 279. 324. 150. 278. 242. 229. 244. 312. 269.]
Episode: 11 | step: 2000 | last_done_step 500 | rewards: [273. 207. 285. 359. 500. 111. 500. 270. 133. 386.]
Episode: 12 | step: 2000 | last_done_step 500 | rewards: [297. 500. 486. 500. 187. 500. 270. 361. 347. 255.]
```

Best result!

StepNeverStop commented 5 years ago

thx! After it's merged, I'll make some small changes myself.