PaddlePaddle / PARL

A high-performance distributed training framework for Reinforcement Learning
https://parl.readthedocs.io/
Apache License 2.0

Why does the example from Teacher Keke's course work fine with the CPU version of paddle, while with the GPU version the reward stays at a low value? #934

Closed OutSpace00 closed 2 years ago

OutSpace00 commented 2 years ago

Why does the example from Teacher Keke's course work fine with the CPU version of paddle, while with the GPU version the reward stays at a low value?

TomorrowIsAnOtherDay commented 2 years ago

Which example is it?

OutSpace00 commented 2 years ago

The DQN example, and the DDPG example doesn't work either.

TomorrowIsAnOtherDay commented 2 years ago

Does the DQN example behave the same if you run it a few more times? Reinforcement learning relies on random exploration to find better rewards and thereby improve the policy, so there is some inherent randomness; sometimes the reward climbs after a few more runs.
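The random exploration mentioned here is the epsilon-greedy rule the DQN example uses for action selection. A minimal sketch (the `sample_action` helper and the Q-values are made up for illustration, not taken from the PARL code):

```python
import random

def sample_action(q_values, e_greed):
    """With probability e_greed pick a uniformly random action,
    otherwise pick the action with the highest Q-value."""
    if random.random() < e_greed:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)
actions = [sample_action([0.2, 0.8], e_greed=0.1) for _ in range(1000)]
# Mostly the greedy action (index 1), with occasional random exploration.
print(actions.count(1) / len(actions))
```

Because of this randomness, two runs with different seeds can take noticeably different numbers of episodes to reach a good reward.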

OutSpace00 commented 2 years ago

Yes. At the very start the test reward sometimes randomly lands on a fairly high value like 50 (max score 200, in the CartPole-v0 game), but after one round of training it drops to about 30. I have run it several times and it is always like this.

OutSpace00 commented 2 years ago

However, the CPU version of paddle works fine; I have tried it back and forth 3 times. Do I need to change some code?

OutSpace00 commented 2 years ago

Here are the results of running DQN on GPU (max score 200):

[07-20 14:09:46 MainThread @train.py:125] episode:50 e_greed:0.09930499999999931 Test reward:9.0
[07-20 14:09:47 MainThread @train.py:125] episode:100 e_greed:0.0987969999999988 Test reward:9.4
[07-20 14:09:48 MainThread @train.py:125] episode:150 e_greed:0.09830899999999831 Test reward:9.8
[07-20 14:09:49 MainThread @train.py:125] episode:200 e_greed:0.0977949999999978 Test reward:9.4
[07-20 14:09:50 MainThread @train.py:125] episode:250 e_greed:0.0972929999999973 Test reward:9.6
[07-20 14:09:52 MainThread @train.py:125] episode:300 e_greed:0.0967949999999968 Test reward:9.4
[07-20 14:09:53 MainThread @train.py:125] episode:350 e_greed:0.09630299999999631 Test reward:9.4
[07-20 14:09:54 MainThread @train.py:125] episode:400 e_greed:0.0957939999999958 Test reward:8.8
[07-20 14:09:55 MainThread @train.py:125] episode:450 e_greed:0.0952879999999953 Test reward:9.8
[07-20 14:09:56 MainThread @train.py:125] episode:500 e_greed:0.0947969999999948 Test reward:9.0
[07-20 14:09:58 MainThread @train.py:125] episode:550 e_greed:0.09431099999999432 Test reward:9.8
[07-20 14:09:59 MainThread @train.py:125] episode:600 e_greed:0.0937959999999938 Test reward:9.2
[07-20 14:10:00 MainThread @train.py:125] episode:650 e_greed:0.09330899999999331 Test reward:9.4
[07-20 14:10:01 MainThread @train.py:125] episode:700 e_greed:0.0927979999999928 Test reward:9.2
[07-20 14:10:03 MainThread @train.py:125] episode:750 e_greed:0.09230899999999231 Test reward:9.2
[07-20 14:10:04 MainThread @train.py:125] episode:800 e_greed:0.09182899999999183 Test reward:9.2
[07-20 14:10:05 MainThread @train.py:125] episode:850 e_greed:0.09133599999999134 Test reward:9.4
[07-20 14:10:06 MainThread @train.py:125] episode:900 e_greed:0.09083899999999084 Test reward:9.4
[07-20 14:10:07 MainThread @train.py:125] episode:950 e_greed:0.09035599999999036 Test reward:10.2
[07-20 14:10:09 MainThread @train.py:125] episode:1000 e_greed:0.08986599999998987 Test reward:9.6
[07-20 14:10:10 MainThread @train.py:125] episode:1050 e_greed:0.08936199999998937 Test reward:9.0
[07-20 14:10:11 MainThread @train.py:125] episode:1100 e_greed:0.08887799999998888 Test reward:9.2
[07-20 14:10:12 MainThread @train.py:125] episode:1150 e_greed:0.0883909999999884 Test reward:9.2
[07-20 14:10:13 MainThread @train.py:125] episode:1200 e_greed:0.0878979999999879 Test reward:9.8
[07-20 14:10:15 MainThread @train.py:125] episode:1250 e_greed:0.08741499999998742 Test reward:8.8
[07-20 14:10:16 MainThread @train.py:125] episode:1300 e_greed:0.08692899999998693 Test reward:9.6
[07-20 14:10:17 MainThread @train.py:125] episode:1350 e_greed:0.08643799999998644 Test reward:9.4
[07-20 14:10:18 MainThread @train.py:125] episode:1400 e_greed:0.08596299999998597 Test reward:9.2
[07-20 14:10:19 MainThread @train.py:125] episode:1450 e_greed:0.08547799999998548 Test reward:9.2
[07-20 14:10:20 MainThread @train.py:125] episode:1500 e_greed:0.08497999999998498 Test reward:9.8
[07-20 14:10:22 MainThread @train.py:125] episode:1550 e_greed:0.08448599999998449 Test reward:9.2
[07-20 14:10:23 MainThread @train.py:125] episode:1600 e_greed:0.08397399999998398 Test reward:9.8
[07-20 14:10:24 MainThread @train.py:125] episode:1650 e_greed:0.0834989999999835 Test reward:9.2
[07-20 14:10:25 MainThread @train.py:125] episode:1700 e_greed:0.082997999999983 Test reward:9.2
[07-20 14:10:27 MainThread @train.py:125] episode:1750 e_greed:0.08250899999998251 Test reward:9.4
[07-20 14:10:28 MainThread @train.py:125] episode:1800 e_greed:0.08200899999998201 Test reward:10.2
[07-20 14:10:29 MainThread @train.py:125] episode:1850 e_greed:0.08152099999998152 Test reward:9.2
[07-20 14:10:30 MainThread @train.py:125] episode:1900 e_greed:0.08103199999998104 Test reward:10.0
[07-20 14:10:31 MainThread @train.py:125] episode:1950 e_greed:0.08054399999998055 Test reward:9.4
[07-20 14:10:33 MainThread @train.py:125] episode:2000 e_greed:0.08005699999998006 Test reward:9.2
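As a side check, the logged e_greed values can be inverted to count how many environment steps have elapsed. A minimal sketch, assuming the schedule commonly used in the PARL DQN tutorial (e_greed starts at 0.1 and shrinks by 1e-6 per environment step, floored at 0.01); the schedule is an assumption here, not read from this log:

```python
# Assumed exploration schedule: e_greed starts at 0.1 and decreases
# by 1e-6 on every environment step.
E_GREED_START = 0.1
E_GREED_DECREMENT = 1e-6

def steps_taken(e_greed):
    """Back out how many environment steps produced a logged e_greed value."""
    return round((E_GREED_START - e_greed) / E_GREED_DECREMENT)

# e_greed 0.09930499999999931 at episode 50 -> about 695 steps in total,
# i.e. roughly 14 steps per episode: the pole falls almost immediately,
# which is consistent with the flat test reward of ~9.
print(steps_taken(0.09930499999999931))
```

Under that assumption the log shows the agent is exploring but not improving at all, which points at the training update rather than the exploration schedule.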

TomorrowIsAnOtherDay commented 2 years ago

Thanks for the feedback; we'll try it on GPU.

ljy2222 commented 2 years ago

@OutSpace00 We trained the DQN example with the GPU version of paddle and it converged normally. Perhaps you could increase max_episode and see whether it converges. The experimental results are below; hope this helps:

[07-21 17:54:04 MainThread @utils.py:73] paddlepaddle version: 1.8.5.
[07-21 17:54:04 MainThread @machine_info.py:88] nvidia-smi -L found gpu count: 1
W0721 17:54:04.508450 3348 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0
W0721 17:54:04.825601 3348 device_context.cc:260] device: 0, cuDNN Version: 7.6.
[07-21 17:54:08 MainThread @train.py:129] episode:50 e_greed:0.0992959999999993 Test reward:9.8
[07-21 17:54:10 MainThread @train.py:129] episode:100 e_greed:0.0987949999999988 Test reward:9.4
[07-21 17:54:12 MainThread @train.py:129] episode:150 e_greed:0.0982989999999983 Test reward:8.8
[07-21 17:54:14 MainThread @train.py:129] episode:200 e_greed:0.09777799999999778 Test reward:10.8
[07-21 17:54:16 MainThread @train.py:129] episode:250 e_greed:0.09724799999999725 Test reward:10.0
[07-21 17:54:18 MainThread @train.py:129] episode:300 e_greed:0.09674599999999675 Test reward:9.6
[07-21 17:54:20 MainThread @train.py:129] episode:350 e_greed:0.09625499999999626 Test reward:9.4
[07-21 17:54:23 MainThread @train.py:129] episode:400 e_greed:0.09564799999999565 Test reward:9.8
[07-21 17:54:26 MainThread @train.py:129] episode:450 e_greed:0.09487799999999488 Test reward:9.8
[07-21 17:54:31 MainThread @train.py:129] episode:500 e_greed:0.09365399999999366 Test reward:9.2
[07-21 17:54:57 MainThread @train.py:129] episode:550 e_greed:0.08736799999998737 Test reward:167.0
[07-21 17:55:36 MainThread @train.py:129] episode:600 e_greed:0.07837799999997838 Test reward:183.6
[07-21 17:56:20 MainThread @train.py:129] episode:650 e_greed:0.06967899999996968 Test reward:200.0
[07-21 17:57:06 MainThread @train.py:129] episode:700 e_greed:0.06137499999996138 Test reward:176.4
[07-21 17:57:56 MainThread @train.py:129] episode:750 e_greed:0.05306199999995306 Test reward:140.2
[07-21 17:58:44 MainThread @train.py:129] episode:800 e_greed:0.04470999999994471 Test reward:198.4
[07-21 17:59:28 MainThread @train.py:129] episode:850 e_greed:0.037133999999937134 Test reward:153.2
[07-21 18:00:12 MainThread @train.py:129] episode:900 e_greed:0.029594999999929594 Test reward:125.6
[07-21 18:00:55 MainThread @train.py:129] episode:950 e_greed:0.022406999999922406 Test reward:122.2
[07-21 18:01:37 MainThread @train.py:129] episode:1000 e_greed:0.015557999999915674 Test reward:166.4
[07-21 18:02:18 MainThread @train.py:129] episode:1050 e_greed:0.01 Test reward:114.6
[07-21 18:03:06 MainThread @train.py:129] episode:1100 e_greed:0.01 Test reward:155.4
[07-21 18:03:50 MainThread @train.py:129] episode:1150 e_greed:0.01 Test reward:155.8
[07-21 18:04:42 MainThread @train.py:129] episode:1200 e_greed:0.01 Test reward:196.8
[07-21 18:05:35 MainThread @train.py:129] episode:1250 e_greed:0.01 Test reward:200.0
[07-21 18:06:24 MainThread @train.py:129] episode:1300 e_greed:0.01 Test reward:199.4
[07-21 18:07:23 MainThread @train.py:129] episode:1350 e_greed:0.01 Test reward:200.0
[07-21 18:08:19 MainThread @train.py:129] episode:1400 e_greed:0.01 Test reward:167.6
[07-21 18:09:15 MainThread @train.py:129] episode:1450 e_greed:0.01 Test reward:200.0
[07-21 18:09:57 MainThread @train.py:129] episode:1500 e_greed:0.01 Test reward:66.4
[07-21 18:10:50 MainThread @train.py:129] episode:1550 e_greed:0.01 Test reward:75.6
[07-21 18:11:43 MainThread @train.py:129] episode:1600 e_greed:0.01 Test reward:200.0
[07-21 18:12:36 MainThread @train.py:129] episode:1650 e_greed:0.01 Test reward:200.0
[07-21 18:13:29 MainThread @train.py:129] episode:1700 e_greed:0.01 Test reward:52.4
[07-21 18:14:24 MainThread @train.py:129] episode:1750 e_greed:0.01 Test reward:200.0
[07-21 18:15:18 MainThread @train.py:129] episode:1800 e_greed:0.01 Test reward:144.6
[07-21 18:16:14 MainThread @train.py:129] episode:1850 e_greed:0.01 Test reward:200.0
[07-21 18:17:09 MainThread @train.py:129] episode:1900 e_greed:0.01 Test reward:200.0
[07-21 18:18:04 MainThread @train.py:129] episode:1950 e_greed:0.01 Test reward:200.0
[07-21 18:18:52 MainThread @train.py:129] episode:2000 e_greed:0.01 Test reward:200.0

OutSpace00 commented 2 years ago


Thank you for the patient reply. Increasing max_episode doesn't help; I raised it to 20000 and the result is the same. I've copied all of the printed output below. Could you take a look at what might be wrong? There are quite a few warnings.

D:\Python3.7.0\python.exe "C:/Users/12478/Desktop/新建文件夹 (5)/dqn/train.py"
[07-23 16:41:01 MainThread @logger.py:242] Argv: C:/Users/12478/Desktop/新建文件夹 (5)/dqn/train.py
[07-23 16:41:02 MainThread @utils.py:73] paddlepaddle version: 2.3.1.
D:\Python3.7.0\lib\site-packages\scipy\fftpack\__init__.py:103: DeprecationWarning: The module numpy.dual is deprecated. Instead of using dual, use the functions directly from numpy or scipy. from numpy.dual import register_func
D:\Python3.7.0\lib\site-packages\scipy\sparse\sputils.py:16: DeprecationWarning: np.typeDict is a deprecated alias for np.sctypeDict. supported_dtypes = [np.typeDict[x] for x in supported_dtypes]
D:\Python3.7.0\lib\site-packages\scipy\special\orthogonal.py:81: DeprecationWarning: np.int is a deprecated alias for the builtin int. To silence this warning, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\__init__.py:10: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. _nlv = LooseVersion(_np_version)
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\__init__.py:11: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. _np_version_under1p16 = _nlv < LooseVersion("1.16")
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\__init__.py:12: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. _np_version_under1p17 = _nlv < LooseVersion("1.17")
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\__init__.py:13: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. _np_version_under1p18 = _nlv < LooseVersion("1.18")
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\__init__.py:14: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. _np_version_under1p19 = _nlv < LooseVersion("1.19")
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\__init__.py:15: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. _np_version_under1p20 = _nlv < LooseVersion("1.20")
D:\Python3.7.0\lib\site-packages\setuptools\_distutils\version.py:351: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)
D:\Python3.7.0\lib\site-packages\pandas\compat\numpy\function.py:125: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. if LooseVersion(_np_version) >= LooseVersion("1.17.0"):
W0723 16:41:03.241605 3368 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 10.2
W0723 16:41:03.249423 3368 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[07-23 16:41:04 MainThread @train.py:126] episode:50 e_greed:0.0992939999999993 Test reward:9.2
[07-23 16:41:04 MainThread @train.py:126] episode:100 e_greed:0.09880699999999881 Test reward:9.2
[07-23 16:41:05 MainThread @train.py:126] episode:150 e_greed:0.09832099999999833 Test reward:9.8
[07-23 16:41:05 MainThread @train.py:126] episode:200 e_greed:0.09783399999999784 Test reward:9.4
[07-23 16:41:05 MainThread @train.py:126] episode:250 e_greed:0.09733099999999734 Test reward:8.8
[07-23 16:41:06 MainThread @train.py:126] episode:300 e_greed:0.09684099999999685 Test reward:9.6
[07-23 16:41:06 MainThread @train.py:126] episode:350 e_greed:0.09634499999999635 Test reward:9.4
[07-23 16:41:07 MainThread @train.py:126] episode:400 e_greed:0.09586099999999587 Test reward:9.0
[07-23 16:41:07 MainThread @train.py:126] episode:450 e_greed:0.09536099999999537 Test reward:9.4
[07-23 16:41:08 MainThread @train.py:126] episode:500 e_greed:0.09487999999999489 Test reward:9.4
[07-23 16:41:08 MainThread @train.py:126] episode:550 e_greed:0.0943919999999944 Test reward:9.6
[07-23 16:41:09 MainThread @train.py:126] episode:600 e_greed:0.0938979999999939 Test reward:9.6
[07-23 16:41:09 MainThread @train.py:126] episode:650 e_greed:0.09341899999999342 Test reward:9.4
[07-23 16:41:09 MainThread @train.py:126] episode:700 e_greed:0.09293799999999294 Test reward:9.4
[07-23 16:41:10 MainThread @train.py:126] episode:750 e_greed:0.09244799999999245 Test reward:8.8
[07-23 16:41:10 MainThread @train.py:126] episode:800 e_greed:0.09196599999999197 Test reward:9.0
[07-23 16:41:11 MainThread @train.py:126] episode:850 e_greed:0.09148699999999149 Test reward:8.8
[07-23 16:41:11 MainThread @train.py:126] episode:900 e_greed:0.090998999999991 Test reward:9.8
[07-23 16:41:12 MainThread @train.py:126] episode:950 e_greed:0.09051299999999052 Test reward:9.4
[07-23 16:41:12 MainThread @train.py:126] episode:1000 e_greed:0.09002099999999003 Test reward:9.2
[07-23 16:41:13 MainThread @train.py:126] episode:1050 e_greed:0.08953199999998954 Test reward:9.4
[07-23 16:41:13 MainThread @train.py:126] episode:1100 e_greed:0.08905299999998906 Test reward:9.0
[07-23 16:41:14 MainThread @train.py:126] episode:1150 e_greed:0.08857999999998858 Test reward:9.6
[07-23 16:41:14 MainThread @train.py:126] episode:1200 e_greed:0.08807899999998808 Test reward:9.4
[07-23 16:41:15 MainThread @train.py:126] episode:1250 e_greed:0.0875899999999876 Test reward:9.0
[07-23 16:41:15 MainThread @train.py:126] episode:1300 e_greed:0.0871009999999871 Test reward:9.0
[07-23 16:41:15 MainThread @train.py:126] episode:1350 e_greed:0.08662199999998663 Test reward:9.2
[07-23 16:41:16 MainThread @train.py:126] episode:1400 e_greed:0.08614199999998615 Test reward:10.0
[07-23 16:41:16 MainThread @train.py:126] episode:1450 e_greed:0.08565299999998566 Test reward:9.4
[07-23 16:41:17 MainThread @train.py:126] episode:1500 e_greed:0.08516299999998517 Test reward:8.6
[07-23 16:41:17 MainThread @train.py:126] episode:1550 e_greed:0.08468399999998469 Test reward:8.8
[07-23 16:41:18 MainThread @train.py:126] episode:1600 e_greed:0.08420299999998421 Test reward:9.6
[07-23 16:41:18 MainThread @train.py:126] episode:1650 e_greed:0.08371399999998372 Test reward:8.6
[07-23 16:41:19 MainThread @train.py:126] episode:1700 e_greed:0.08322999999998323 Test reward:9.6
[07-23 16:41:19 MainThread @train.py:126] episode:1750 e_greed:0.08274399999998275 Test reward:9.4
[07-23 16:41:20 MainThread @train.py:126] episode:1800 e_greed:0.08225399999998226 Test reward:9.6
[07-23 16:41:20 MainThread @train.py:126] episode:1850 e_greed:0.08176099999998176 Test reward:9.2
[07-23 16:41:21 MainThread @train.py:126] episode:1900 e_greed:0.08126499999998127 Test reward:9.4
[07-23 16:41:21 MainThread @train.py:126] episode:1950 e_greed:0.08077399999998078 Test reward:9.2
[07-23 16:41:22 MainThread @train.py:126] episode:2000 e_greed:0.08028299999998029 Test reward:9.2

Process finished with exit code 0

ljy2222 commented 2 years ago

You are using the static-graph version of DQN, which does not match the paddle version you have installed. You can try either of the following:
1. Train with the dynamic-graph version of DQN; here is the GitHub link: https://github.com/PaddlePaddle/PARL/tree/develop/examples/tutorials/parl2_dygraph
2. Downgrade paddle: the static-graph DQN targets paddle 1.8.5 (you have 2.3.1); here is the installation guide: https://www.paddlepaddle.org.cn/install/old?docurl=/documentation/docs/zh/install/pip/linux-pip.html
Hope this helps~
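The mismatch above can be caught before training with a quick version check. A minimal sketch (the `compatible_with_static_graph` helper is ours; the rule it encodes is the one stated in this thread: static-graph DQN needs the paddle 1.x line, while paddle 2.x defaults to dynamic-graph mode and needs the parl2_dygraph tutorial):

```python
def compatible_with_static_graph(version):
    """True when a paddle version string belongs to the 1.x line,
    which is what the static-graph DQN example targets."""
    major = int(version.split(".")[0])
    return major < 2

# The two versions that appear in this thread:
print(compatible_with_static_graph("1.8.5"))  # static-graph DQN is fine
print(compatible_with_static_graph("2.3.1"))  # use parl2_dygraph instead
```

In a real script the version string would come from `paddle.__version__` rather than a literal.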

OutSpace00 commented 2 years ago

Thanks for your answer. I tried again on Windows 11 and it still didn't work, so I switched to Ubuntu, where it runs fine. Thanks!

OutSpace00 commented 2 years ago

The dynamic-graph version doesn't work on Windows 11 either; paddle doesn't claim to support Windows 11.

ljy2222 commented 2 years ago

OK. If it converges on Ubuntu, then there is no real problem. Our tests on Windows 10 converge normally; paddle does not yet support Windows 11, so there may be some compatibility issues. Thanks for the feedback~