MarcoMeter / recurrent-ppo-truncated-bptt

Baseline implementation of recurrent PPO using truncated BPTT
MIT License

Can this repo train continuous environments? #8

Closed 1900360 closed 1 year ago

1900360 commented 2 years ago

I have tested this repo on 'MountainCarContinuous-v0', but it couldn't work. What needs to be modified here to run continuous environments?

MarcoMeter commented 2 years ago

Hi @1900360, support for continuous action spaces is not implemented, though I think it should be fairly easy to add.
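
For reference, a minimal sketch of what a continuous (Gaussian) policy head could look like in PyTorch. The class and layer names below are illustrative and not taken from this repository; the idea is simply to swap the Categorical head for a Normal distribution with a learnable log standard deviation:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ContinuousPolicyHead(nn.Module):
    """Illustrative Gaussian policy head for continuous actions (not this repo's code)."""
    def __init__(self, hidden_size: int, action_dim: int):
        super().__init__()
        # The mean of the Gaussian is produced by a linear layer on the hidden features
        self.mu = nn.Linear(hidden_size, action_dim)
        # State-independent log standard deviation, learned as a free parameter
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, hidden: torch.Tensor) -> Normal:
        mean = self.mu(hidden)
        std = self.log_std.exp().expand_as(mean)
        return Normal(mean, std)

# Usage sketch: sample an action and its log-probability
# (log-probs are summed over the action dimensions for PPO's ratio)
# dist = policy_head(hidden)
# action = dist.sample()
# log_prob = dist.log_prob(action).sum(dim=-1)
```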

1900360 commented 2 years ago

I tried changing the code but I get the following error:

```
Traceback (most recent call last):
  File "C:\Users\1900.conda\envs\tf\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/desktop/lunwen_dabao/xinsuanfa0912/lstm_ppo_continue/train.py", line 35, in <module>
    main()
  File "D:/desktop/lunwen_dabao/xinsuanfa0912/lstm_ppo_continue/train.py", line 31, in main
    trainer.run_training()
  File "D:\desktop\lunwen_dabao\xinsuanfa0912\lstm_ppo_continue\trainer.py", line 92, in run_training
    sampled_episode_info = self._sample_training_data()
  File "D:\desktop\lunwen_dabao\xinsuanfa0912\lstm_ppo_continue\trainer.py", line 157, in _sample_training_data
    obs, self.buffer.rewards[w, t], self.buffer.dones[w, t], info = worker.child.recv()
  File "C:\Users\1900.conda\envs\tf\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\1900.conda\envs\tf\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
EOFError
```

MarcoMeter commented 2 years ago

An EOFError usually indicates that something went wrong on the environment side. For debugging purposes I recommend running enjoy.py with an untrained model, because it does not use multiprocessing and therefore the actual exceptions are shown.
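
As a complementary check, one could also step the environment directly in the main process, bypassing the worker pipes, so that the real exception surfaces immediately. A minimal sketch, assuming the classic gym API and using the environment id from this thread:

```python
import gym

# Run the environment in the main process so exceptions are not
# swallowed by the multiprocessing pipe between trainer and workers.
env = gym.make("MountainCarContinuous-v0")
obs = env.reset()
for _ in range(100):
    # Random actions are enough to smoke-test the environment side
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```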

1900360 commented 2 years ago

I have changed your code to work in a continuous environment, such as Pendulum-v0, but the reward curve does not rise during training, as shown in the attached screenshot.

I have also attached the modified code. Please take the time to check whether it is correct: lstm_ppo_continue.zip

1900360 commented 2 years ago

I hope you can help me with this; I've been stuck on this code for days :(

MarcoMeter commented 2 years ago

Hi @1900360, I don't have the time to investigate your code. I recommend this repo for looking up the key changes needed to support continuous actions: https://github.com/PG649-3D-RPG/neroRL/tree/develop Also have a look at https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py

One important thing is to change all activations from relu to tanh. Reward and observation normalization are also very important.
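
A rough sketch of both suggestions, loosely following the CleanRL script linked above. It assumes a gym version that ships the normalization wrappers, and the MLP below is a stand-in rather than this repo's model:

```python
import gym
import numpy as np
import torch.nn as nn

def make_env(env_id: str) -> gym.Env:
    env = gym.make(env_id)
    env = gym.wrappers.ClipAction(env)
    # Running mean/std normalization of observations, clipped to a sane range
    env = gym.wrappers.NormalizeObservation(env)
    env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))
    # Normalize rewards by a running estimate of the return scale, then clip
    env = gym.wrappers.NormalizeReward(env)
    env = gym.wrappers.TransformReward(env, lambda r: np.clip(r, -10, 10))
    return env

# Tanh activations instead of ReLU in the policy/value trunk
def mlp(in_dim: int, hidden: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )
```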

Hope this helps.

MarcoMeter commented 1 year ago

I'm closing this issue as it seems stale. Please reach out again if you have more questions.