-
### What does it mean when we roll out PPO with numsteps > episode length
I know from the code that it will recycle the environment after you pass the terminal timestep. The question that I have is…
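To make the recycling concrete, here is a minimal sketch of what I understand the rollout loop to do (plain classic gym API; `num_steps` and the reset-on-done logic are my reading of the behavior, not the library's actual code):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()

num_steps = 2048  # rollout horizon; may exceed the episode length
trajectory = []

for step in range(num_steps):
    action = env.action_space.sample()  # stand-in for the PPO policy
    next_obs, reward, done, info = env.step(action)
    trajectory.append((obs, action, reward, done))
    if done:
        # the environment is "recycled": a fresh episode begins, and the
        # rollout keeps filling the same num_steps buffer across episodes
        obs = env.reset()
    else:
        obs = next_obs
```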
-
I think the experts are still TF right now (incompatible with our repo since the torch port). Addresses part of #215.
-
## Purpose
The purpose of this issue (discussion) is to introduce a series of PRs in the near future aimed at releasing Tianshou's full benchmark for the MuJoCo Gym task suite.
This benchmark will inc…
-
Hello. I'm just getting started with SpinningUp and have encountered an issue when I try to run ExperimentGrid. Full disclosure: I'm running Windows and I followed the instructions linked on the spinn…
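For reference, the kind of minimal ExperimentGrid script I'm trying to run, adapted from the Spinning Up docs (the grid values here are just placeholders):

```python
from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch

if __name__ == '__main__':
    # the __main__ guard matters on Windows, where worker processes
    # re-import this module instead of forking
    eg = ExperimentGrid(name='ppo-grid-test')
    eg.add('env_name', 'CartPole-v0')
    eg.add('seed', [0, 10])
    eg.add('epochs', 10)
    eg.run(ppo_pytorch, num_cpu=1)
```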
-
I failed to train some PPO agents on Acrobot-v1: the test reward never changes, staying at -500 (i.e., -1 per step for the full 500-step limit, meaning the goal is never reached). My code is the same as test/discrete/test_ppo, except the env is Acrobot-v1. Also, when I use a custom ac…
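For what it's worth, a quick sanity check of the reward structure (classic gym API), confirming that -500 is exactly the return of a policy that never solves the task:

```python
import gym

env = gym.make("Acrobot-v1")
env.reset()
total, done = 0.0, False
while not done:
    # Acrobot-v1 gives -1 every step until the goal height is reached
    _, reward, done, _ = env.step(env.action_space.sample())
    total += reward
print(total)  # a random policy almost always hits the 500-step limit: -500.0
```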
-
### System information
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Linux Ubuntu 16.04
- **Ray installed from (source or binary)**: source
- **Ray version**: 0.6.5
- **Python…
-
### What is the problem?
When running a simple RLlib training script, almost identical to the example [here](https://docs.ray.io/en/master/rllib-training.html#basic-python-api), I get the follo…
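The script is essentially the linked docs example, which (assuming a version of RLlib where `ray.rllib.agents.ppo.PPOTrainer` is the entry point) looks roughly like this:

```python
import ray
import ray.rllib.agents.ppo as ppo

ray.init()

config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 1

trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

for i in range(3):
    result = trainer.train()
    print(result["episode_reward_mean"])
```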
-
**Important Note: We do not do technical support or consulting** and we don't answer personal questions via email.
Please post your question on the [RL Discord](https://discord.com/invite/xhfNqQv), [R…
-
### Question
Hello, I am writing a custom evaluation callback for PPO (mostly based on the SB3 evaluation callback). Since PPO uses VecNormalize for the training envs, how could I pass the statistics f…
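A sketch of the pattern I'm considering, using SB3's `sync_envs_normalization` helper (the env id is just an example; the eval env gets its own VecNormalize wrapper with training disabled):

```python
import gym
from stable_baselines3.common.vec_env import (
    DummyVecEnv,
    VecNormalize,
    sync_envs_normalization,
)

make_env = lambda: gym.make("Pendulum-v1")

train_env = VecNormalize(DummyVecEnv([make_env]))
eval_env = VecNormalize(
    DummyVecEnv([make_env]),
    training=False,     # freeze the running statistics during evaluation
    norm_reward=False,  # report raw (unnormalized) rewards at eval time
)

# copy the running obs/return statistics from the training envs to eval
sync_envs_normalization(train_env, eval_env)
```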
-
### Question
For PPO, I understand that advantage normalization (for each batch of experiences) is sort of a standard practice. I've seen other implementations do it, too. However, I find it a litt…
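For concreteness, the per-batch normalization I'm referring to is the usual one-liner (with a small epsilon for numerical stability):

```python
import torch

def normalize_advantages(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # standardize within the batch: zero mean, unit standard deviation
    return (adv - adv.mean()) / (adv.std() + eps)
```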