araffin / robotics-rl-srl

S-RL Toolbox: Reinforcement Learning (RL) and State Representation Learning (SRL) for Robotics
https://s-rl-toolbox.readthedocs.io
MIT License

[bug report] CSV table calculated by ./replay/gather_results.py is wrong (reward underestimated) #51

Open ncble opened 5 years ago

ncble commented 5 years ago

Describe the bug In the current master version, the script ./replay/gather_results.py is used to gather experiment results and produce a CSV table. Recently, when I tried to reproduce the results of your paper Decoupling feature extraction from policy learning ..., I found that my reward results were much better than yours (especially Table 3 on page 7). That table compares the mean reward obtained with RL (using PPO) in the robotic arm (random target) environment (aka KukaButton with random target). Digging into the code, I found the bug:

https://github.com/araffin/robotics-rl-srl/blob/1ab1bd366825f98f0282d05e32a3de0cbf7f0f9a/replay/gather_results.py#L136-L140

@kalifou has already confirmed this problem.

Explanation It's a rounding problem. Line 138 actually performs floor division rather than float division (at least for KukaButton, and possibly for the other environments too), because run_acc is an array of dtype int64.

Code example The problem actually comes from NumPy's casting behaviour when assigning into an integer array. The following code reproduces the phenomenon:

import numpy as np
A = np.arange(10, dtype=np.int64)
print(A[:]/10) # np.array([0.0, 0.1, ..., 0.9])
A[:] = A[:] / 10  # in-place assignment casts the float result back to int64 (values truncated toward zero)
print(A) # np.array([0, 0, ..., 0])

Interestingly, A = A / 10 works as expected (true division returns a new float64 array), but A[:] = A[:] / 10 does not: the in-place assignment casts the float results back into the existing int64 array.
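For reference, here is a minimal sketch of how the division could be made safe. The variable names run_acc and n_episodes are mine, chosen for illustration; they are not necessarily the names used in gather_results.py:

import numpy as np

# Hypothetical accumulated rewards per run, stored as integers
run_acc = np.array([15, 27, 42], dtype=np.int64)
n_episodes = 10

# Option 1: explicitly cast to float before dividing
run_mean = run_acc.astype(np.float64) / n_episodes

# Option 2: plain true division already returns a new float64 array,
# as long as the result is NOT written back into the int64 array in place
run_mean_alt = run_acc / n_episodes

print(run_mean)      # [1.5 2.7 4.2]
print(run_mean_alt)  # [1.5 2.7 4.2]

Either way, the key point is to never assign the float result back into an int64 array with in-place slicing.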

Solution

Use my code instead: ./replay/postprocessing_logs.py (temporary name). It can directly produce the LaTeX table and handles most situations (heterogeneous data: different numbers of experiments, different run lengths, a scalable set of "checkpoints" (timesteps), different SRL models).
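To illustrate the kind of output I mean (this is not the actual postprocessing_logs.py, just a hypothetical sketch with made-up SRL model names and reward values), a LaTeX table can be generated from aggregated rewards with pandas:

import numpy as np
import pandas as pd

# Hypothetical data: SRL model -> mean reward per random seed (values are invented)
results = {
    "ground_truth": [4.1, 4.3, 3.9],
    "raw_pixels": [2.5, 2.8, 2.2],
    "srl_combination": [3.7, 3.5, 3.9],
}

rows = []
for model, rewards in results.items():
    rewards = np.asarray(rewards, dtype=np.float64)  # keep everything in float
    rows.append({
        "SRL model": model,
        "mean reward": rewards.mean(),
        "std": rewards.std(),
        "runs": len(rewards),
    })

df = pd.DataFrame(rows)
print(df.to_latex(index=False, float_format="%.2f"))

The real script additionally has to cope with runs of different lengths and different sets of checkpoints, which is why it is more involved than this sketch.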

The following picture is a demo of my code:


Question

Are there similar problems elsewhere in the toolbox?