AdamGleave commented 4 years ago

Uses a more principled initialization of the affine parameters, chosen to minimize least-squares error, in place of the previous heuristic of matching mean and standard deviation. The heuristic handles opposite rewards badly: it sets the scale to identity when it should be zero.
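To illustrate the failure mode, here is a minimal sketch comparing the two initializations for a single scale and shift. It is not the code in this PR: the function names `fit_by_moments` and `fit_least_squares` are hypothetical, the rewards are plain Python lists, and the scale is assumed to be constrained to be non-negative (as the NNLS in the PR title suggests). With one scale and an unconstrained shift, the constrained least-squares solution has a closed form: clip the ordinary least-squares slope at zero.

```python
from statistics import mean, pstdev


def fit_by_moments(source, target):
    """Heuristic: pick scale and shift so mean and s.d. of source match target.

    Assumes source is not constant (pstdev(source) > 0).
    """
    scale = pstdev(target) / pstdev(source)
    shift = mean(target) - scale * mean(source)
    return scale, shift


def fit_least_squares(source, target):
    """Minimize sum((scale * source + shift - target)^2) subject to scale >= 0."""
    mean_s, mean_t = mean(source), mean(target)
    var_s = sum((s - mean_s) ** 2 for s in source)
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    # The unconstrained optimum is cov / var_s; if it is negative, the
    # constrained optimum lies on the boundary scale = 0.
    scale = max(0.0, cov / var_s) if var_s > 0 else 0.0
    shift = mean_t - scale * mean_s
    return scale, shift


source = [0.0, 1.0, 2.0, 3.0]
opposite = [-r for r in source]  # target is the exact negation of source

print("moment matching:", fit_by_moments(source, opposite))
print("least squares:  ", fit_least_squares(source, opposite))
```

On the opposite-reward example, moment matching returns a scale of 1 because the standard deviations are equal, while the constrained least-squares fit returns a scale of 0 and a shift equal to the target mean.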

Also adds support for optimization via alternating minimization of the affine parameters (analytic, closed form) and the potential shaping weights (gradient descent). This worked well in the tabular setting but seems to neither help nor hurt in the function approximator setting. It is disabled by default.
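The alternating scheme can be sketched for the tabular case as follows. This is an illustrative assumption about the structure, not the implementation in this PR: the model is taken to be `scale * source + shift + gamma * phi[s'] - phi[s]`, the loss is mean squared error over a fixed list of transitions, and the names `fit_affine`, `loss`, and `alternating_fit` as well as the hyperparameters are hypothetical.

```python
def fit_affine(x, y):
    """Closed-form least-squares fit y ~ scale * x + shift with scale >= 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    scale = max(0.0, cov / var_x) if var_x > 0 else 0.0
    return scale, mean_y - scale * mean_x


def loss(transitions, source, target, scale, shift, phi, gamma):
    """Mean squared error between the transformed source and the target."""
    total = 0.0
    for (s, s_next), r, t in zip(transitions, source, target):
        pred = scale * r + shift + gamma * phi[s_next] - phi[s]
        total += (pred - t) ** 2
    return total / len(transitions)


def alternating_fit(transitions, source, target, n_states,
                    gamma=0.9, outer_steps=50, inner_steps=10, lr=0.1):
    """Alternate an analytic affine update with gradient steps on the potential."""
    n = len(transitions)
    phi = [0.0] * n_states
    scale, shift = 1.0, 0.0
    for _ in range(outer_steps):
        # Step 1: with phi fixed, subtract the shaping term from the target
        # and solve for the affine parameters in closed form.
        residual = [t - (gamma * phi[s_next] - phi[s])
                    for (s, s_next), t in zip(transitions, target)]
        scale, shift = fit_affine(source, residual)
        # Step 2: with the affine parameters fixed, take gradient descent
        # steps on the tabular potential phi.
        for _ in range(inner_steps):
            grad = [0.0] * n_states
            for (s, s_next), r, t in zip(transitions, source, target):
                err = scale * r + shift + gamma * phi[s_next] - phi[s] - t
                grad[s_next] += 2.0 * err * gamma / n
                grad[s] -= 2.0 * err / n
            phi = [p - lr * g for p, g in zip(phi, grad)]
    return scale, shift, phi
```

Because the affine step is an exact minimization and the gradient step uses a small learning rate, the loss should not increase from one outer iteration to the next in this toy setting. Whether the same holds with a neural network potential is a separate question, consistent with the mixed results reported for the function approximator setting.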

codecov[bot] commented 4 years ago

Codecov Report

No coverage uploaded for pull request base (master@5b76ea7). The diff coverage is 83.06%.

@@            Coverage Diff            @@
##             master      #13   +/-   ##
=========================================
  Coverage          ?   76.02%           
=========================================
  Files             ?       45           
  Lines             ?     2886           
  Branches          ?        0           
=========================================
  Hits              ?     2194           
  Misses            ?      692           
  Partials          ?        0

| Impacted Files | Coverage Δ |
|---|---|
| src/evaluating_rewards/envs/mujoco.py | `98.16% <ø> (ø)` |
| src/evaluating_rewards/analysis/stylesheets.py | `71.42% <ø> (ø)` |
| ...uating_rewards/analysis/plot_divergence_heatmap.py | `66.66% <0%> (ø)` |
| src/evaluating_rewards/experiments/comparisons.py | `0% <0%> (ø)` |
| src/evaluating_rewards/scripts/train_regress.py | `37.5% <0%> (ø)` |
| ...luating_rewards/experiments/point_mass_analysis.py | `78.12% <0%> (ø)` |
| tests/test_scripts.py | `100% <100%> (ø)` |
| ...c/evaluating_rewards/analysis/gridworld_heatmap.py | `96.22% <100%> (ø)` |
| tests/test_comparisons.py | `100% <100%> (ø)` |
| src/evaluating_rewards/experiments/synthetic.py | `88.78% <100%> (ø)` |

... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5b76ea7...ea0fcf8.

AdamGleave commented 4 years ago

Using NNLS initialization helps a lot; the alternating minimization doesn't seem to make much difference.

Uploading results for posterity. heatmaps.zip

HumanCompatibleAI / evaluating-rewards

Model comparison: NNLS initialization and alternating minimization #13
