I just found out that the original version of the Lunar Lander example was able to land successfully sometimes. In the current code, it never even gets remotely close. It can't even get a positive score.
The original code says:

```python
# This is a work in progress, and currently takes ~100 generations to
# find a network that can land with a score >= 200 at least a couple of
# times. It has yet to solve the environment
```
In the current code, I can run it for 500+ generations without it ever cresting above 0 reward, so something has seriously regressed. On reading the code, I now realize that the `compute_fitness` function makes no sense to me, so I believe there is some issue confusing rewards with network outputs. Also, the actual scores obtained when running the networks afterward are nowhere near the "fitness" being plotted, which also points to a complete disconnect between "fitness" and actual score.
I will be debugging this in the next couple of days, but wanted to report the issue ahead of time.
To Reproduce

Steps to reproduce the behavior:

```shell
cd examples/openai-lander
python evolve.py
```

See a `fitness.svg` plot like the one below. We can't achieve a positive reward (solving the task would be a reward of +200).
Just thought I should leave an update on this issue...
Things I've learned:
The Lander example was never really working well.
I'm actually not sure I understand the author's comment quoted above. But when I run that version of the code, the networks in general perform terribly, and in 100 generations there will be one or two times when the thing accidentally and miraculously gets 200+ points. However, the behavior is extremely random and does not indicate to me that the problem was "learned" at all.
The example is trying to learn by doing reward prediction instead of using the reward directly as fitness.
I believe this has serious pitfalls. For example, if the lander doesn't fire its engine, then it (often) doesn't get any penalty. So a great way to score well at "reward prediction" is to predict reward of no action to be 0. I think there are probably other weird feedback loops and types of mode collapse like this.
I think this is made obvious by the plot I posted above. We quickly converge to 0 reward prediction error, but this doesn't help us whatsoever to actually solve the environment. When we look at the actual simulation scores, we're doing just as poorly as at the start of the simulation.
This commit is the one which regressed the example even further.
Before, fitness was a combination of overall score and reward prediction error. In this commit, it was changed to be only reward prediction. The comment describing the "composite fitness" was not changed, so it's not clear whether the change was accidental.
The new format of the evaluation prevents using actual score, and fitness can only be derived from reward prediction.
In general, Lunar Lander is probably going to be a very hard problem for NEAT. We receive basically no rewards until we land. It's extremely hard to discover this landing action by accident.
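The "predict zero reward for doing nothing" pitfall described above can be demonstrated with a toy sketch (hypothetical code, not taken from the example; Lunar Lander's rewards are only loosely imitated here):

```python
import random

def prediction_error(predicted, actual):
    """Mean squared error between per-step reward predictions and rewards."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

random.seed(0)

# Policy A: never fires the engine. It collects ~zero reward each step
# (no fuel penalty), so predicting 0 everywhere is trivially accurate.
rewards_a = [0.0] * 100
error_a = prediction_error([0.0] * 100, rewards_a)   # exactly 0

# Policy B: fires the engine, incurring a noisy fuel penalty each step.
# Even a predictor that nails the *average* penalty keeps some error.
rewards_b = [-0.3 + random.gauss(0, 0.1) for _ in range(100)]
error_b = prediction_error([-0.3] * 100, rewards_b)  # > 0

# Fitness defined as negative prediction error prefers the lazy policy,
# even though neither policy gets anywhere near landing.
assert error_a < error_b
```

Once a population discovers this shortcut, there is no gradient toward actually landing: the prediction error is already near its floor.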
Actions I'm taking in response to this:
Refactoring the example to be able to run on more different types of Gym environments, so we can try it on something easier (but not as easy as cart-pole, which it seems to crush with almost no effort).
Refactoring to restore the original composite fitness formula, and be able to configure how fitness is computed.
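A configurable composite fitness along those lines might look like the sketch below (names and default weights are made up for illustration; the repository's original formula may differ):

```python
def composite_fitness(episode_score, reward_prediction_error,
                      score_weight=1.0, error_weight=1.0):
    """Blend the actual episode score with reward-prediction accuracy.

    Hypothetical sketch: the weights are exposed as configuration so the
    example can run with pure score-based fitness (error_weight=0), pure
    reward-prediction fitness (score_weight=0), or a composite of both.
    """
    return score_weight * episode_score - error_weight * reward_prediction_error

# With error_weight=0 this reduces to plain score-based fitness, which
# at least cannot be gamed by the "predict zero reward" shortcut.
assert composite_fitness(200.0, 50.0, error_weight=0.0) == 200.0
assert composite_fitness(200.0, 50.0) == 150.0
```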