CodeReclaimers / neat-python

Python implementation of the NEAT neuroevolution algorithm
BSD 3-Clause "New" or "Revised" License

FeedForwardNetwork is catastrophically broken #255

Open ntraft opened 1 year ago

ntraft commented 1 year ago

One-Line Summary

On almost all evolved genomes, FeedForwardNetwork does not execute all nodes, and many nodes retain their initial value of 0.0. This manifests itself as a network always returning an output of all zeros.

More Details

Sorry for the melodramatic headline, but I wanted to make it clear that this isn't just any old bug. As far as I can tell, it would affect any and all feed-forward NEAT programs in a very dramatic way. I found the issue on the OpenAI Lunar Lander example. It's actually pretty interesting that the XOR and Cart-Pole examples still work well. The issue doesn't present itself until the network has evolved for a while, so it isn't immediately apparent.

I'm using the latest version of the repo (not the pip release). The code I'm looking at has been present since at least 2017, so I'm not exactly sure whether it's expected behavior.

The issue I'm seeing is basically that FeedForwardNetwork.node_evals does not actually contain all necessary nodes. In the net below, only the following nodes are being run: 2176, 1153, 1602, 1311. Notice that these are the only nodes which depend on an input, but do not depend on any other nodes (blue arrows).

[Image: winner-1-net-pruned — pruned rendering of the winning network]

There are many non-input nodes which have no predecessors (red arrows). These nodes are not being included in the initial set of eligible nodes in neat.graphs.feed_forward_layers(). The fix would be to add these nodes as the first "layer" of the computation. I intend to fix this in my own fork and submit a pull request.
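To make the idea concrete, here is a rough sketch of the kind of change I mean (illustrative only, not the final patch; it keeps the existing feed_forward_layers(inputs, outputs, connections) signature but omits the required_for_output filtering the real function performs, and the helper names are mine):

```python
def feed_forward_layers(inputs, outputs, connections):
    """Sketch of the proposed change: seed the first "layer" with every
    non-input node that has no incoming connections, so dangling nodes
    are still evaluated instead of silently staying at 0.0.
    (Simplified: the real function also filters by required_for_output.)
    """
    nodes_with_inputs = {b for (a, b) in connections}
    all_nodes = {a for (a, b) in connections} | nodes_with_inputs
    hidden_and_output = all_nodes - set(inputs)

    # Proposed addition: dangling nodes form the very first layer.
    dangling = hidden_and_output - nodes_with_inputs
    layers = [dangling] if dangling else []

    # Existing layering idea: repeatedly collect nodes whose inputs are
    # all already available.
    visited = set(inputs) | dangling
    while True:
        candidates = {b for (a, b) in connections if a in visited and b not in visited}
        next_layer = {n for n in candidates
                      if all(a in visited for (a, b) in connections if b == n)}
        if not next_layer:
            break
        layers.append(next_layer)
        visited |= next_layer
    return layers


# Toy check: node 3 has no incoming connections but still gets a layer.
print(feed_forward_layers(inputs=[-1], outputs=[0], connections=[(3, 0), (-1, 0)]))
# -> [{3}, {0}]
```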

However, I'm not sure if this is the correct fix, or if it is an indicator of a deeper problem. Are these "dangling" input-less nodes expected at all? This seems related to #250. These nodes are largely unnecessary, yet they are not completely functionless (they essentially act as an additional "bias" parameter). Plus, they could become more integrated by the addition of new edges in later generations. So it seems like they belong there, but it is shocking that such a massive bug has existed for years, and it makes me very suspicious of whether my fix is correct.

To Reproduce

Steps to reproduce the behavior:

  1. cd examples/openai-lander
  2. python evolve.py
  3. See a fitness.svg plot like the one below (I have modified it for more clarity). We can't achieve a positive reward (solving the task would be a reward of +200).

[Image: fitness.svg — fitness over generations for the Lunar Lander run]

ntraft commented 1 year ago

As I look into this further, it really does seem like nodes in the graph were never intended to be without inputs. Is there some place in the NEAT algorithm where these dangling no-input nodes should be filtered out? Many of the aggregation functions throw an error if called on an empty list.
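For instance (my own minimal snippet, not neat-python's code), a mean-style aggregation defined as sum over count fails outright when a dangling node supplies an empty input list:

```python
# Minimal illustration: a mean aggregation defined the usual way cannot
# handle a node that has no enabled incoming connections.
def mean_aggregation(values):
    return sum(values) / len(values)

try:
    mean_aggregation([])   # a "dangling" node supplies an empty input list
except ZeroDivisionError as e:
    print("aggregation failed:", e)
```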

ntraft commented 1 year ago

I have read the original NEAT code (C++, from Stanley's thesis, 2001), and indeed found this note just before the code that disables a connection:

//We need to make sure that another gene connects out of the in-node
//Because if not a section of network will break off and become isolated

That's exactly what's happening here. Also, there is no "delete node" or "delete connection" mutation in the original work; there is only "toggle enable". So in the original NEAT these dangling, input-less nodes cannot arise.

I will likely code up a fix for this in the next week or so, but have bigger bugs to unravel first.

markste-in commented 11 months ago

Maybe try out my 'fix': it removes all the dangling nodes every iteration. I did a few tests and got somewhat more stable results:

https://github.com/markste-in/neat-python/tree/remove_dangling_nodes
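Roughly, the idea is something like the sketch below (just to illustrate the approach, not the actual code on that branch; it assumes the default genome layout where genome.connections is keyed by (in_node, out_node) tuples whose genes have an enabled flag, and output keys come from config.genome_config.output_keys):

```python
def prune_dangling_nodes(genome, config):
    """Illustrative sketch: repeatedly delete hidden nodes that have no
    enabled incoming connections, along with any connections touching
    them, until no such nodes remain."""
    output_keys = set(config.genome_config.output_keys)

    while True:
        # Nodes fed by at least one enabled connection.
        fed = {out for (inp, out), gene in genome.connections.items() if gene.enabled}
        dangling = [key for key in genome.nodes
                    if key not in output_keys and key not in fed]
        if not dangling:
            break
        for key in dangling:
            del genome.nodes[key]
            # Drop every connection that touches the removed node.
            for conn_key in [ck for ck in genome.connections if key in ck]:
                del genome.connections[conn_key]
```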

Finebouche commented 2 months ago

I proposed a fix here and added some code so that the computation doesn't take any dangling nodes into account: #282