PaulPauls / Tensorflow-Neuroevolution

Neuroevolution Framework for Tensorflow 2.x focusing on modularity and high-performance. Preimplements NEAT, DeepNEAT, CoDeepNEAT
Apache License 2.0

CODEEPNEAT restoring state results in divide by zero. #20

Closed: hyang0129 closed this issue 1 year ago

hyang0129 commented 3 years ago

CODEEPNEAT restoring state results in divide by zero when training on GPU.

Backed up generation 17 of the CoDeepNEAT evolutionary run to file: /content/tfne_state_backups/tfne_state_backup_2021-Apr-10_13-57-22/tfne_state_backup_gen_17.json
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-13-da0c1a377c60> in <module>()
     12 
     13 # Start training process, returning the best genome when training ends
---> 14 best_genome = engine.train()
     15 print("Best genome returned by evolution:\n")
     16 print(best_genome)

3 frames
/usr/local/lib/python3.7/dist-packages/tfne/algorithms/codeepneat/_codeepneat_selection_mod.py in _select_modules_param_distance_fixed(self)
    216         for spec_id in mod_species_ordered:
    217             spec_fitness = self.pop.mod_species_fitness_history[spec_id][self.pop.generation_counter]
--> 218             spec_fitness_share = spec_fitness / total_avg_fitness
    219             spec_intended_size = int(round(spec_fitness_share * available_mod_pop))
    220 

ZeroDivisionError: division by zero
hyang0129 commented 3 years ago

This is actually caused by a problem in the initialize-population method. When initializing the population from a restored state, the step that sets the input shape is skipped, so all genomes end up with an input shape of None and evaluate to a fitness of 0 because they build invalid models.

https://github.com/PaulPauls/Tensorflow-Neuroevolution/blob/55c76f08ee4e4206d842565902b3d11c517c3756/tfne/algorithms/codeepneat/codeepneat.py#L99
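The shape of the fix follows from that diagnosis. As a hedged sketch (the class, attribute and method names below are illustrative stand-ins, not TFNE's actual API), the restore branch of population initialization simply needs to apply the environment input shape the same way the fresh-initialization branch does:

```python
# Hedged sketch of the restore-path fix. CoDeepNEATSketch is a stand-in for
# illustration only; it is NOT the real TFNE class or its real attributes.

class CoDeepNEATSketch:
    def __init__(self, initial_state_file_path=None):
        self.initial_state_file_path = initial_state_file_path
        self.input_shape = None  # bug symptom: stays None after a restore

    def initialize_population(self, environment_input_shape):
        if self.initial_state_file_path is not None:
            pass  # ... deserialize the population from the JSON backup ...
        else:
            pass  # ... create a fresh initial population ...
        # Set the input shape unconditionally, on BOTH branches, so restored
        # genomes do not build invalid models with an input shape of None.
        self.input_shape = environment_input_shape


algo = CoDeepNEATSketch("tfne_state_backup_gen_17.json")
algo.initialize_population((28, 28, 1))
assert algo.input_shape == (28, 28, 1)
```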

joshianirudh commented 2 years ago

> This is actually caused by a problem in the initialize population method. When running initialize population from a restore, it skips the process of setting the input shape, thus all genomes have an input shape of none and evaluate at 0 fitness because they are invalid models.
>
> https://github.com/PaulPauls/Tensorflow-Neuroevolution/blob/55c76f08ee4e4206d842565902b3d11c517c3756/tfne/algorithms/codeepneat/codeepneat.py#L99

@hyang0129 did you find a solution for this?

joshianirudh commented 2 years ago

@PaulPauls can you please tell me how to fix this? I really need this functionality, as I don't have the computational resources required to run CoDeepNEAT for a long time.

Gabriel-Kissin commented 1 year ago

I've had this ZeroDivisionError in normal use, i.e. even when not restoring a population.

The docs state "If due to the random choice of modules for the blueprint graph an invalid TF model is generated from the genome genotype, the assembled genome is assigned a fitness score of 0". I guess when this happens randomly to all genomes, the total fitness is zero, which triggers the error:

File ...... in CoDeepNEATSelectionMOD._select_modules_param_distance_fixed(self)
    216 for spec_id in mod_species_ordered:
    217     spec_fitness = self.pop.mod_species_fitness_history[spec_id][self.pop.generation_counter]
--> 218     spec_fitness_share = spec_fitness / total_avg_fitness
    219     spec_intended_size = int(round(spec_fitness_share * available_mod_pop))
    221     if len(self.pop.mod_species[spec_id]) + self.mod_spec_min_offspring > spec_intended_size:

ZeroDivisionError: division by zero

In fact, looking more closely at the code around the error, this doesn't even require all genomes in the population to have a fitness of zero. If just ONE species has a total fitness of zero (i.e. all genomes in that species have zero fitness), the error can be raised. To see this, look at lines 216-227 of tfne/algorithms/codeepneat/_codeepneat_selection_mod.py. The loop iterates over all species, and the last line of each iteration subtracts that species' fitness from the running total. So by the time the loop reaches the final species, everything else has been subtracted and total_avg_fitness equals that final species' own fitness. If that species has zero fitness, total_avg_fitness is now zero, and the division raises the error.
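The failure mode can be reproduced with a simplified stand-in for that loop (the function below mirrors the quoted lines but is not the actual TFNE code, and the fitness values are illustrative):

```python
def select_module_sizes(species_fitness, available_mod_pop):
    """Simplified mirror of the fitness-share loop from
    _select_modules_param_distance_fixed (NOT the real TFNE code)."""
    sizes = {}
    total_avg_fitness = sum(species_fitness.values())
    for spec_id, spec_fitness in species_fitness.items():
        # Once every nonzero-fitness species has been handled, the running
        # total is 0, and any remaining zero-fitness species divides 0 by 0.
        spec_fitness_share = spec_fitness / total_avg_fitness
        sizes[spec_id] = int(round(spec_fitness_share * available_mod_pop))
        total_avg_fitness -= spec_fitness
    return sizes

# Illustrative values: two healthy species, two with all-zero fitness.
try:
    select_module_sizes({6: 29, 8: 27, 9: 0, 10: 0}, available_mod_pop=20)
except ZeroDivisionError as err:
    print("reproduced:", err)  # prints: reproduced: division by zero
```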

I suppose species with zero fitness can be removed prior to this operation - that should prevent this error.
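A hedged sketch of that suggestion (again a simplified stand-in, not the actual TFNE code, and it omits the running-subtraction bookkeeping of the original loop): drop zero-fitness species before computing the fitness shares, and give them only the minimum offspring count:

```python
def select_module_sizes_safe(species_fitness, available_mod_pop,
                             min_offspring=1):
    """Sketch of the proposed fix: zero-fitness species are excluded from
    the fitness-share computation, so the divisor can never reach zero."""
    viable = {sid: f for sid, f in species_fitness.items() if f > 0}
    if not viable:
        # Every species collapsed to fitness 0 (e.g. the restore bug above):
        # there is no meaningful share to compute, so fail loudly instead.
        raise RuntimeError("all module species have zero fitness")
    total = sum(viable.values())
    # Zero-fitness species survive with the minimum offspring count only.
    sizes = {sid: min_offspring
             for sid in species_fitness if sid not in viable}
    for sid, fitness in viable.items():
        sizes[sid] = int(round(fitness / total * available_mod_pop))
    return sizes
```

With illustrative values such as {6: 29, 8: 27, 9: 0, 10: 0}, species 6 and 8 get sizes proportional to their fitness, while species 9 and 10 each keep one offspring instead of crashing the run.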

Gabriel-Kissin commented 1 year ago

Just to back up what I wrote in the previous comment, here is the output of the run which raised the error in Generation 46:

Evaluating 40 genomes in generation 45...
[========================================] 40/40 Genomes | Genome ID 1840 achieved fitness of 16.8438

############################################################  Population Summary  ############################################################

Generation:   45  ||  Best Genome Fitness:    44.25  ||  Avg Blueprint Fitness:  26.6785  ||  Avg Module Fitness:  25.6934
Best Genome: CoDeepNEAT Genome | ID:    186 | Fitness:  44.25 | Blueprint ID:     34 | Module Species: {1, 3} | Optimizer:    sgd | Origin Gen:    4

Blueprint Species       || Blueprint Species Avg Fitness       || Blueprint Species Size
     2                  ||  25.2025                            ||        3
Best BP of Species 2    || CoDeepNEAT Blueprint | ID:   #225 | Fitness: 28.5391 | Nodes:    9 | Module Species: {8, 6} | Optimizer: sgd
     4                  ||   26.438                            ||        4
Best BP of Species 4    || CoDeepNEAT Blueprint | ID:   #218 | Fitness: 28.8477 | Nodes:    8 | Module Species: {8, 6} | Optimizer: sgd
     5                  ||  28.4752                            ||        3
Best BP of Species 5    || CoDeepNEAT Blueprint | ID:   #228 | Fitness: 29.3125 | Nodes:    2 | Module Species: {6} | Optimizer: sgd

Module Species          || Module Species Avg Fitness          || Module Species Size
     6                  ||  26.6094                            ||        9
Best Mod of Species 6   || CoDeepNEAT DENSE Module | ID:   #523 | Fitness: 31.2864 | Units:   28 | Activ:   tanh | Dropout:  0.4
     8                  ||  24.9439                            ||       11
Best Mod of Species 8   || CoDeepNEAT DENSE Module | ID:   #529 | Fitness: 32.125 | Units:   12 | Activ:   tanh | Dropout:  0.4

##############################################################################################################################################

Evaluating 40 genomes in generation 46...
[========================================] 40/40 Genomes | Genome ID 1880 achieved fitness of 26.17191

############################################################  Population Summary  ############################################################

Generation:   46  ||  Best Genome Fitness:    44.25  ||  Avg Blueprint Fitness:  26.7656  ||  Avg Module Fitness:  18.4226
Best Genome: CoDeepNEAT Genome | ID:    186 | Fitness:  44.25 | Blueprint ID:     34 | Module Species: {1, 3} | Optimizer:    sgd | Origin Gen:    4

Blueprint Species       || Blueprint Species Avg Fitness       || Blueprint Species Size
     2                  ||  27.3138                            ||        3
Best BP of Species 2    || CoDeepNEAT Blueprint | ID:   #225 | Fitness: 28.9375 | Nodes:    9 | Module Species: {8, 6} | Optimizer: sgd
     4                  ||   29.612                            ||        3
Best BP of Species 4    || CoDeepNEAT Blueprint | ID:   #236 | Fitness: 30.5508 | Nodes:    8 | Module Species: {8, 6} | Optimizer: sgd
     5                  ||  24.2197                            ||        4
Best BP of Species 5    || CoDeepNEAT Blueprint | ID:   #228 | Fitness: 29.3789 | Nodes:    2 | Module Species: {6} | Optimizer: sgd

Module Species          || Module Species Avg Fitness          || Module Species Size
     6                  ||  28.8587                            ||       10
Best Mod of Species 6   || CoDeepNEAT DENSE Module | ID:   #550 | Fitness: 31.6407 | Units:   28 | Activ:   tanh | Dropout:  0.4
     8                  ||  26.6219                            ||        3
Best Mod of Species 8   || CoDeepNEAT DENSE Module | ID:   #529 | Fitness: 28.1701 | Units:   12 | Activ:   tanh | Dropout:  0.4
     9                  ||        0                            ||        2
Best Mod of Species 9   || CoDeepNEAT DENSE Module | ID:   #536 | Fitness:      0 | Units:   12 | Activ:   relu | Dropout:  0.4
    10                  ||        0                            ||        5
Best Mod of Species 10   || CoDeepNEAT DENSE Module | ID:   #537 | Fitness:      0 | Units:   32 | Activ:   tanh | Dropout:  0.2

##############################################################################################################################################

... and immediately after this the ZeroDivisionError was raised.

Note that in Gen45, all the species have fitnesses greater than zero. So no error. Whereas in Gen46, both MODULE species 9 and 10 have an average fitness of zero, and the best module of both species has a fitness of zero - that means that the total fitness of those species is zero - which is what caused the error, as explained.

Hope this helps someone!

PaulPauls commented 1 year ago

Hey, please excuse my very late answer in this thread. I have not found the time to properly maintain this project, which grew out of a 2019 research project into dynamic graphs using Tensorflow. I am aware of a lot of bugs in this research prototype, some of which I have known about for a while, since I continued this project privately. However, I never found the time to document that progress and therefore never pushed it publicly. Since I can't seem to find the time to maintain this open-source project among my other engagements, I have decided to archive it.

For future work employing evolutionary computation and algorithms, I highly recommend the Google EvoJax library. It uses JAX as the backend ML library, which is much better suited to dynamically changing graphs than Tensorflow but wasn't around back when I started this project. Please see here:

https://github.com/google/evojax