Issue with pygad.load()

ahmedfgad / GeneticAlgorithmPython

Source code of PyGAD, a Python 3 library for building the genetic algorithm and training machine learning algorithms (Keras & PyTorch).

https://pygad.readthedocs.io

BSD 3-Clause "New" or "Revised" License

Issue with pygad.load() #88

Open nikospappas1987 opened 2 years ago

nikospappas1987 commented 2 years ago

When I load a previously saved instance of the genetic algorithm with ga_instance = pygad.load(filename=filename), the loaded instance has only the best solution as a parent and not the selected number of parents from the saved instance. To illustrate: for num_parents_mating=2 and keep_parents=-1, the loaded instance has two identical parents (two copies of the best solution of the saved instance) rather than the two parents of the saved instance.

ahmedfgad commented 2 years ago

Could you please share your code?

I have already tested a similar example in which I print the parents of the last generation, using the last_generation_parents attribute, after saving the instance and again after loading it. The two lists of parents are identical.

ga_instance.last_generation_parents
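
The check described above can be sketched without PyGAD. The snippet below is only an illustration of the save-then-load comparison: it pickles a plain dictionary that stands in for the GA state, and the names `state`, `save_state` and `load_state` are hypothetical, not part of PyGAD's API. With a real instance, the same comparison would be made on ga_instance.last_generation_parents before pygad's save and after pygad.load().

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the saved GA state (not PyGAD's internal format).
state = {
    "generations_completed": 50,
    "last_generation_parents": [[1, 0, 1, 1], [0, 0, 1, 1]],
}


def save_state(obj, filename):
    # Write the object to "<filename>.pkl".
    with open(filename + ".pkl", "wb") as fh:
        pickle.dump(obj, fh)


def load_state(filename):
    # Read the object back from "<filename>.pkl".
    with open(filename + ".pkl", "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_ga")
    save_state(state, path)
    restored = load_state(path)

# Compare the parents before saving and after loading.
parents_match = (
    restored["last_generation_parents"] == state["last_generation_parents"]
)
print("Parents identical after reload:", parents_match)
```

If the two lists match right after loading but diverge one generation later, the difference comes from the run itself rather than from saving and loading.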

nikospappas1987 commented 2 years ago

Sorry for the delayed answer but I was sick.

After paying more attention to it, I found there is no problem. The algorithm was simply very close to convergence, so after one more round the two best solutions were identical. It does load the two different parents, but after one round the two best solutions become identical. Thanks for the advice on using ga_instance.last_generation_parents; I wouldn't have found the cause without it. And thanks a lot for developing this great library.

Something different I noticed while trying to understand the above issue is that ga_instance.best_solution() sometimes doesn't return the best solution. For example, the code

solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Fitness of the best solution :", solution_fitness)
print(f'Fitness of the last solution :{ga_instance.last_generation_fitness}')

printed

Fitness of the best solution : 0.792773668893014
Fitness of the last solution :[0.82573473 0.82573473 0.79277367 0.70507075 0.72833979 0.82573473 0.73806056 0.7600569 0.73776799 0.71602506]

which is obviously wrong, since the largest value in the list is 0.82573473. But after one generation it again prints the right solution:

Fitness of the best solution : 0.8257347260699635
Fitness of the last solution :[0.82573473 0.82573473 0.7620138 0.77750815 0.79653262 0.77248549 0.76682924 0.73932479 0.76655292 0.77736544]

This happens randomly and rarely and, from what I understand, it doesn't seem to have any effect on the algorithm.

urowietu commented 2 years ago

Are you using multithreading by any chance? Just a hunch/question.

Keith


ahmedfgad commented 2 years ago

No issue @nikospappas1987. I hope you are recovered now!

I will check for the reason why the best_solution() method did not return the best fitness reported in last_generation_fitness. Thanks for reporting that!

ahmedfgad commented 2 years ago

Are you using multithreading by any chance? Just a hunch/question.

Not yet!

nikospappas1987 commented 2 years ago

Are you using multithreading by any chance? Just a hunch/question.

No multithreading in the callback function where the best_solution() method is called.

ahmedfgad commented 2 years ago

Based on my tests, with a deterministic fitness function I did not find a difference between the fitness values calculated by these 2 lines.

_, solution_fitness, _ = ga_instance.best_solution()
max_last_generation_fitness = max(ga_instance.last_generation_fitness)

For a non-deterministic fitness function, these 2 values may differ. I ran an experiment in which the fitness value depends on a random number, which makes the fitness value calculated by ga_instance.best_solution() differ.

You are welcome to report a test case where the 2 values are different.
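
The effect described above can be reproduced with a toy example that does not use PyGAD at all. It is a minimal sketch under one assumption taken from this thread: that the best-solution lookup evaluates the fitness function again instead of reading the stored values. The functions `exact_fitness` and `noisy_fitness` and the four-member population are hypothetical.

```python
import random


def exact_fitness(solution):
    # Deterministic: the same solution always gets the same score.
    return sum(solution) / len(solution)


def noisy_fitness(solution, rng):
    # Non-deterministic: the same score plus fresh random noise on every call.
    return exact_fitness(solution) + rng.uniform(-0.05, 0.05)


rng = random.Random(0)
population = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0], [0, 1, 0, 0]]

# Fitness values stored when the generation was evaluated.
cached_fitness = [noisy_fitness(s, rng) for s in population]
# The same, unchanged population evaluated a second time.
recomputed_fitness = [noisy_fitness(s, rng) for s in population]

best_cached = max(cached_fitness)
best_recomputed = max(recomputed_fitness)
print("max of stored fitness:    ", best_cached)
print("max of recomputed fitness:", best_recomputed)
print("noisy values differ:", best_cached != best_recomputed)

# With the deterministic function, both evaluations agree.
exact_first = max(exact_fitness(s) for s in population)
exact_second = max(exact_fitness(s) for s in population)
print("deterministic values agree:", exact_first == exact_second)
```

The PyGAD documentation describes an optional pop_fitness argument to best_solution() for passing already computed fitness values, for example ga_instance.last_generation_fitness. Whether it is available depends on the installed version, so check the documentation for that version before relying on it.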

nikospappas1987 commented 2 years ago

Yes, I noticed this issue with a non-deterministic fitness function, but now I wonder why the two values match most of the time. If the fitness is computed separately for each of the two lines, I would expect the two values to differ most of the time for a non-deterministic fitness function.

ahmedfgad commented 2 years ago

I agree. But things look normal, and we need a test case that causes the 2 values to be different.

nikospappas1987 commented 2 years ago

OK, here's the code that produces it. I use the genetic algorithm to select the features used by a LightGBM binary classifier. I'm sorry, but I can't share the data I work on, as they are healthcare data covered by a strict data sharing agreement.

  import pygad
  import pandas as pd
  import numpy as np
  from sklearn.pipeline import Pipeline
  from sklearn.model_selection import RepeatedStratifiedKFold
  from lightgbm import LGBMClassifier
  from sklearn.model_selection import cross_val_score
  import os
  import time
  import random

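  s1 = 1  # exponent applied to the ROC-based fitness term
  s2 = 1  # exponent applied to the sparsity-based fitness term
  filename = 'genetic_runs/boris_00_stemi_macce_t0_simple_data'
  # X_train (a DataFrame) and y_train are loaded elsewhere; the data cannot be shared.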

  def compute_fitness_roc(roc, s):
      if roc<0.5:
          return 0
      return ((roc-0.5)/0.5)**s

  def compute_fitness_sparse(num_selected, s):
      if num_selected>120:
          return 0
      elif num_selected<10:
          return 1
      return ((num_selected-120)/-110)**s

  def fitness_func(solution, solution_idx):
      selected = np.array(solution).astype(bool)
      X = X_train[X_train.columns[selected]]

      pipe = Pipeline([
                       ('clf', LGBMClassifier(is_unbalance=True,
                                              metric='auc',
                                              verbosity=-1,
                                              subsample=0.35,
                                              subsample_freq=4,
                                              extra_trees=True,
                                              colsample_bytree=1.0,
                                              feature_fraction_bynode=1.0,
                                              reg_alpha=6e-7,
                                              reg_lambda=0.03,
                                              learning_rate=0.001,
                                              linear_lambda=6.6e-07,
                                              max_depth=34,
                                              min_child_samples=1,
                                              n_estimators=80,
                                              objective='cross_entropy'))])

      scores = cross_val_score(pipe,
                               X, y_train,
                               scoring='roc_auc',
                               cv=RepeatedStratifiedKFold(n_splits=4, n_repeats=10),
                               n_jobs=-1)

      roc = np.mean(scores)
      fitness_roc = compute_fitness_roc(roc, s=s1)
      fitness_sparse = compute_fitness_sparse(np.sum(solution), s=s2)
      fitness = (fitness_roc*fitness_sparse)**0.5
      return fitness

  def callback_gen(ga_instance):
      if ga_instance.generations_completed%50 == 0:
          ga_instance.save(filename=filename)
          solution, solution_fitness, solution_idx = ga_instance.best_solution()
          print(time.ctime(time.time()))
          print("Generation : ", ga_instance.generations_completed)
          print("Fitness of the best solution :", solution_fitness)
          print(f'Number of features of best solution: {np.sum(solution)}')
          print(f'Fitness of the last solution :{ga_instance.last_generation_fitness}')
          print()

  if os.path.isfile(filename + '.pkl'):
      print('loading from file')
      ga_instance = pygad.load(filename=filename)
  else:
      ga_instance = pygad.GA(num_generations=5000,
                             num_parents_mating=2,
                             fitness_func=fitness_func,
                             sol_per_pop=10,
                             num_genes=X_train.columns.shape[0],
                             gene_type=int,
                             parent_selection_type="sss",
                             keep_parents=-1,
                             crossover_type="single_point",
                             crossover_probability=None,
                             mutation_type='adaptive',
                             mutation_probability=(0.04, 0.02),
                             gene_space=[0, 1],
                             stop_criteria='saturate_1000',
                             save_solutions=True,
                             on_generation=callback_gen
                             )

  ga_instance.run()
  ga_instance.save(filename=filename)
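
To make the fitness formula in the script concrete, here is a small worked example using the two helper functions above. The input values (a mean ROC AUC of 0.75 and 65 selected features) are hypothetical and chosen only because they give round numbers. The wrapper `combined_fitness` is also an illustrative name that is not in the original script.

```python
def compute_fitness_roc(roc, s):
    # Maps ROC AUC from [0.5, 1] to [0, 1]; anything below 0.5 scores 0.
    if roc < 0.5:
        return 0
    return ((roc - 0.5) / 0.5) ** s


def compute_fitness_sparse(num_selected, s):
    # Rewards fewer features: 1 below 10 features, 0 above 120,
    # and a linear ramp (raised to the power s) in between.
    if num_selected > 120:
        return 0
    elif num_selected < 10:
        return 1
    return ((num_selected - 120) / -110) ** s


def combined_fitness(roc, num_selected, s1=1, s2=1):
    # Geometric mean of the two terms, as in fitness_func.
    fitness_roc = compute_fitness_roc(roc, s=s1)
    fitness_sparse = compute_fitness_sparse(num_selected, s=s2)
    return (fitness_roc * fitness_sparse) ** 0.5


# Hypothetical inputs: ROC AUC 0.75 gives 0.5, 65 features gives 0.5,
# so the combined fitness is sqrt(0.5 * 0.5) = 0.5.
print(combined_fitness(roc=0.75, num_selected=65))

# Either term being 0 forces the combined fitness to 0.
print(combined_fitness(roc=0.45, num_selected=65))
print(combined_fitness(roc=0.90, num_selected=130))
```

Because the two terms are multiplied, a solution scores well only if it both classifies well and selects few features.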

ahmedfgad commented 2 years ago

Thank you.