ahmedfgad / GeneticAlgorithmPython

Source code of PyGAD, a Python 3 library for building the genetic algorithm and training machine learning algorithms (Keras & PyTorch).
https://pygad.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.89k stars 465 forks

PyGAD gene_space cannot accept numpy linspace dtype=int #140

Open jdmoore7 opened 2 years ago

jdmoore7 commented 2 years ago

I am trying to use PyGAD to optimize hyper-parameters in ML models. According to documentation

The gene_space parameter customizes the space of values of each gene ... list, tuple, numpy.ndarray, or any range like range, numpy.arange(), or numpy.linspace: It holds the space for each individual gene. But this space is usually discrete. That is there is a set of finite values to select from.

As you can see, the first element of gene_space, which corresponds to solution[0] in the Genetic Algorithm definition, is an array of integers. According to the documentation, this should be a discrete space, and it is. However, when a value from this array of integers (built with np.linspace, which the documentation allows) reaches RandomForestClassifier, it is interpreted as numpy.float64 (see the error in the third code block).

I don't understand where this change of data type is occurring. Is this a PyGAD problem, and if so, how can I fix it? Or is it a numpy -> sklearn problem?

import numpy as np

gene_space = [
    # n_estimators
    np.linspace(50, 200, 25, dtype='int'),
    # min_samples_split
    np.linspace(2, 10, 5, dtype='int'),
    # min_samples_leaf
    np.linspace(1, 10, 5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0, 1, 10, dtype='float')
]
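For reference, the linspace call itself does return integers, so the space is discrete before it ever reaches PyGAD (a quick check with plain NumPy, no PyGAD involved):

```python
import numpy as np

# Same call as the first gene_space entry: 25 integer candidates for n_estimators.
n_estimators_space = np.linspace(50, 200, 25, dtype='int')

# Every element is an integer; the endpoints are preserved exactly.
print(n_estimators_space.dtype.kind)  # 'i' (signed integer)
print(n_estimators_space[0], n_estimators_space[-1])
```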

The definition of the Genetic Algorithm

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement

def fitness_function_factory(data=data, y_name='y', sample_size=100):

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )

        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)

        train_idx = sample_without_replacement(n_population=len(X_train),
                                               n_samples=sample_size)

        test_idx = sample_without_replacement(n_population=len(X_test),
                                              n_samples=sample_size)

        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])

        return fitness

    return fitness_function

And the instantiation of the Genetic Algorithm

cross_validate = pygad.GA(gene_space=gene_space,
                          fitness_func=fitness_function_factory(),
                          num_generations=100,
                          num_parents_mating=2,
                          sol_per_pop=8,
                          num_genes=len(gene_space),
                          parent_selection_type='sss',
                          keep_parents=2,
                          crossover_type="single_point",
                          mutation_type="random",
                          mutation_percent_genes=25)

cross_validate.best_solution()
>>>
ValueError: n_estimators must be an integer, got <class 'numpy.float64'>.

Any recommendations on resolving this error?

EDIT: I've tried the following, and it works:

model = RandomForestClassifier(n_estimators=gene_space[0][0])
model.fit(X,y)

So the issue does not lie in the numpy -> sklearn handoff but somewhere in PyGAD.

borisarloff commented 2 years ago

@jdmoore7, looking at your code, there is no gene_type parameter set in the GA constructor. The default gene_type is float; see https://pygad.readthedocs.io/en/latest/README_pygad_ReadTheDocs.html#pygad-ga-class. In your case, define gene_type=[int, int, int, float]; hopefully that will solve your problem.
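Setting gene_type is the clean fix. As a stopgap, you can also cast explicitly inside the fitness function, since with the default gene_type every gene arrives as numpy.float64. A minimal sketch assuming the four-gene layout above (coerce_genes is a hypothetical helper, not part of PyGAD):

```python
import numpy as np

def coerce_genes(solution):
    """Cast genes to the types sklearn expects.
    With PyGAD's default gene_type, every gene is delivered as numpy.float64."""
    return (int(solution[0]),    # n_estimators
            int(solution[1]),    # min_samples_split
            int(solution[2]),    # min_samples_leaf
            float(solution[3]))  # min_impurity_decrease

# A solution vector as PyGAD would deliver it under the default gene_type.
solution = np.array([112.0, 4.0, 3.0, 0.333], dtype=np.float64)
n_estimators, min_split, min_leaf, min_impurity = coerce_genes(solution)
```

The cast values can then be passed straight to RandomForestClassifier without the "n_estimators must be an integer" error.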