
Optimisation of constants for Python executable phenotypes #129

Open LSivo opened 3 years ago

LSivo commented 3 years ago

Hello everyone! I'm using PonyGE2 with pybnf grammars in which I have a number of constants. I stumbled across "optimize_constants" here, and it looks interesting, but I'm afraid it works only for "evaluable" phenotypes, not for "executable" ones. Am I wrong? If I am, does anybody have more detailed information to share so I can make it work? If I'm not, is there any short-term plan to make optimisation of constants work with executable phenotypes too? Thanks in advance!

jmmcd commented 3 years ago

Hmm, it seems like it should be easy.

All of our supervised learning fitness functions assume that the phenotype is a single expression, and use eval. It's not just an issue with optimize_constants.
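To illustrate what I mean (toy phenotypes, nothing from the repo):

import numpy as np

x = np.array([1.0, 2.0, 3.0])

# eval-style: the phenotype is a single expression
p_eval = "x[0] + 2 * x[1]"
y = eval(p_eval, {'x': x, 'np': np})

# exec-style: the phenotype is a block of statements that has to
# leave its answer in some agreed-upon variable
p_exec = "tmp = x[0] + 2 * x[1]\nresult = tmp * x[2]"
d = {'x': x, 'np': np}
exec(p_exec, d)
y = d['result']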

So, I guess you already have some custom code which uses exec, right? If you post it, I can suggest how to add optimize_constants support.

LSivo commented 3 years ago

Thanks for your answer, @jmmcd. As you guessed, I have some grammars that are meant to produce executable Python code, and I'm using a custom fitness function based on F1-score. Here it is:

from fitness.base_ff_classes.base_ff import base_ff
from utilities.fitness.get_data import get_data
from utilities.representation.python_filter import python_filter
from algorithm.parameters import params
import numpy as np

class handpd(base_ff):
    maximise = True

    def __init__(self):
        # Initialise base fitness function class.
        super().__init__()

        # Get training and test data
        self.training_in, self.training_exp, self.test_in, self.test_exp = \
            get_data(params['DATASET_TRAIN'], params['DATASET_TEST'])

        # Find number of variables.
        self.n_vars = np.shape(self.training_in)[0]

        # Regression/classification-style problems use training and test data.
        if params['DATASET_TEST']:
            self.training_test = True

    def evaluate(self, ind, **kwargs):
        dist = kwargs.get('dist', 'training')

        if dist == "training":
            # Set training datasets.
            x = self.training_in
            y = self.training_exp

        elif dist == "test":
            # Set test datasets.
            x = self.test_in
            y = self.test_exp

        else:
            raise ValueError("Unknown dist: " + dist)

        tp = 0
        fp = 0
        fn = 0

        # The phenotype is a block of statements, so run it with exec once per
        # training instance and read its prediction back from d['result'].
        p, d = ind.phenotype, {'is_within': is_within}

        for i in range(x.shape[1]):
            d['x'] = x[:, i]
            exec(p, d)
            y_p = d['result']
            assert np.isrealobj(y_p)
            if y_p == 1 and y[i] == 1:
                tp += 1
            elif y_p == 1 and y[i] == 0:
                fp += 1
            elif y_p == 0 and y[i] == 1:
                fn += 1

        # Guard against division by zero (possible when there are no positives at all).
        denom = (2 * tp) + fp + fn
        f1 = (2 * tp) / denom if denom else 0.0
        # print(tp, fp, fn, f1)
        return f1

def is_within(val, a, b):
    return min(a, b) <= val <= max(a, b)

Sorry, I didn't manage to render the code properly, so there are parts of it out of the code section. However, the problem is that c is unknown; maybe I should pass it within the dictionary d, but I don't know how. I've seen that in the supervised learning template there's some code to manage the constant optimisation option, but I still don't understand how to manage the execution of the code... I think I'm messing up.
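My best guess so far is something along these lines (a self-contained toy with a made-up phenotype and placeholder values, just to show what I mean by passing c through d), but I don't know if this is how it's meant to be wired up:

import numpy as np

def is_within(val, a, b):
    return min(a, b) <= val <= max(a, b)

x = np.array([0.3, 0.7])    # one training instance (placeholder values)
c = [0.5, 1.2]              # candidate constants proposed by the optimiser (placeholder)
p = "result = 1 if is_within(x[0], c[0], c[1]) else 0"    # stand-in phenotype

d = {'is_within': is_within, 'x': x, 'c': c}
exec(p, d)
y_p = d['result']           # 0 here, since 0.3 is outside [0.5, 1.2]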

jmmcd commented 3 years ago

The trick with formatting a large block of code is to use triple backticks before and after. I've fixed your comment. You can re-edit it to see the backtick syntax.

I was thinking of adding exec support to optimize_constants in a generic way, but I don't think I can do it in a way that supports your code. All of your custom code with d['x'] and tp, fp, fn would have to be replicated in the optimize_constants file, so it wouldn't be generic.

So, let's see if we can make this part more generic first.

for i in range(x.shape[1]):
    d['x'] = x[:, i]
    exec(p, d)
    y_p = d['result']

Here I think your idea is to pass one training instance into p at a time, right? I think it's the wrong way around -- normally, in the Scikit-Learn convention, each row of the dataset is a single training instance. So should we have this instead?

for i in range(x.shape[0]):
    d['x'] = x[i, :]
    exec(p, d)
    y_p = d['result']

Second, can you look at your grammar and phenotypes and see whether they can run in a vectorised way? Our supervised learning code assumes that everything is vectorised. So, the result of this would be:

d['x'] = x
exec(p, d)
y_p = d['result']

One common stumbling block in vectorisation is when you need to run if-statements. The vectorised analogue of if is numpy.where: https://numpy.org/doc/stable/reference/generated/numpy.where.html. This is sometimes enough to turn an exec situation back to an eval situation.
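For example (toy phenotype, not from your grammar):

import numpy as np

x = np.array([[0.2, 0.9],
              [0.6, 0.1],
              [0.8, 0.7]])    # three instances, two features

# exec-style, one instance at a time, with an if-statement:
#     if x[0] > 0.5:
#         result = 1
#     else:
#         result = 0

# eval-style, whole dataset at once, using np.where:
y_p = eval("np.where(x[:, 0] > 0.5, 1, 0)", {'np': np, 'x': x})
print(y_p)    # [0 1 1]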

Next, I see you want to get the tp, fp, and fn counts and eventually the f1 score. But if we have vectors y and y_p, then we can calculate all of these in a vectorised way, rather than one instance at a time. Scikit-Learn provides code for f1.
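E.g. with boolean masks, or just by calling sklearn.metrics.f1_score directly (toy labels below):

import numpy as np
from sklearn.metrics import f1_score

y   = np.array([1, 0, 1, 1, 0])    # true labels
y_p = np.array([1, 1, 0, 1, 0])    # predicted labels

tp = np.sum((y_p == 1) & (y == 1))
fp = np.sum((y_p == 1) & (y == 0))
fn = np.sum((y_p == 0) & (y == 1))
f1 = (2 * tp) / (2 * tp + fp + fn)

assert np.isclose(f1, f1_score(y, y_p))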

And indeed, f1 is already available as a PonyGE error metric: https://github.com/PonyGE/PonyGE2/blob/master/src/utilities/fitness/error_metric.py.
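If I remember right, you can select it through the parameters file rather than writing a custom fitness function, along these lines (check the current source for the exact names):

FITNESS_FUNCTION:   supervised_learning
ERROR_METRIC:       f1_score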

LSivo commented 3 years ago

Thank you, @jmmcd. About the x.shape[0] comment, I agree with you: I usually treat each row of a dataset as a single training instance, but when I started using PonyGE2 I found that x = self.training_in somehow returns a transposed version of the dataset, which is why I ended up using x.shape[1] and d['x'] = x[:, i]. About your second comment, I'll need some time to check whether I can rework my grammars into a vectorised style in a reasonable time. Thank you anyway for your help!

jmmcd commented 3 years ago

> About the x.shape[0] comment, I agree with you: I usually treat each row of a dataset as a single training instance, but when I started using PonyGE2 I found that x = self.training_in somehow returns a transposed version of the dataset, which is why I ended up using x.shape[1] and d['x'] = x[:, i].

Yes, you're right. I've made a new issue #130 for that discussion. For this issue, let's continue as-is.

jmmcd commented 3 years ago

I've made the change and closed #130. Happy to continue discussion here about vectorisation, or how to get things done using exec.