LSivo opened this issue 3 years ago (status: Open)
Hmm, it seems like it should be easy.

All of our supervised learning fitness functions assume that the phenotype is a single expression, and use `eval`. It's not just an issue with `optimize_constants`.

So, I guess you already have some custom code which uses `exec`, right? If you post it, I can suggest how to add `optimize_constants`.
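The `eval`/`exec` distinction mentioned above can be sketched as follows (toy expressions, not real phenotypes from any grammar):

```python
# eval handles a single expression and returns its value directly:
value = eval("2 * x + 1", {"x": 3})
assert value == 7

# exec handles arbitrary statements (assignments, ifs, loops) and
# returns None; results come back through the namespace dict instead:
ns = {"x": 3}
exec("if x > 0:\n    result = 2 * x + 1\nelse:\n    result = 0", ns)
assert ns["result"] == 7
```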
Thanks for your answer, @jmmcd. As you guessed, I have some grammars that are meant to produce executable Python code, and I'm using a custom fitness function based on the F1-score. Here it is:
```python
from fitness.base_ff_classes.base_ff import base_ff
from utilities.fitness.get_data import get_data
from utilities.representation.python_filter import python_filter
from algorithm.parameters import params
import numpy as np


class handpd(base_ff):

    maximise = True

    def __init__(self):
        # Initialise base fitness function class.
        super().__init__()

        # Get training and test data.
        self.training_in, self.training_exp, self.test_in, self.test_exp = \
            get_data(params['DATASET_TRAIN'], params['DATASET_TEST'])

        # Find number of variables.
        self.n_vars = np.shape(self.training_in)[0]

        # Regression/classification-style problems use training and test data.
        if params['DATASET_TEST']:
            self.training_test = True

    def evaluate(self, ind, **kwargs):
        dist = kwargs.get('dist', 'training')

        if dist == "training":
            # Set training datasets.
            x = self.training_in
            y = self.training_exp
        elif dist == "test":
            # Set test datasets.
            x = self.test_in
            y = self.test_exp
        else:
            raise ValueError("Unknown dist: " + dist)

        tp = 0
        fp = 0
        fn = 0

        p, d = ind.phenotype, {'is_within': is_within}
        for i in range(x.shape[1]):
            d['x'] = x[:, i]
            exec(p, d)
            y_p = d['result']
            assert np.isrealobj(y_p)
            if y_p == 1 and y[i] == 1:
                tp += 1
            elif y_p == 1 and y[i] == 0:
                fp += 1
            elif y_p == 0 and y[i] == 1:
                fn += 1

        f1 = (2 * tp) / ((2 * tp) + fp + fn)
        # print(tp, fp, fn, f1)
        return f1


def is_within(val, a, b):
    return min(a, b) <= val <= max(a, b)
```
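As a minimal sketch of the `exec`-plus-namespace mechanism this evaluation loop relies on (the phenotype string below is invented for illustration, not produced by a real grammar):

```python
# A made-up phenotype: a statement that reads x and assigns result.
p = "result = 1 if x[0] > 0.5 else 0"

d = {}          # shared namespace: inputs go in, 'result' comes out
d['x'] = [0.9]  # one training instance
exec(p, d)      # the phenotype reads d['x'] and writes d['result']

assert d['result'] == 1
```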
Sorry, I didn't manage to render the code properly, so parts of it ended up outside the code section.
However, the problem is that `c` is unknown. Maybe I should pass it within the dictionary `d`, but I don't know how. I've seen that the supervised learning template has some code to handle the constant-optimization option, but I still don't understand how to manage the execution of the code... I think I'm messing up.
The trick with formatting a large block of code is to use triple backticks before and after. I've fixed your comment. You can re-edit it to see the backtick syntax.
I was thinking of adding `exec` support to `optimize_constants` in a generic way, but I don't think I can do it in a way that supports your code. All of your custom code with `d['x']` and `tp`, `fp`, `fn` would have to be replicated in the `optimize_constants` file, so it wouldn't be generic.
So, let's see if we can make this part more generic first.
```python
for i in range(x.shape[1]):
    d['x'] = x[:, i]
    exec(p, d)
    y_p = d['result']
```
Here I think your idea is to pass one training instance into `p` at a time, right? I think it's the wrong way around -- normally, in the Scikit-Learn convention, each row of the dataset is a single training instance. So should we have this instead?
```python
for i in range(x.shape[0]):
    d['x'] = x[i, :]
    exec(p, d)
    y_p = d['result']
```
Second, can you look at your grammar and phenotypes and see whether they can run in a vectorised way? Our supervised learning code assumes that everything is vectorised. So, the result of this would be:
```python
d['x'] = x
exec(p, d)
y_p = d['result']
```
One common stumbling block in vectorisation is when you need to run `if`-statements. The vectorised analogue of `if` is `numpy.where`: https://numpy.org/doc/stable/reference/generated/numpy.where.html. This is sometimes enough to turn an `exec` situation back into an `eval` situation.
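A toy illustration of the `if` → `numpy.where` rewrite (the threshold rule here is made up, not taken from the grammar above):

```python
import numpy as np

x = np.array([0.2, 1.5, -0.3, 2.0])

# Scalar version: an if-statement, so it needs exec or a Python loop.
def classify_scalar(xi):
    if xi > 1.0:
        return 1
    return 0

# Vectorised version: numpy.where applies the condition elementwise,
# turning the whole loop into a single eval-able expression.
y_p = np.where(x > 1.0, 1, 0)

assert y_p.tolist() == [classify_scalar(xi) for xi in x]
```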
Next, I see you want to get the `tp`, `fp`, and eventually `f1` scores. But if we have vectors `y` and `y_p`, then we can calculate all of these in a vectorised way, rather than one at a time. Scikit-Learn gives code for `f1`.
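A sketch of the vectorised route, with made-up 0/1 label vectors standing in for `y` (truth) and `y_p` (predictions):

```python
import numpy as np

# Hypothetical label vectors for illustration.
y   = np.array([1, 0, 1, 1, 0, 1])
y_p = np.array([1, 0, 0, 1, 1, 1])

# Vectorised confusion counts: each comparison is an elementwise boolean
# array, and summing it counts the matching positions.
tp = np.sum((y_p == 1) & (y == 1))
fp = np.sum((y_p == 1) & (y == 0))
fn = np.sum((y_p == 0) & (y == 1))

# Same F1 formula as the per-sample loop, computed in one pass.
f1 = (2 * tp) / (2 * tp + fp + fn)
```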
And indeed, `f1` is already available as a PonyGE error metric: https://github.com/PonyGE/PonyGE2/blob/master/src/utilities/fitness/error_metric.py.
Thank you, @jmmcd.
About the `x.shape[0]` comment, I agree with you: I usually treat each row of a dataset as a single training instance, but when I started to use PonyGE2 I found that `x = self.training_in`, for instance, returns a somewhat transposed version of the dataset, and this is why I ended up using `x.shape[1]` and `d['x'] = x[:, i]`.
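The features-first layout described above can be sketched like this (shapes invented for illustration, not PonyGE2's actual data):

```python
import numpy as np

# Hypothetical dataset with 3 features and 5 instances, stored
# features-first, i.e. shape (n_vars, n_instances).
x = np.arange(15).reshape(3, 5)
assert x.shape == (3, 5)

# Each column is one training instance, so the loop runs over
# x.shape[1] and takes x[:, i] -- the transpose of the usual
# rows-as-instances convention.
first_instance = x[:, 0]
assert first_instance.tolist() == [0, 5, 10]
```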
About your second comment, I'll need some time to check if I can change my grammars in a reasonable time so that I can use a vectorised style. Thank you anyway for your help!
> About the `x.shape[0]` comment, I agree with you: I usually treat each row of a dataset as a single training instance, but when I started to use PonyGE2 I found that `x = self.training_in`, for instance, returns a somewhat transposed version of the dataset, and this is why I ended up using `x.shape[1]` and `d['x'] = x[:, i]`.
Yes, you're right. I've made a new issue #130 for that discussion. For this issue, let's continue as-is.
I've made the change and closed #130. Happy to continue discussion here about vectorisation, or how to get things done using `exec`.
Hello everyone! I'm using PonyGE2 with pybnf grammars, in which I have a number of constants. I stumbled across `optimize_constants` here, and it looks interesting, but I'm afraid it works only for "evaluable" phenotypes, not for "executable" ones. Am I wrong? If I am, does anybody have more detailed information to share so I can make it work? If I'm not, is there any short-term plan to make optimisation of constants work with executable phenotypes too? Thanks in advance!