EmuKit / emukit

A Python-based toolbox of various methods in decision making, uncertainty quantification and statistical emulation: multi-fidelity, experimental design, Bayesian optimisation, Bayesian quadrature, etc.
https://emukit.github.io/emukit/
Apache License 2.0

categorical one hot encoding #319

Closed ghost closed 3 years ago

ghost commented 4 years ago

Calling sample_uniform on a one-hot encoded category returns the encoded 1-D array rather than the category value, which breaks the space's sample_uniform.

apaleyes commented 4 years ago

That doesn't sound very nice, thanks for reporting. Would you mind sharing the code you used to reproduce this issue?

ghost commented 4 years ago

from emukit.core import CategoricalParameter, OneHotEncoding, OrdinalEncoding, ParameterSpace
from emukit.core.loop import RandomSampling, LoopState

cat_one = ["up", "down", "left", "right"]
cat_two = ["low", "medium", "high", "superhigh"]

# one categorical parameter per encoding type
parameters = [CategoricalParameter(name='one', encoding=OneHotEncoding(cat_one)),
              CategoricalParameter(name='two', encoding=OrdinalEncoding(cat_two))]
space = ParameterSpace(parameters)

# draw one random point from the space
loop = LoopState([])
points = RandomSampling(space)
points.compute_next_points(loop)
# e.g. array([[1., 0., 0., 0., 4.]])  (encoded values, not category names)

The next points are passed to the evaluate method of the target function, so I assume they should not be the encoded values, unless I am using it wrong? Ordinal values are not converted back either, so this looks like a problem with encodings in general.

btw,

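# one-hot with numpy
import numpy as np
categories = ["up", "down", "left", "right"]
# fill_diagonal needs a 2-D array; this is equivalent to np.eye(len(categories))
encodings = np.zeros((len(categories), len(categories)))
np.fill_diagonal(encodings, 1)

# ordinal: just np.arange(len(categories)) (?)

apaleyes commented 4 years ago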

Yes, the output isn't decoded. That's an intentional decision, because Emukit does not do modelling and expects a model as an input. We have no control over the way X is fed into the model. Therefore, to be as unopinionated as possible, we decided not to do encoding/decoding as part of Emukit's pipeline. It is essentially a trade-off between convenience and confusing behavior (which we could easily fall into by trying to cater for all possible use cases).
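Since decoding is left to the caller, one option is to decode each sampled row before passing it to the target function. The following is a minimal sketch in plain Python, not part of Emukit: the helpers `decode_one_hot`, `decode_ordinal` and `decode_row` are hypothetical names, the column layout (four one-hot columns followed by one ordinal column) follows the example in this thread, and the ordinal codes are assumed to run from 1 to n, as the sampled value 4 for four categories suggests.

```python
from typing import List, Sequence

cat_one = ["up", "down", "left", "right"]         # one-hot encoded in the example
cat_two = ["low", "medium", "high", "superhigh"]  # ordinal encoded in the example


def decode_one_hot(block: Sequence[float], categories: List[str]) -> str:
    """Return the category whose one-hot slot holds the largest value."""
    if len(block) != len(categories):
        raise ValueError("one-hot block length must match the number of categories")
    index = max(range(len(block)), key=lambda i: block[i])
    return categories[index]


def decode_ordinal(value: float, categories: List[str]) -> str:
    """Return the category for an ordinal code, assuming codes run 1..n."""
    index = int(round(value)) - 1
    if not 0 <= index < len(categories):
        raise ValueError(f"ordinal code {value} is out of range")
    return categories[index]


def decode_row(row: Sequence[float]) -> List[str]:
    """Split one sampled row into its one-hot block and its ordinal column."""
    n = len(cat_one)
    return [decode_one_hot(row[:n], cat_one), decode_ordinal(row[n], cat_two)]


# A row shaped like the one sampled in this thread
print(decode_row([1.0, 0.0, 0.0, 0.0, 4.0]))  # ['up', 'superhigh']
```

The helpers only index into the row, so a row of the NumPy array returned by compute_next_points should work the same way as a plain list. Calling `decode_row` at the top of the target function keeps the encoded representation for the model while the function itself sees category names.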