automl / ConfigSpace

Domain specific language for configuration spaces in Python. Useful for hyperparameter optimization and algorithm configuration.
https://automl.github.io/ConfigSpace/

`np.array` on `Configuration` returns names of HPs instead of values #363

Open benjamc opened 4 months ago

benjamc commented 4 months ago

Calling np.array on a Configuration returns the names of the hyperparameters instead of their values. Is that intended? I'm on ConfigSpace 0.7.1.

MWE:

import numpy as np
from ConfigSpace import ConfigurationSpace, Float

cs = ConfigurationSpace()
for i in range(3):
    cs.add_hyperparameter(Float(f"x_{i}", (0, 1)))

config = cs.sample_configuration()
print("output:", np.array(config))
print("expected:", config.get_array())

Output:

output: ['x_0' 'x_1' 'x_2']
expected: [0.09972031 0.91072326 0.16544557]

eddiebergman commented 4 months ago

Short answer: yes, it's intended to give you the keys. Use np.array(list(config.values())) if you want the unnormalized values in a numpy array.

Sorry for typos, phone typing...


np.array relies on the ad-hoc __array__ protocol, implemented by things like pandas DataFrames, torch tensors and others, to convert objects to arrays efficiently.
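
To make the protocol concrete, here is a minimal sketch. `Triple` is a made-up class for illustration only (it is not part of ConfigSpace, pandas or torch); the only thing it does is define `__array__`, which np.array then uses for the conversion.

```python
import numpy as np


class Triple:
    """Hypothetical container that opts into the __array__ protocol."""

    def __init__(self, a, b, c):
        self._data = (a, b, c)

    def __array__(self, dtype=None, copy=None):
        # np.array / np.asarray call this hook when it exists.
        # `copy` is accepted for NumPy 2 compatibility; a new array is built either way.
        return np.array(self._data, dtype=dtype)


arr = np.array(Triple(1.0, 2.0, 3.0))
print(arr)
```

The object decides for itself what its array form is, so numpy never has to guess by iterating over it.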

However, things like a Python list don't have this, and neither do other list-like objects. I'm sure calling np.array([1,2,3]) does something smart to pull out the values 1,2,3, but users can also implement their own list-like (Sequence/MutableSequence), for which the only thing you can do is index or iterate it. Conceptually, the dispatch might look something like this:

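from collections.abc import Iterable, Sequence
from typing import Any

import numpy as np

def array(x: Any) -> np.ndarray:
    if hasattr(x, "__array__"):
        # follow the protocol: the object converts itself
        return x.__array__()
    elif isinstance(x, (list, tuple)):  # or another builtin Python container
        # numpy does some low-level CPython manipulation here
        return np.asarray(x)
    elif isinstance(x, Sequence):
        # user-implemented list-like, can't really do better than this
        x_data = [x[i] for i in range(len(x))]
        return array(x_data)
    elif isinstance(x, Iterable):
        x_data = list(x)
        return array(x_data)
    else:
        # scalars and anything else
        return np.asarray(x)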

Now the main point: Configuration is a Mapping (dict-like), so in this setup it would match the Iterable branch. np.array can't do anything smart with a Mapping, so it falls back to using __iter__ on it. The behaviour matches that of calling list() on a dict, which iterates through the keys.
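
The list() analogy can be checked with the standard library alone. `ToyConfig` below is a hypothetical stand-in for a dict-like configuration, not the real ConfigSpace class; it only implements the three methods that collections.abc.Mapping requires.

```python
from collections.abc import Mapping


class ToyConfig(Mapping):
    """Hypothetical dict-like config: a thin read-only wrapper around a dict."""

    def __init__(self, values):
        self._values = dict(values)

    def __getitem__(self, key):
        return self._values[key]

    def __iter__(self):
        # Iterating a Mapping yields its keys, exactly like a dict.
        return iter(self._values)

    def __len__(self):
        return len(self._values)


cfg = ToyConfig({"x_0": 0.1, "x_1": 0.9, "x_2": 0.2})
keys = list(cfg)             # what plain iteration sees
values = list(cfg.values())  # the values have to be asked for explicitly
print(keys)
print(values)
```

Anything that consumes the object purely through iteration ends up with the hyperparameter names, which is the output reported at the top of this issue.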

I would argue the main use case of a Configuration is that it behaves more like a dict than a vector, so making it act like a Sequence doesn't make sense. Further, the unnormalized values can be strings, floats, ints, and soon arbitrary values, which doesn't make much sense for a numpy array. You could argue for putting the normalized values in there, but that's really far from the common use case of a Configuration.


Had some time and did this on my phone but could you check some stuff for me?

pd.Series acts like a dict (kinda), i.e. heterogeneous key-value pairs... But it also comes from a library that implements the __array__ protocol. What happens when you do np.array(pd.Series({"a":1, "b":2}))?

If it gives you an array of [1, 2], I could be persuaded to look into the array protocol so that what you posted works. Otherwise, if it gives ["a", "b"] or an error, I would stick to keeping the behaviour as it would normally be for a Mapping, even when there is some vectorized format available.
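
For reference, a rough sketch of what "looking into the array protocol" could mean for a dict-like object. `ArrayConfig` is hypothetical and is not ConfigSpace's implementation; it assumes all values are numeric so that a plain float array makes sense.

```python
from collections.abc import Mapping

import numpy as np


class ArrayConfig(Mapping):
    """Hypothetical dict-like config that also exposes a vector form."""

    def __init__(self, values):
        self._values = dict(values)

    def __getitem__(self, key):
        return self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

    def __array__(self, dtype=None, copy=None):
        # Assumption: the vector form is just the values in insertion order.
        return np.array(list(self._values.values()), dtype=dtype)


cfg = ArrayConfig({"a": 1.0, "b": 2.0})
as_keys = list(cfg)       # still behaves like a Mapping
as_array = np.array(cfg)  # np.array now goes through __array__
print(as_keys, as_array)
```

With the hook in place the object keeps its dict-like iteration, while np.array gets the values instead of the keys.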

benjamc commented 4 months ago

Thank you for the explanation, makes sense! Feel free to close the issue.

Running np.array(pd.Series({"a":1, "b":2})) yields array([1, 2]).

eddiebergman commented 2 months ago

We've decided np.array(config) will behave as you expected at the top of this issue! Will get to it when we have time :)

benjamc commented 2 months ago

Awesome!