automl / SMAC3

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
https://automl.github.io/SMAC3/v2.1.0/

Handling of Inactive Hyperparameters in SMAC’s Surrogate Models #1078

Closed simonprovost closed 11 months ago

simonprovost commented 11 months ago

I would like to express my appreciation for the outstanding work done on the new iteration of the SMAC framework; it is a genuinely useful framework. However, I would like more information on how inactive hyperparameters are handled when input data is provided to the random forest regression surrogate model during the SMAC procedure.

Description

I first encountered the concept of inactive hyperparameters about a year ago while perusing numerous papers and engaging in discussions around the topic. By inactive hyperparameters (a popular name for them, but let's be precise) I mean conditional hyperparameters whose conditions are not satisfied by a given configuration. For instance, while the decision tree and random forest algorithms share some hyperparameters, the n_estimators hyperparameter is unique to the random forest algorithm; therefore, whenever a configuration selects the decision tree, n_estimators should never be sampled for it nor used by the surrogate in SMAC's process.

While I am certain that SMAC already handles this, I am much more perplexed as to how it manages these inactive hyperparameters. My current understanding is that SMAC leaves their imputation to the surrogate model itself, since they could be managed differently by a Gaussian process than by a random forest, for instance. The surrogate model of interest here, a random forest regressor, receives input data in which rows represent configurations whose cost values are known and columns represent the hyperparameters of the search space. If I am not mistaken, the inactive hyperparameters of a given configuration are pre-represented (before anything is run by the surrogate model) by a specific placeholder value, namely NaN.

Within the surrogate itself, in this case the random forest regressor, I have observed that inactive categorical hyperparameters are imputed with the number of possible choices of the hyperparameter as the placeholder, whereas inactive float/integer hyperparameters are imputed with -1 and inactive constants with 1.
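To make that concrete, here is a minimal sketch of the placeholder rule as I understand it (my own reconstruction, not SMAC's actual implementation), assuming the columns of X follow the order of cs.get_hyperparameters() and that inactive entries arrive as NaN:

import numpy as np
from ConfigSpace.hyperparameters import CategoricalHyperparameter, Constant

def impute_inactive_sketch(X, hyperparameters):
    # Replace NaN (inactive) entries column by column with out-of-range placeholders.
    X = X.copy()
    for col, hp in enumerate(hyperparameters):
        nan_rows = np.isnan(X[:, col])
        if isinstance(hp, CategoricalHyperparameter):
            # Active vector values are 0 .. n_choices - 1, so n_choices is never taken.
            X[nan_rows, col] = len(hp.choices)
        elif isinstance(hp, Constant):
            # An active constant has vector value 0, so 1 is out of range.
            X[nan_rows, col] = 1
        else:
            # Numerical HPs are normalized to [0, 1], so -1 is out of range.
            X[nan_rows, col] = -1
    return X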

Questions:

Most important:
Less important:

Steps/Code to Reproduce

_Note that my initial confusion stems from this docstring, which says "Impute inactive hyperparameters in configurations with their default", yet the code is not imputing anything as far as I understand; rather, it simply stacks the vector representation of each configuration: return np.array([config.get_array() for config in configs], dtype=np.float64)_
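As a simplified illustration of this point, using only public ConfigSpace calls, get_array() leaves inactive hyperparameters as NaN rather than imputing their defaults:

from ConfigSpace import (
    CategoricalHyperparameter,
    ConfigurationSpace,
    EqualsCondition,
    UniformIntegerHyperparameter,
)

cs = ConfigurationSpace(seed=0)
algo = cs.add_hyperparameter(CategoricalHyperparameter("algo", ["dt", "rf"]))
n_estimators = cs.add_hyperparameter(UniformIntegerHyperparameter("n_estimators", 10, 200))
cs.add_condition(EqualsCondition(n_estimators, algo, "rf"))

for config in cs.sample_configuration(5):
    # The n_estimators column is nan whenever algo == 'dt'.
    print(config, config.get_array())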

Versions

dengdifan commented 11 months ago

Hi @simonprovost thanks for the information.

> How exactly are inactive hyperparameters managed in the dataset provided to SMAC's random forest surrogate? Are they handled as described above? Following this GitHub issue, would you be open to a PR adding an FAQ entry explaining how they are handled?

The main reason we impute the NaN values for the RF is that our RF surrogate model is built on the pyrfr package, which is written in C++ and wrapped with SWIG. NaN values might not be easily transferred to the corresponding C++ types through SWIG; therefore, we need to impute those values.

> Why is the number of options a categorical hyperparameter offers used to represent it when it is inactive? What is the logic behind this decision?

In ConfigSpace, categorical HPs are encoded numerically ([0, 1, 2, ..., n_opts - 1]), as shown in this line. Therefore, an active categorical HP will never take the vector value n_opts.
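For example (a minimal check of this encoding with public ConfigSpace calls):

from ConfigSpace import CategoricalHyperparameter, ConfigurationSpace

cs = ConfigurationSpace(seed=0)
cs.add_hyperparameter(CategoricalHyperparameter("algo", ["a", "b", "c"]))
config = cs.sample_configuration()
print(config["algo"])      # e.g. 'b'
print(config.get_array())  # the index 0.0, 1.0, or 2.0; an active HP never yields 3.0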

> Why is -1 used for inactive float/integer hyperparameters, and what effect does this decision have on the model? Is -1 not regarded as one of the possible values? Or, as I have observed elsewhere, are float/integer hyperparameters rescaled? If so, could you please elaborate on this type of inactive hyperparameter?

Similar to categorical HPs, numerical HPs (float & int) are represented as vector values within [0, 1] (this normalization method is also used in the GP models). Therefore, an active numerical HP will never take the value -1.
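Again as a minimal illustration:

from ConfigSpace import ConfigurationSpace, UniformIntegerHyperparameter

cs = ConfigurationSpace(seed=0)
cs.add_hyperparameter(UniformIntegerHyperparameter("max_depth", 1, 20))
config = cs.sample_configuration()
print(config["max_depth"])  # the actual value, somewhere in [1, 20]
print(config.get_array())   # the normalized vector value in [0.0, 1.0]; never -1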

> Were there any considerations for modifying the decision trees' splitting criteria to handle inactive hyperparameters based on a flag or some other mechanism, as opposed to using placeholders that steer the decision trees towards splits with almost no information gain for these hyperparameters?

Since our surrogate models are based on pyrfr, this might not be easy to implement.

> Would you confirm that https://github.com/scikit-learn/scikit-learn/pull/23595 is not going to interfere with these potential inactive hyperparameters, which can effectively be seen as missing values? Given your extra layer of missing-value imputation, I reckon this is not an issue, yet it is always great to confirm.

We are also considering reimplementing our RF models based on scikit-learn's models; however, I cannot promise when that will happen.

> Is there a method in the API to print the input data given to the surrogate, so that we can inspect it visually? If not, could you point us to a good starting place in the code for adding such printing after forking SMAC?

For a configuration, you can simply call config.get_array() to get its numerical representation.

Hope that answers all your questions.

simonprovost commented 11 months ago

Hi @dengdifan,

I greatly appreciate your detailed response; it has helped clarify numerous aspects. I understand that placeholder values outside the range of their active counterparts are assigned to inactive hyperparameters to prevent them from significantly influencing the surrogate model.

From the discussion, it appears that the placeholder values are unlikely to be selected for meaningful splits in the decision trees of the surrogate model: because these placeholders are uniform and uncorrelated with the target values, the information gain from splitting on them is typically low, particularly when working with a densely populated configuration-based dataset.

Nonetheless, I am curious about two things:

In the meantime, thank you again for your insights, and I eagerly await your response to the remaining questions. Additionally, future readers should find this useful for understanding, and possibly enhancing, the handling of inactive hyperparameters in SMAC. To that end, starting from the unit tests, I wrote the following snippet to help visualise the missing-value imputation performed by SMAC's RF surrogate, so that you can see roughly how it is done, although this is a very simplified example:

import numpy as np
from ConfigSpace import (
    CategoricalHyperparameter,
    ConfigurationSpace,
    EqualsCondition,
    UniformIntegerHyperparameter,
)
from rich.console import Console
from rich.table import Table
from smac.model.random_forest.random_forest import RandomForest

def display_hyperparameter_configurations(size=10):
    def convert_configurations_to_array(configs):
        return np.array([config.get_array() for config in configs])

    # Define the configuration space
    cs = ConfigurationSpace(seed=0)

    # Algorithm hyperparameter
    algorithm = cs.add_hyperparameter(CategoricalHyperparameter("algorithm", ["decision_tree", "random_forest"]))

    # Decision Tree hyperparameters
    criterion = cs.add_hyperparameter(CategoricalHyperparameter("criterion", ["gini", "entropy"]))
    max_depth = cs.add_hyperparameter(UniformIntegerHyperparameter("max_depth", 1, 20))

    # Conditions for Decision Tree hyperparameters
    cs.add_condition(EqualsCondition(criterion, algorithm, "decision_tree"))
    cs.add_condition(EqualsCondition(max_depth, algorithm, "decision_tree"))

    # Random Forest hyperparameters
    n_estimators = cs.add_hyperparameter(UniformIntegerHyperparameter("n_estimators", 10, 200))
    max_features = cs.add_hyperparameter(CategoricalHyperparameter("max_features", ["auto", "sqrt", "log2"]))

    # Conditions for Random Forest hyperparameters
    cs.add_condition(EqualsCondition(n_estimators, algorithm, "random_forest"))
    cs.add_condition(EqualsCondition(max_features, algorithm, "random_forest"))

    # Sample configurations
    configs = cs.sample_configuration(size=size)
    config_array = convert_configurations_to_array(configs)

    model = RandomForest(configspace=cs)
    config_array = model._impute_inactive(config_array)

    hp_names = [hp.name for hp in cs.get_hyperparameters()]

    console = Console()
    table = Table(show_header=True, header_style="bold magenta")
    for name in hp_names:
        table.add_column(name)
    for config in config_array:
        table.add_row(*map(str, config))

    console.print(table)

# Call the function to display the configurations
display_hyperparameter_configurations(size=50)
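With the placeholder rules discussed above, the decision_tree rows of the printed table should show n_estimators imputed as -1.0 and max_features as 3.0 (its number of choices), while the random_forest rows should show criterion as 2.0 and max_depth as -1.0.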

Following your answer @dengdifan , consider this issue done 👍 Cheers.

simonprovost commented 11 months ago

Given the low priority of the remaining queries, I'll close this issue to let more important ones take precedence. Please feel free to reopen if you have time, or if any reader wishes to learn more about the two questions posed most recently.

Cheers!