facebook / Ax

Adaptive Experimentation Platform
https://ax.dev
MIT License

Performance issue #2716

Closed ivarbratberg closed 2 months ago

ivarbratberg commented 2 months ago

Question

"Hi, why does it take about 20 minutes to run a simple optimization?"

I am optimizing a toy model with 6 unknowns, where the objective is (x[0]-1)^2 + (x[1]-1)^2. I am running it in batches, as shown in the snippet below.

I am running this in Docker, from Jupyter, and I see it uses 20 CPUs on a rather powerful PC. If I use, e.g., the Optuna library for Bayesian optimization, it runs much faster. The covariance matrix should only be about 6 x 6, so I don't think matrix inversion or finding eigenvalues can explain this. Best regards, Ivar Bratberg

Please provide any relevant code snippet if applicable.

for i in range(15):
    gpei = Models.BOTORCH_MODULAR(experiment=exp, data=table)
    generator_run = gpei.gen(n=3)
    trial = exp.new_batch_trial(generator_run=generator_run)
    trial.run()
    trial.mark_completed()
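
A note on the 6 x 6 estimate above: in a standard Gaussian process the kernel matrix is indexed by observations rather than by parameters, so its size grows with the number of evaluated points. The following is a minimal pure-Python sketch of that point only, not Ax or BoTorch code. The `rbf_kernel` and `kernel_matrix` helpers, the fixed lengthscale, and the [-1, 1] bounds are assumptions made for illustration.

```python
import math
import random


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two points (hypothetical helper)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def kernel_matrix(points):
    """Build the n x n kernel matrix for n observed points."""
    return [[rbf_kernel(p, q) for q in points] for p in points]


rng = random.Random(0)
dim = 6      # number of parameters
n_obs = 45   # 15 batches of 3 arms, as in the loop above

# Assumed bounds of [-1, 1] per parameter, purely for illustration.
points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_obs)]
K = kernel_matrix(points)

print(len(K), len(K[0]))  # 45 45
```

Even at 45 x 45 the matrix is small, so the linear algebra alone probably does not account for minutes of runtime; the sketch only shows that the relevant dimension is the number of observations, not the 6 parameters.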


ivarbratberg commented 2 months ago

I think it was my mistake: I forgot to create a new experiment before optimizing. After making sure to start with a new Experiment object each time, it finishes in about one second.

ivarbratberg commented 2 months ago

Sorry, but I see it still takes a lot of time: about 4 minutes for 10 iterations, even when I start with an empty experiment. Is this expected for such a small example?

mgarrard commented 2 months ago

Hi @ivarbratberg, thanks for reaching out. Since Ax typically shortens the overall experimentation time by leveraging BO to home in on the outcome of interest, the time to run is usually a worthwhile trade-off. Would love to understand the time frame that would be acceptable for your use case, and any particularities that make your use case time-sensitive, to see if we can help :)

Balandat commented 2 months ago

@ivarbratberg could you please provide the full code for the repro? From the code you shared I am not sure you're using the package correctly.

ivarbratberg commented 2 months ago

Thanks a lot for answering! I agree that the time of the experiment itself will sometimes dominate, as is sometimes the case for my usage. However, sometimes the experiments can take as little as 10 seconds, with many experiments run in parallel. In this example, with 6 parameters to adjust, I cannot understand why it should take 20 CPUs more than a few seconds to compute the posterior and/or find the eigenvalues, so I wonder what takes so long. Below is the essential code, showing what I have tested. It does not run out of the box, as I have copied and pasted a bit, but it should show how I have used the Ax library. I run this with 6 free parameters to optimize, although, as you can see, I am only using two of them in the formula to optimize.

class CM(Metric):
    """A metric defined by a generic deterministic function, with normal noise with mean 0 and mean_sd scale added to the result. """

    def __init__(
        self,
        name: str,
        lower_is_better: Optional[bool] = None,
    ) -> None:

        super().__init__(name=name, lower_is_better=lower_is_better)

    @classmethod
    def is_available_while_running(cls) -> bool:
        return True

    def clone(self) -> CM:
        return self.__class__(
            name=self._name,
            lower_is_better=self.lower_is_better,
        )

    def fetch_trial_data(self, trial: BaseTrial, noisy: bool = True, **kwargs: Any) -> MetricFetchResult:
        try:
            arm_names = []
            mean = []
            ix = -1
            for name, arm in trial.arms_by_name.items():
                ix += 1
                arm_names.append(name)
                val = trial.run_metadata["y"][ix]
                mean.append(val)

            df = pd.DataFrame(
                {
                    "arm_name": arm_names,
                    "metric_name": self.name,
                    "mean": mean,
                    "sem": 0,
                    "trial_index": trial.index,
                    "n": 10000 / len(arm_names),
                    "frac_nonnull": mean,
                }
            )

            return Ok(value=Data(df=df))

        except Exception as e:
            return Err(MetricFetchE(message=f"Failed to fetch {self.name}", exception=e))

class MyRunner(Runner):
    def run(self, trial):
        trial_metadata = {"name": str(trial.index)}

        y = []
        for arm in trial.arms:
            y.append(float(arm.parameters["x1"] - 1) ** 2 + float(arm.parameters["x2"] - 2) ** 2)
        trial_metadata = {"y": y}
        return trial_metadata

def _create_ax(control_parameters, minimize, nr_initial_random_samples, max_parallelism, objective_parameter_name, control_parameter_names, job):
    parameters = []
    for name, cp in control_parameters.items():
        parameters.append(RangeParameter(name=name, parameter_type=ParameterType.FLOAT, lower=cp["lowerBound"], upper=cp["upperBound"]))
    search_space = SearchSpace(parameters=parameters)

    optimization_config = OptimizationConfig(
        objective=Objective(
            metric=CM(name="CM"),
            minimize=minimize,
        )
    )

    runner = MyRunner()
    MyRunner.objective_parameter_name = objective_parameter_name
    MyRunner.control_parameter_names = control_parameter_names
    MyRunner.job = job
    experiment = Experiment(
        name="test",
        search_space=search_space,
        optimization_config=optimization_config,
        runner=runner,
    )

    # https://ax.dev/tutorials/generation_strategy.html
    generation_strategy = GenerationStrategy(
        steps=[
            # 1. Initialization step (does not require pre-existing data and is well-suited for
            # initial sampling of the search space)
            GenerationStep(
                model=Models.SOBOL,
                num_trials=nr_initial_random_samples,  # How many trials should be produced from this generation step
                min_trials_observed=nr_initial_random_samples,  # How many trials need to be completed to move to next model
                max_parallelism=max_parallelism,  # Max parallelism for this step
                model_kwargs={"seed": 999},  # Any kwargs you want passed into the model
                should_deduplicate=True,
                model_gen_kwargs={},  # Any kwargs you want passed to `modelbridge.gen`
            ),
            # 2. Bayesian optimization step (requires data obtained from previous phase and learns
            # from all data available at the time of each new candidate generation call)
            GenerationStep(
                model=Models.BOTORCH_MODULAR,
                num_trials=-1,  # No limitation on how many trials should be produced from this step
                max_parallelism=max_parallelism,  # Parallelism limit for this step, often lower than for Sobol
                should_deduplicate=True,
                # More on parallelism vs. required samples in BayesOpt:
                # https://ax.dev/docs/bayesopt.html#tradeoff-between-parallelism-and-total-number-of-trials
            ),
        ]
    )

    return experiment, generation_strategy

# Excerpt from the calling method:
    experiment, generation_strategy = self._create_ax(
        control_parameters,
        minimize,
        nr_initial_random_samples,
        num_sample_points_per_iteration,
        objective_parameter_name,
        control_parameter_names,
        self.job,
    )

    num_iterations_without_progress = 0
    num_evaluations = 0
    num_evaluations_without_progress = 0
    fbest_previous = np.inf  # np.infty was removed in NumPy 2.0

    iteration_id = -1
    data = None
    while True:
        iteration_id = iteration_id + 1
        generator_run = generation_strategy.gen(n=num_sample_points_per_iteration, experiment=experiment, data=data)
        trial = experiment.new_batch_trial(generator_run=generator_run)
        data = experiment.fetch_data()
        trial.run()
        trial.mark_completed()
       .. 
       # rest of the code to control progress and stopping

Balandat commented 2 months ago

What are the values for nr_initial_random_samples and num_sample_points_per_iteration?

It's quite challenging for us to provide proper support if we don't have a fully reproducible & runnable code example.

Also, is there a specific reason you're using the "developer API" rather than the "service API" (https://ax.dev/tutorials/gpei_hartmann_service.html)? It doesn't look like you're doing anything that would require the use of the developer API.

ivarbratberg commented 2 months ago

Thanks, nr_initial_random_samples = 8, and they are all done very fast, as I would expect, not a lot of processing happening there, even for Sobol. num_sample_points_per_iteration = 3.

To be honest, the code I ended up with is a bit random. I started out with the examples in the tutorial, but I could not make them run; it seems like the examples are not unit tested? I had to dig a bit in the code, trying different examples, and ended up with this as the first working code where I had control of the GenerationStrategy and the batching. These were two criteria for me, as I intend to build further on this code to experiment with different algorithms and batching.

I will recreate the example with Optuna, BayesianSampling to compare time usage.

Thanks a lot for the response and for your willingness to help me out.

ivarbratberg commented 2 months ago

I have now compared with Optuna: Optuna took 17 seconds and Ax took 117 seconds. Optuna also seemed to converge more quickly, but I guess that is because it is preconfigured to work well in general and is not customizable. I will try to configure Bayesian optimization in Ax to use the same algorithms and hyperparameters as Optuna, and compare again.

Here is the Optuna code:

import optuna

sampler = optuna.samplers.GPSampler(n_startup_trials=10)
es = optuna.create_study(direction="minimize", sampler=sampler)
minimize = True
y_array = []
for iteration_id in range(25):
    x_evaluate = []
    trials = []
    # Ask for a batch of 3 candidates before evaluating any of them.
    for _ in range(3):
        trial = es.ask()
        x = {
            # Optuna documents parameter names as strings.
            f"x{k}": trial.suggest_float(f"x{k}", -1, 1)
            for k in range(6)
        }
        x_evaluate.append(list(x.values()))
        trials.append(trial)

    # Note: this objective sums over all 6 parameters.
    y_vals = [sum((xx - 1) ** 2 for xx in x) for x in x_evaluate]
    y_array.append(y_vals)

    # Report each result back to the study.
    for trial, y_val in zip(trials, y_vals):
        es.tell(trial, y_val if minimize else -y_val)
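
When comparing wall-clock numbers like the 17 and 117 seconds above, it may help to time candidate generation separately from objective evaluation, so that it is clear which phase dominates. The following is a minimal standard-library sketch under stated assumptions: `timed_loop`, `suggest`, and `evaluate` are hypothetical names, and the random-search `random_suggest` is only a stand-in for an optimizer's ask step, not Ax or Optuna code.

```python
import random
import time


def timed_loop(suggest, evaluate, n_iterations, batch_size):
    """Run an ask/evaluate loop and time the two phases separately."""
    suggest_seconds = 0.0
    evaluate_seconds = 0.0
    history = []  # list of (x, y) pairs seen so far
    for _ in range(n_iterations):
        start = time.perf_counter()
        batch = [suggest(history) for _ in range(batch_size)]
        suggest_seconds += time.perf_counter() - start

        start = time.perf_counter()
        results = [(x, evaluate(x)) for x in batch]
        evaluate_seconds += time.perf_counter() - start

        history.extend(results)
    return suggest_seconds, evaluate_seconds, history


rng = random.Random(0)


def random_suggest(history):
    """Stand-in for an optimizer: ignores history and samples uniformly."""
    return [rng.uniform(-1.0, 1.0) for _ in range(6)]


def toy_objective(x):
    """The two-term toy objective from the original question."""
    return (x[0] - 1) ** 2 + (x[1] - 1) ** 2


suggest_s, evaluate_s, history = timed_loop(
    random_suggest, toy_objective, n_iterations=25, batch_size=3
)
print(f"suggest: {suggest_s:.4f}s, evaluate: {evaluate_s:.4f}s, points: {len(history)}")
```

Swapping `random_suggest` for a wrapper around each library's candidate-generation call, with the same objective and bounds on both sides, would make the two timings directly comparable.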

ivarbratberg commented 2 months ago

Right, I guess this post can explain some of the efficiency gap between Optuna and Ax's Bayesian optimization? https://medium.com/optuna/introducing-optunas-native-gpsampler-0aa9aa3b4840

mgarrard commented 2 months ago

Sorry, @ivarbratberg, could you provide some more details about this: "I started out with the examples in the tutorial, but I could not make them run, it seems like the examples are not unit tested?"

Ideally an example of any failures

ivarbratberg commented 2 months ago

Hi, I am sorry, I should not have written that. I must have done something wrong when I was incrementally trying to use the tutorial code by pasting it section by section. When I downloaded the complete code example and ran it on Google Colab, the tutorials all worked fine. Thanks a lot for your attention and answers; it was incredible to get replies so fast. I think I can close this question for now.