facebook / Ax

Adaptive Experimentation Platform
https://ax.dev

[Question] How to add fixed parameter choices in GenerationStrategy? #2506

Open StevePny opened 1 month ago

StevePny commented 1 month ago

Hi,

I'm relatively new to Ax and looking for an answer to a particular use case. For context, we're using a custom implementation of the Slurm API to optimize a large model running on AWS ParallelCluster, and we have some baseline parameterizations that we'd like to use to initialize the optimization.

To do this, I'd like to be able to use a user-defined generation strategy as in: https://ax.dev/tutorials/generation_strategy.html#1A.-Manually-configured-generation-strategy

gs = GenerationStrategy(
    steps=[
        # 1. Initialization step (does not require pre-existing data and is well-suited for
        # initial sampling of the search space)
        GenerationStep(
            model=Models.SOBOL,
            num_trials=5,  # How many trials should be produced from this generation step
            min_trials_observed=3,  # How many trials need to be completed to move to next model
            max_parallelism=5,  # Max parallelism for this step
            model_kwargs={"seed": 999},  # Any kwargs you want passed into the model
            model_gen_kwargs={},  # Any kwargs you want passed to `modelbridge.gen`
        ),
        # 2. Bayesian optimization step (requires data obtained from previous phase and learns
        # from all data available at the time of each new candidate generation call)
        GenerationStep(
            model=Models.BOTORCH_MODULAR,
            num_trials=-1,  # No limitation on how many trials should be produced from this step
            max_parallelism=3,  # Parallelism limit for this step, often lower than for Sobol
            # More on parallelism vs. required samples in BayesOpt:
            # https://ax.dev/docs/bayesopt.html#tradeoff-between-parallelism-and-total-number-of-trials
        ),
    ]
)

I'd like to extend this with a 'Step 0' that evaluates a single 'status_quo'-type parameter set representative of our existing best tuned model.

I've tried, in a toy experiment, to manually attach a baseline trial using ax_client.attach_trial(parameters={"rho": _rho, "beta": 3.0, "sigma": _sigma}), but during the optimization that trial just sits at 'RUNNING' and never completes (I'm assuming because it is not part of the already-defined GenerationStrategy):

    trial_index  arm_name    trial_status  generation_method  result     rho  beta       sigma
0   0            status_quo  RUNNING       Manual             NaN        28   3.000000   10
1   1            1_0          COMPLETED     Sobol              11.480003  28   6.287298   10
2   2            2_0          COMPLETED     Sobol              4.605844   28   2.778990   10
3   3            3_0          COMPLETED     Sobol              6.979540   28   0.981536   10
4   4            4_0          COMPLETED     BoTorch            13.565324  28   10.000000  10

So my primary question is: What is the preferred way to set up one or more pre-selected parameter sets for this kind of use-case, so that the pre-selected trials run first during the optimization?

A secondary question: similarly, if there are offline runs that are not part of the optimization but for which (unlike the case above) we already have a "result" metric calculated, what is the recommended way to add that information?

e.g., this capability is offered in the SMT package (https://smt.readthedocs.io/en/latest/) via the xdoe and ydoe arguments (xt and yt) in the EGO routine below:

#Efficient Global Optimization
ego = EGO(n_iter=n_iter,  #Number of optimizer steps
          criterion=criterion,
          xdoe=xt,        #initial points to start search
          ydoe=yt,        #initial point loss function outputs
          xtypes=xtypes,
          xlimits=xlimits,
          n_start=30,     #Number of optimization start points
          n_max_optim=35, #Maximum number of internal optimizations
          enable_tunneling=False,
          surrogate=sm,
          n_parallel = 4, #Number of parallel samples to compute using qEI criterion
          verbose = True)
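
Here xt and yt are just NumPy arrays of previously evaluated points and their loss values, something like the following (numbers made up for illustration):

import numpy as np

# Previously evaluated design points: one row per run, one column per input
xt = np.array([[28.0, 3.0, 10.0],
               [28.0, 6.3, 10.0]])
# Corresponding loss values, one row per design point
yt = np.array([[11.5],
               [4.6]])
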
mgrange1998 commented 3 weeks ago

Hello, thank you for opening this issue. Here is another issue about evaluating specific parameter sets before the exploration phase: https://github.com/facebook/Ax/issues/136. The suggested pattern looks like this:

params1, trial_index1 = ax.attach_trial(parameters={"x1": 0.0})
params2, trial_index2 = ax.attach_trial(parameters={"x1": 0.0})

# run your evaluation here...

ax.complete_trial(trial_index1, [data here])
ax.complete_trial(trial_index2, [data here])

In your case, it is likely you need to call "complete_trial" in order for the custom arm to go from RUNNING to COMPLETED. Let me know if this helps your issue
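
For example, here is a minimal sketch of that pattern with AxClient (the experiment setup and the evaluate_baseline function are placeholders for illustration, not taken from your setup):

from ax.service.ax_client import AxClient, ObjectiveProperties

def evaluate_baseline(parameters):
    # Stand-in for your own evaluation (e.g. a Slurm job on your cluster)
    return parameters["x1"] ** 2

ax_client = AxClient()
ax_client.create_experiment(
    name="baseline_demo",
    parameters=[{"name": "x1", "type": "range", "bounds": [-5.0, 5.0]}],
    objectives={"objective": ObjectiveProperties(minimize=True)},
)

# Attach the known baseline parameterization; it becomes trial 0.
params, baseline_index = ax_client.attach_trial(parameters={"x1": 0.0})

# attach_trial does not run anything, so evaluate it yourself...
result = evaluate_baseline(params)

# ...and report the result so the trial goes from RUNNING to COMPLETED
# and its data is available to later generation steps.
ax_client.complete_trial(trial_index=baseline_index, raw_data=result)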

StevePny commented 3 weeks ago

Hi @mgrange1998, thanks for your reply. We're using something like the submitit tutorial, where there is a loop checking on the status of the submitted jobs. I was able to reproduce the problem there, so that would probably be the better place to focus the discussion.

Let's say, then, that after running ax_client.create_experiment(...) I add the following to the submitit.ipynb tutorial notebook:

params_baseline, trial_index_baseline = ax_client.attach_trial(parameters={"x": 0.0, "y": 0.0})

In this case, trial_index_baseline=0. In the tutorial it looks like the loop should cycle through jobs and complete them with ax_client.complete_trial(trial_index=trial_index, raw_data=result) in the code block below:

(For simplicity, I changed num_parallel_jobs to 1 and total_budget to 4.)

# Imports needed by this block (defined earlier in the tutorial notebook);
# `executor` and `evaluate` also come from earlier cells.
import time

from ax.service.utils.report_utils import exp_to_df
from submitit import DebugJob, LocalJob

total_budget = 4
num_parallel_jobs = 1

jobs = []
submitted_jobs = 0
# Run until all the jobs have finished and our budget is used up.
while submitted_jobs < total_budget or jobs:
    for job, trial_index in jobs[:]:
        # Poll if any jobs completed
        # Local and debug jobs don't run until .result() is called.
        if job.done() or type(job) in [LocalJob, DebugJob]:
            result = job.result()
            ax_client.complete_trial(trial_index=trial_index, raw_data=result)
            jobs.remove((job, trial_index))

    # Schedule new jobs if there is availability
    trial_index_to_param, _ = ax_client.get_next_trials(
        max_trials=min(num_parallel_jobs - len(jobs), total_budget - submitted_jobs))
    for trial_index, parameters in trial_index_to_param.items():
        job = executor.submit(evaluate, parameters)
        submitted_jobs += 1
        jobs.append((job, trial_index))
        time.sleep(1)

    # Display the current trials.
    display(exp_to_df(ax_client.experiment))

    # Sleep for a bit before checking the jobs again to avoid overloading the cluster.
    # If you have a large number of jobs, consider adding a sleep statement in the job polling loop as well.
    time.sleep(30)

However, the attached trial is never evaluated and so is never marked complete. It shows up as RUNNING alongside the first Sobol-generated trial, but it is never actually submitted to run:

[Screenshot: experiment trial table with the attached trial 0 stuck in RUNNING]

It doesn't look like ax_client.get_next_trials() ever returns the attached trial with index 0. Instead, it jumps straight to trial 1, produced by the Sobol step:

[Screenshot: get_next_trials output, starting at trial 1]

StevePny commented 3 weeks ago

Hi @mgrange1998, I found a solution that seems to work, though it is a bit of a hack:

First add the trials:

# SGP added: test attached trials
params_baseline, trial_index_baseline = ax_client.attach_trial(parameters={"x": 0.0, 
                                                                           "y": 0.0})

Our optimization loop is actually in a function, so it needs to pick up the experiment data from the ax_client before starting the while loop above:

def optimization_loop(ax_client, model_run_func, executor, evaluate, total_budget=10, num_parallel_jobs=1):
    jobs = []
    submitted_jobs = 0

    # Check if the ax_client already has manually attached trials and submit them first
    if ax_client.experiment.num_trials > 0:
        for trial_index, trial in ax_client.experiment.trials.items():
            parameters = trial.arm.parameters
            job = executor.submit(evaluate, parameters)
            submitted_jobs += 1
            # Pair each job with its own trial index (not trial_index_baseline,
            # which only refers to the first attached trial)
            jobs.append((job, trial_index))

    # Run until all the jobs have finished and our budget is used up.
    while submitted_jobs < total_budget or jobs:
    ...

Then when calling:

ax_client = optimization_loop(ax_client, model_run_func, executor, evaluate_basic, total_budget=total_budget, num_parallel_jobs=num_parallel_jobs)

It appears to run correctly and does not continue beyond the first trial until that first manually submitted trial completes. I did not test this with multiple attached trials, though.

StevePny commented 3 weeks ago

Do you have a suggestion for my second question: if there are offline runs that are not part of the optimization but for which we already have a "result" metric calculated, what is the recommended way to add that information?
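
For concreteness, something like the following is what I have in mind, just as a sketch (the parameters reuse the toy x/y example and the metric values are made up; I'm not sure this is the recommended approach):

# Offline runs evaluated outside the optimization, with their pre-computed metrics.
offline_runs = [
    ({"x": 0.0, "y": 0.0}, 11.5),
    ({"x": 0.5, "y": 1.0}, 4.6),
]

for parameters, result in offline_runs:
    _, trial_index = ax_client.attach_trial(parameters=parameters)
    # Report the already-known metric immediately so the trial is COMPLETED
    # and its data can be used by the Bayesian optimization step.
    ax_client.complete_trial(trial_index=trial_index, raw_data=result)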