facebook / Ax

Adaptive Experimentation Platform
https://ax.dev
MIT License

Automate selection of appropriate parameters for BoTorch components in Ax based on experiment and data size #674

Closed josalhor closed 3 years ago

josalhor commented 3 years ago

Hello again (I hope I am not causing too much trouble to the team :) ),

I am here to report a possible bug. Attaching many trials through the Service API consumes a lot of memory. Here is an example (warning: a couple of machines started thrashing and then froze while running the following code):

from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties
import itertools

def evaluate(args):
    return {
        'a': (5_000, 0.0),
    }

ax_client = AxClient(random_seed=64)
ax_client.create_experiment(
    name="ax_err",
    parameters=[
        {'name': 'p1', 'type': 'range', 'bounds': [0, 5000], 'value_type': 'int'},
        {'name': 'p2', 'type': 'range', 'bounds': [0, 6000], 'value_type': 'int'},
        {'name': 'p3', 'type': 'range', 'bounds': [0, 7000], 'value_type': 'int'},
    ],
    objectives={
        'a': ObjectiveProperties(minimize=True, threshold=10_000)
    },
)

def range_float(stop, percent=10/100):
    # Build a coarse grid over [0, stop): 0, percent*stop, 2*percent*stop, ...
    l = []
    c = 0
    while c < stop:
        l.append(int(c))
        c += percent * stop
    return l

r_p1 = range_float(5000)
r_p2 = range_float(6000)
r_p3 = range_float(7000)

force_trials = []
for p1, p2, p3 in itertools.product(r_p1, r_p2, r_p3):
    config, trial_index = ax_client.attach_trial({
        'p1': p1,
        'p2': p2,
        'p3': p3
    })
    evaluations = evaluate(config)
    ax_client.complete_trial(trial_index=trial_index, raw_data=evaluations)

for _ in range(15):
    (config, trial_index) = ax_client.get_next_trial()
    evaluations = evaluate(config)
    ax_client.complete_trial(trial_index=trial_index, raw_data=evaluations)

I've replicated this issue with MOO. Increasing the step percentage (e.g. to 20/100, which attaches fewer trials) does not cause this behavior. If I had to guess, it does not appear to be a memory leak; I think some internal operation of Ax/BoTorch has very steep memory complexity.

Balandat commented 3 years ago

So if I am reading this correctly you're generating 0.1 * 5000 * 0.1 * 6000 * 0.1 * 7000 = 210 * 10^6 trials here? This is a lot for any method, let alone Bayesian optimization. Since we're computing pairwise distances, this would require storing a kernel matrix with roughly 4.4 * 10^16 elements (even before computing any predictions), which is a lot...
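As a quick back-of-the-envelope check (a sketch; assumes dense float64 storage for the kernel matrix):

n = 500 * 600 * 700   # 210 million points from a 10% grid per dimension
elements = n ** 2     # pairwise kernel matrix entries
print(f"{elements:.1e} elements, ~{elements * 8 / 1e15:.0f} PB in float64")

which is far beyond what any exact GP implementation can hold in memory.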

The methods Ax uses are simply not designed for this kind of problem; if it's indeed feasible for you to make that many evaluations, you'll probably want to look into other methods (e.g. local optimization without surrogate models, or genetic algorithms) instead. Ax generally targets problems where maybe a few hundred evaluations are feasible.
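For illustration only (this is not an Ax API), a surrogate-free optimizer such as SciPy's differential evolution can afford evaluation budgets in the thousands or more; a minimal sketch, using a cheap hypothetical stand-in for the objective from the repro script:

from scipy.optimize import differential_evolution

# Hypothetical stand-in objective over (p1, p2, p3); replace with the real evaluation.
def objective(x):
    p1, p2, p3 = x
    return 5_000.0

bounds = [(0, 5000), (0, 6000), (0, 7000)]
result = differential_evolution(objective, bounds, maxiter=50, seed=64)
print(result.x, result.fun)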

josalhor commented 3 years ago

So if I am reading this correctly you're generating 0.1 * 5000 * 0.1 * 6000 * 0.1 * 7000 = 210 * 10^6 trials here?

Sorry, that example probably wasn't the best. range_float generates a list, and the step of the range is calculated as a percentage of the range, so it is 10^n combinations with n = 3. It is actually just 1000 trials.

Since we're computing pairwise distances

That makes a lot of sense. But what I am describing should be about 1 million elements for the kernel. Is that too much?

Balandat commented 3 years ago

It is actually just 1000 trials.

I see. So that's reasonable, but generally at the upper end of what we typically handle with our default models (not from a memory perspective - a 1M-element kernel matrix is just 8 MB - but from a computational one). I am wondering whether there is something else that causes a high memory footprint (e.g. caching); it would be good to do some profiling here.
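One lightweight way to get these numbers (a sketch; assumes the repro script above and a Unix-like OS, where the standard library's resource module exposes peak RSS):

import resource  # Unix-only; psutil.Process().memory_info() is an alternative elsewhere

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"peak RSS before get_next_trial: {peak_rss_mb():.0f} MB")
config, trial_index = ax_client.get_next_trial()
print(f"peak RSS after get_next_trial:  {peak_rss_mb():.0f} MB")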

josalhor commented 3 years ago

In case it helps, I've narrowed it down to the following call stack:

... ?
_call_impl (......\venv\Lib\site-packages\torch\nn\modules\module.py:1051)
gen_batch_initial_conditions (......\venv\Lib\site-packages\botorch\optim\initializers.py:149)
optimize_acqf (......\venv\Lib\site-packages\botorch\optim\optimize.py:166)
scipy_optimizer (......\venv\Lib\site-packages\ax\models\torch\botorch_defaults.py:322)
make_and_optimize_acqf (......\venv\Lib\site-packages\ax\models\torch\botorch.py:382)
gen (......\venv\Lib\site-packages\ax\models\torch\botorch.py:396)
_model_gen (......\venv\Lib\site-packages\ax\modelbridge\torch.py:207)
_gen (......\venv\Lib\site-packages\ax\modelbridge\array.py:274)
gen (......\venv\Lib\site-packages\ax\modelbridge\base.py:669)
_gen_multiple (......\venv\Lib\site-packages\ax\modelbridge\generation_strategy.py:509)
gen (......\venv\Lib\site-packages\ax\modelbridge\generation_strategy.py:405)
_gen_new_generator_run (......\venv\lib\site-packages\ax\service\ax_client.py:1095)
get_next_trial (......\venv\lib\site-packages\ax\service\ax_client.py:327)
actual_wrapper (......\venv\Lib\site-packages\ax\utils\common\executils.py:147)
<module> (......\src\err_ax.py:50)

The call to forward on qNoisyExpectedImprovement blows up the memory consumption. It happens during the call to get_next_trial on the 7th iteration after attaching the trials, while the acquisition function is being optimized (see the stack above). I cannot make sense of the call stack past that point.

lena-kashtelyan commented 3 years ago

@Balandat, should we transfer this issue to the BoTorch repo, do you think?

Balandat commented 3 years ago

I don't think it's a BoTorch issue per se; it's just that this is using the default values for random restarts and raw samples when generating initial conditions on training data of very large size. In BoTorch these parameters can be chosen, so we may want to revisit how they are chosen in Ax in the large-data regime.
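Roughly speaking, the initializer evaluates the acquisition function on a large batch of random raw samples, and (non-cached) qNEI computes a joint posterior over the q candidates plus all n observed points, so memory grows on the order of raw_samples * (n + q)^2. A sketch of the arithmetic with illustrative values (the raw_samples figure here is an assumption, not Ax's actual default):

n = 1000            # attached trials in the 10/100 repro above
q = 1               # arms generated per trial
raw_samples = 1024  # hypothetical initializer batch size

cov_elements = raw_samples * (n + q) ** 2
print(f"~{cov_elements * 8 / 1e9:.1f} GB for one float64 covariance batch")

Intermediate tensors (Cholesky factors, MC samples) add multiples of this, which is in the right ballpark for the peak-memory figures reported in the next comment. In BoTorch itself, these knobs are the num_restarts and raw_samples arguments to optimize_acqf, which is what the workaround further down adjusts.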

@josalhor what do you mean by a lot of memory? Can you be a bit more specific? In general BoTorch heavily exploits data parallelism, so we generally make the tradeoff of using more memory for less wall time, but of course this should happen within reason.

josalhor commented 3 years ago

what do you mean by a lot of memory? Can you be a bit more specific?

This is peak memory consumption adjusting the percent variable.

Percent     Attached trials   Peak memory
20 / 100    123               1.5 GB
15 / 100    341               7 GB
12 / 100    727               > 21 GB (killed manually at that point)

This memory consumption comes in two phases. Here is a screenshot of the memory profile for the 15 / 100 entry:

[Image: memory profile for the 15 / 100 run]

And here is one for the 12 / 100 run (killed manually):

[Image: memory profile for the 12 / 100 run]

I've seen runs where the valley between the two phases is much less pronounced. It may be an issue that stems from both high memory consumption and garbage-collection weirdness.

lena-kashtelyan commented 3 years ago

Hi @josalhor, sorry for the delay on this! Let me split this issue into two:

  1. how to manually limit memory consumption of qNEI for large-data settings (will show this below),
  2. automate selection of appropriate parameters for BoTorch components in Ax based on experiment and data size (this is something we would like to do, but in the long term, so I'll be sending this to our wishlist master issue).

To limit memory consumption of qNEI, the easiest way is probably to use our modular BotAx setup (i.e. Models.BOTORCH_MODULAR instead of the Models.GPEI that AxClient is using for you under the hood right now). To do so, you'll need to:

  1. Construct a generation strategy that will use Models.BOTORCH_MODULAR and pass acquisition function options to it (check out the "Using Models.BOTORCH_MODULAR in generation strategies" section of the modular BotAx tutorial for instructions);
    1. As part of model_kwargs for the BoTorch generation step, set "acquisition_options" to something like {"optimizer_options": {"num_restarts": 10, "raw_samples": 256}};
    2. For more details on generation strategies and their options, check out the generation strategy tutorial.
  2. Pass the resulting generation strategy to AxClient via AxClient(generation_strategy=...).

You should end up with something like this:

from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models

gs = GenerationStrategy(
    steps=[
        GenerationStep(  # Initialization step
            # Which model to use for this step
            model=Models.SOBOL,
            # How many generator runs (each of which is then made a trial) 
            # to produce with this step
            num_trials=5,
            # How many trials generated from this step must be `COMPLETED` 
            # before the next one
            min_trials_observed=5, 
        ),
        GenerationStep(  # BayesOpt step
            model=Models.BOTORCH_MODULAR,
            # No limit on how many generator runs will be produced
            num_trials=-1,
            model_kwargs={  # Kwargs to pass to `BoTorchModel.__init__`
                 "acquisition_options": {"optimizer_options": {"num_restarts": 10, "raw_samples": 256}}
            },
        )
    ]
)

ax_client = AxClient(generation_strategy=gs)

Let us know if this doesn't work for you! With this, I'll consider part 1 of the issue resolved and will mark it as wishlist for part 2.

sgbaird commented 2 years ago

Maybe worth mentioning that I think you can also specify it as follows, based on https://ax.dev/api/_modules/ax/models/torch/alebo.html#ALEBO.gen, but maybe that only applies to ALEBO:

GenerationStep(  # BayesOpt step
    model=Models.BOTORCH_MODULAR,
    # No limit on how many generator runs will be produced
    num_trials=-1,
    model_gen_kwargs={"num_restarts": 10, "raw_samples": 256},
)