So if I am reading this correctly, you're generating 0.1 * 5000 * 0.1 * 6000 * 0.1 * 7000 = 210 * 10^6 trials here? This is a lot for any method, let alone Bayesian Optimization. Since we're computing pairwise distances, this would need to store a kernel matrix with roughly 4.4 * 10^16 elements (even before computing any predictions), which is a lot...

The methods Ax uses are simply not designed for this kind of problem. If it's indeed feasible for you to make that many evaluations, you'll probably want to look into other methods (e.g. local optimization without surrogate models, or genetic algorithms) instead. Ax generally targets problems where maybe a few hundred evaluations are feasible.
> So if I am reading this correctly you're generating 0.1 * 5000 * 0.1 * 6000 * 0.1 * 7000 = 210 * 10^6 trials here?

Sorry, that example probably wasn't the best. `range_float` generates a list, and the step of the range is calculated from a percentage, so it is 10^n trials with n=3. It is actually just 1000 trials.
> Since we're computing pairwise distances

That makes a lot of sense. But what I am describing should be about 1 million elements for the kernel. Is that too much?
> It is actually just 1000 trials.

I see. So that's reasonable, but generally at the upper end of what we typically handle with our default models (not from a memory perspective, since a 1M-element kernel matrix is just 8 MB, but from a computational one). I am wondering whether there is some other thing that causes a high memory footprint (e.g. caching); it would be good to do some profiling here.
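For concreteness, here is a quick back-of-the-envelope check of the two kernel-matrix sizes discussed above (a minimal sketch; the 8-byte figure assumes float64 entries):

```python
# An n x n kernel matrix of float64 entries takes n * n * 8 bytes.
def kernel_matrix_bytes(n_points: int, bytes_per_entry: int = 8) -> int:
    return n_points * n_points * bytes_per_entry

# ~210 million points (the misread example): ~3.5 * 10^17 bytes, i.e. hundreds of petabytes.
print(kernel_matrix_bytes(210_000_000) / 1e15, "PB")

# 1000 trials (the actual setup): 8 * 10^6 bytes, i.e. about 8 MB.
print(kernel_matrix_bytes(1_000) / 1e6, "MB")
```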
In case it helps, I've narrowed it down to the following call stack:

```
... ?
_call_impl (......\venv\Lib\site-packages\torch\nn\modules\module.py:1051)
gen_batch_initial_conditions (......\venv\Lib\site-packages\botorch\optim\initializers.py:149)
optimize_acqf (......\venv\Lib\site-packages\botorch\optim\optimize.py:166)
scipy_optimizer (......\venv\Lib\site-packages\ax\models\torch\botorch_defaults.py:322)
make_and_optimize_acqf (......\venv\Lib\site-packages\ax\models\torch\botorch.py:382)
gen (......\venv\Lib\site-packages\ax\models\torch\botorch.py:396)
_model_gen (......\venv\Lib\site-packages\ax\modelbridge\torch.py:207)
_gen (......\venv\Lib\site-packages\ax\modelbridge\array.py:274)
gen (......\venv\Lib\site-packages\ax\modelbridge\base.py:669)
_gen_multiple (......\venv\Lib\site-packages\ax\modelbridge\generation_strategy.py:509)
gen (......\venv\Lib\site-packages\ax\modelbridge\generation_strategy.py:405)
_gen_new_generator_run (......\venv\lib\site-packages\ax\service\ax_client.py:1095)
get_next_trial (......\venv\lib\site-packages\ax\service\ax_client.py:327)
actual_wrapper (......\venv\Lib\site-packages\ax\utils\common\executils.py:147)
<module> (......\src\err_ax.py:50)
```
The call to `forward` on `qNoisyExpectedImprovement` blows up the memory consumption. It happens in the call to `get_next_trial`, on the 7th iteration after attaching the trials, during model fitting. I cannot make sense of the call stack past that point.
@Balandat, should we transfer this issue to the BoTorch repo, do you think?
I don't think it's a BoTorch issue per se; it's just that this is using the default values for random restarts and raw samples for generating initial conditions on training data of very large size. In BoTorch these parameters can be chosen, so we may want to revisit how they are chosen in Ax in the large-data regime.
@josalhor what do you mean by a lot of memory? Can you be a bit more specific? In general BoTorch heavily exploits data parallelism, so we generally make the tradeoff of using more memory for less wall time, but of course this should happen within reason.
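To make the knobs mentioned above concrete, here is a minimal BoTorch-level sketch (the synthetic training data, dimensions, and objective are assumptions for illustration; the point is simply where `num_restarts` and `raw_samples` enter, since Ax normally picks these defaults for you):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.acquisition.monte_carlo import qNoisyExpectedImprovement
from botorch.optim import optimize_acqf

# Toy training data standing in for the attached trials (assumed, for illustration).
train_X = torch.rand(100, 3, dtype=torch.double)
train_Y = train_X.sum(dim=-1, keepdim=True)

model = SingleTaskGP(train_X, train_Y)  # hyperparameters left unfitted for brevity
qnei = qNoisyExpectedImprovement(model=model, X_baseline=train_X)

bounds = torch.stack(
    [torch.zeros(3, dtype=torch.double), torch.ones(3, dtype=torch.double)]
)

# `raw_samples` random candidates are scored in one batched forward pass (inside
# `gen_batch_initial_conditions`) to pick `num_restarts` starting points. For qNEI the
# joint covariance scales roughly with raw_samples * (n_train + q)^2, so with large
# training data lowering these values trades memory for wall time.
candidate, value = optimize_acqf(
    acq_function=qnei,
    bounds=bounds,
    q=1,
    num_restarts=10,
    raw_samples=256,
)
```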
> what do you mean by a lot of memory? Can you be a bit more specific?

This is the peak memory consumption when adjusting the `percent` variable.
Percent | Attached Trials | Peak Mem |
---|---|---|
20 / 100 | 123 | 1.5 GB |
15 / 100 | 341 | 7 GB |
12 / 100 | 727 | > 21 GB (killed manually at that point) |
This memory consumption comes in two phases. Here is a screenshot for the 15 / 100 entry:

[screenshot omitted]

Here it is for the 12 / 100 one (manually killed):

[screenshot omitted]
I've seen runs where the valley between the two phases is much less pronounced. It may be an issue that comes from a combination of high memory consumption and garbage-collection weirdness.
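Since profiling was suggested above, here is one way the memory profile per iteration could be logged (a minimal sketch; `psutil` is a third-party dependency, and the `ax_client` and `evaluate` arguments are assumed to come from the user's existing script):

```python
import psutil  # third-party; used only to read the process resident set size (RSS)

def rss_mb() -> float:
    return psutil.Process().memory_info().rss / 1e6

def run_with_memory_log(ax_client, evaluate, n_iterations: int = 10) -> None:
    """Run the Service API loop, printing the process RSS after each trial."""
    print(f"before loop: {rss_mb():.0f} MB")
    for i in range(n_iterations):
        parameters, trial_index = ax_client.get_next_trial()
        ax_client.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
        print(f"iteration {i}: {rss_mb():.0f} MB")
```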
Hi @josalhor, sorry for the delay on this! Let me split up this issue into two:

To limit memory consumption of qNEI, the easiest way is probably to use our modular BotAx setup (so `Models.BOTORCH_MODULAR` instead of the `Models.GPEI` that AxClient is using for you under the hood right now). To do so, you'll need to:

1. set up a generation strategy that uses `Models.BOTORCH_MODULAR` and pass acquisition function options to it (check out the "using `Models.BOTORCH_MODULAR` in generation strategies" section of the modular BotAx tutorial for instructions);
2. in `model_kwargs` for the BoTorch generation step, set `"acquisition_options"` to something like `{"optimizer_options": {"num_restarts": 10, "raw_samples": 256}}`;
3. pass the generation strategy to `AxClient` via `AxClient(generation_strategy=...)`.

You should end up with something like this:
```python
from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models
from ax.service.ax_client import AxClient

gs = GenerationStrategy(
    steps=[
        GenerationStep(  # Initialization step
            # Which model to use for this step
            model=Models.SOBOL,
            # How many generator runs (each of which is then made a trial)
            # to produce with this step
            num_trials=5,
            # How many trials generated from this step must be `COMPLETED`
            # before the next one
            min_trials_observed=5,
        ),
        GenerationStep(  # BayesOpt step
            model=Models.BOTORCH_MODULAR,
            # No limit on how many generator runs will be produced
            num_trials=-1,
            model_kwargs={  # Kwargs to pass to `BoTorchModel.__init__`
                "acquisition_options": {
                    "optimizer_options": {"num_restarts": 10, "raw_samples": 256}
                }
            },
        ),
    ]
)
ax_client = AxClient(generation_strategy=gs)
```
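For completeness, here is a minimal sketch of how this client might then be used in the Service API loop (the experiment definition and `evaluate` function below are hypothetical placeholders, not taken from the original issue):

```python
ax_client.create_experiment(
    name="memory_repro",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [0.0, 5000.0]},
        {"name": "x2", "type": "range", "bounds": [0.0, 6000.0]},
        {"name": "x3", "type": "range", "bounds": [0.0, 7000.0]},
    ],
    objective_name="score",
)

def evaluate(parameters):
    # Placeholder objective; substitute the real evaluation.
    return sum(parameters.values())

for _ in range(20):
    parameters, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
```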
Let us know if this doesn't work for you! With this, I'll consider part 1 of the issue resolved and will mark it as wishlist for part 2.
Maybe worth mentioning: I think you can also specify it as the following, based on https://ax.dev/api/_modules/ax/models/torch/alebo.html#ALEBO.gen, but maybe that only applies to ALEBO:

```python
GenerationStep(  # BayesOpt step
    model=Models.BOTORCH_MODULAR,
    # No limit on how many generator runs will be produced
    num_trials=-1,
    model_gen_kwargs={"num_restarts": 10, "raw_samples": 256},
)
```
Hello again (I hope I am not causing too much trouble to the team :) ),

I am here to report a possible bug: attaching many trials through the Service API consumes a lot of memory. Here you have an example (warning: a couple of machines started thrashing and then froze with the following code):
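The original snippet is not included in this excerpt; below is an assumed reconstruction of the kind of script described, based on the `range_float` helper and `percent` variable mentioned earlier in the thread. Names, ranges, and the objective are illustrative, not the exact original code:

```python
from itertools import product
from ax.service.ax_client import AxClient

def range_float(start, stop, percent):
    """Assumed helper: grid from `start` to `stop` with a step of `percent` of the span."""
    step = (stop - start) * percent
    values, x = [], start
    while x < stop:
        values.append(x)
        x += step
    return values

percent = 12 / 100  # smaller values mean more attached trials and higher peak memory

ax_client = AxClient()
ax_client.create_experiment(
    name="attach_many_trials",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [0.0, 5000.0]},
        {"name": "x2", "type": "range", "bounds": [0.0, 6000.0]},
        {"name": "x3", "type": "range", "bounds": [0.0, 7000.0]},
    ],
    objective_name="score",
)

# Attach and complete one trial per grid point with pre-computed outcomes.
for x1, x2, x3 in product(
    range_float(0.0, 5000.0, percent),
    range_float(0.0, 6000.0, percent),
    range_float(0.0, 7000.0, percent),
):
    _, trial_index = ax_client.attach_trial(parameters={"x1": x1, "x2": x2, "x3": x3})
    ax_client.complete_trial(trial_index=trial_index, raw_data=x1 + x2 + x3)

# Memory blows up a few iterations into the model-based generation.
for _ in range(10):
    parameters, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=sum(parameters.values()))
```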
I've replicated this issue with MOO. Reducing the percentage (e.g. to 20/100) does not cause this behavior. If I had to guess, it does not appear to be a memory leak; I think that some internal operation of Ax/BoTorch has very steep memory complexity.