facebook / Ax

Adaptive Experimentation Platform
https://ax.dev
MIT License

SAASBO Tutorial not working on GPU / Pyro error? #1108

Closed winf-hsos closed 2 years ago

winf-hsos commented 2 years ago

When I download the code from https://ax.dev/tutorials/saasbo.html and run it on Google Colab with GPU, I get the following error message:

RuntimeError                              Traceback (most recent call last)
<ipython-input-21-fb6ba87452b0> in <module>
     24         torch_dtype=tkwargs["dtype"],
     25         verbose=True,  # Set to True to print stats from MCMC
---> 26         disable_progbar=True,  # Set to False to print a progress bar from MCMC
     27     )
     28     generator_run = model.gen(BATCH_SIZE)

20 frames
/usr/local/lib/python3.7/dist-packages/pyro/infer/mcmc/util.py in _potential_fn_jit(self, skip_jit_warnings, jit_options, params)
    292 
    293         if self._compiled_fn:
--> 294             return self._compiled_fn(*vals)
    295 
    296         with pyro.validation_enabled(False):

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Graph::copy() encountered a use of a value 133 not in scope. Run lint!

when running this code cell:

# Experiment
experiment = Experiment(
    name="saasbo_experiment",
    search_space=search_space,
    optimization_config=optimization_config,
    runner=SyntheticRunner(),
)

# Initial Sobol points
sobol = Models.SOBOL(search_space=experiment.search_space)
for _ in range(N_INIT):
    experiment.new_trial(sobol.gen(1)).run()

# Run SAASBO
data = experiment.fetch_data()
for i in range(N_BATCHES):
    model = Models.FULLYBAYESIAN(
        experiment=experiment, 
        data=data,
        num_samples=256,  # Increasing this may result in better model fits
        warmup_steps=512,  # Increasing this may result in better model fits
        gp_kernel="rbf",  # "rbf" is the default in the paper, but we also support "matern"
        torch_device=tkwargs["device"],
        torch_dtype=tkwargs["dtype"],
        verbose=True,  # Set to True to print stats from MCMC
        disable_progbar=True,  # Set to False to print a progress bar from MCMC
    )
    generator_run = model.gen(BATCH_SIZE)
    trial = experiment.new_batch_trial(generator_run=generator_run)
    trial.run()
    data = Data.from_multiple_data([data, trial.fetch_data()])

    new_value = trial.fetch_data().df["mean"].min()
    print(f"Iteration: {i}, Best in iteration {new_value:.3f}, Best so far: {data.df['mean'].min():.3f}")

This only happens when I use CUDA. When I change the device to CPU, it works fine. The same error occurs on our internal cluster with an NVIDIA A40 GPU.

BTW: The same error occurs with the BoTorch example at https://botorch.org/tutorials/saasbo, which makes sense given that both tutorials use the same libraries.

Any help is greatly appreciated! Thanks! Nicolas

winf-hsos commented 2 years ago

A remark: On our cluster I could reproduce the error with Python 3.7.10 and 3.9.5. When I use Python 3.10.6 I get a different error message - don't know if they are related:

  File "/home/nimeseth/env/lib/python3.10/site-packages/ax/models/torch/fully_bayesian.py", line 346, in get_and_fit_model_mcmc
    model, mcmc_samples_list = _get_model_mcmc_samples(
  File "/home/nimeseth/env/lib/python3.10/site-packages/ax/models/torch/fully_bayesian.py", line 302, in _get_model_mcmc_samples
    mcmc_samples = run_inference(
  File "/home/nimeseth/env/lib/python3.10/site-packages/ax/models/torch/fully_bayesian.py", line 407, in run_inference
    mcmc.run(
  File "/home/nimeseth/env/lib/python3.10/site-packages/pyro/poutine/messenger.py", line 11, in _context_wrap
    with context:
AttributeError: __enter__

saitcakmak commented 2 years ago

Hi @winf-hsos. Are you using GPyTorch 1.9.0? If so, downgrading to GPyTorch 1.8.1 will likely fix the issue. We’ll put up a patch fixing this sometime this week. Cc @Balandat

Re python 3.10: We currently do not support 3.10 due to some upstream dependencies. I’d recommend using python 3.8 or 3.9 for now.
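For readers who want to catch the Python 3.10 case before MCMC starts, here is a minimal sketch of a guard. It only encodes what this thread reports (3.7 and 3.9 ran as far as the GPU error, 3.10 failed with `AttributeError: __enter__`); the function name is hypothetical, and later Pyro or Ax releases may well support newer interpreters.

```python
import sys


def python_ok_for_saasbo_at_time_of_thread(version_info=sys.version_info) -> bool:
    """Hypothetical helper: True for interpreters older than 3.10.

    Assumption: this mirrors the advice in this thread only; it is not an
    official compatibility table for Ax or Pyro.
    """
    return tuple(version_info[:2]) < (3, 10)


if __name__ == "__main__":
    if not python_ok_for_saasbo_at_time_of_thread():
        print("This thread reports Pyro failures on Python 3.10; 3.8 or 3.9 was recommended.")
```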

winf-hsos commented 2 years ago

@saitcakmak: Thanks for the prompt response. Having read about a similar issue for Ax and BoTorch, where downgrading GPyTorch to 1.8.1 worked, I already tried that. The error is still there. But only on GPU, on CPU it works fine.

Balandat commented 2 years ago

Yeah that doesn’t look like a GPyTorch error.

The __enter__ issue is b/c pyro doesn’t have a py3.10 compatible release at this point.

The other issue is something I am not familiar with. Looks more like a PyTorch error to me. What PyTorch version is this on?

winf-hsos commented 2 years ago

pip show torch

gives me this:

Name: torch
Version: 1.12.1+cu113
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.7/dist-packages
Requires: typing-extensions
Required-by: torchvision, torchtext, torchaudio, pyro-ppl, gpytorch, fastai, botorch

You can easily reproduce the error by opening this Google Colab notebook and running it through cell 9, which is where the error occurs: https://colab.research.google.com/drive/1ZzHKJ3bMjYKNN14JSLjEVKyT-rD45rw9?usp=sharing

Thanks Nicolas

winf-hsos commented 2 years ago

Make sure you use a GPU runtime when you try it. On CPU it works perfectly fine. I use the free tier on Google Colab and it produces the error.
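Since the failure is reported only on GPU, one stopgap is to force the CPU device while the root cause is investigated. The sketch below is a plain-Python illustration of that choice; `pick_device` and its `force_cpu` flag are hypothetical names, not part of Ax or PyTorch.

```python
def pick_device(cuda_available: bool, force_cpu: bool = False) -> str:
    """Hypothetical helper: return the device string to pass to torch.device().

    force_cpu=True falls back to CPU even when a GPU is present, which is
    the configuration reported as working in this thread.
    """
    if force_cpu or not cuda_available:
        return "cpu"
    return "cuda"


# In the notebook this could feed the tutorial's tkwargs, for example:
#   tkwargs["device"] = torch.device(pick_device(torch.cuda.is_available(), force_cpu=True))
```

CPU runs will be slower than GPU runs, so this is a temporary measure rather than a fix.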

winf-hsos commented 2 years ago

I added the Ax SAASBO tutorial as a Colab notebook for easy reproduction. Simply run the cells until cell 12 where the same error pops up when using GPU.

https://colab.research.google.com/drive/18xfp1xlceyXrPNUcEzJkHVCxTVEcZAlf?usp=sharing

Balandat commented 2 years ago

I can confirm that I can repro the issue with the Colab notebook. Looks like we'll need to do some digging on our end to understand what's going on here.

saitcakmak commented 2 years ago

As a workaround for now, I can run the colab notebook if I pin torch==1.11 & gpytorch==1.8.1.
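To check whether an environment matches the combination discussed here, the following sketch lists the installed versions of the packages named in this thread and flags the torch 1.12 series. It assumes Python 3.8+ (for `importlib.metadata`); the helper names are hypothetical, and the 1.12 check reflects only what was reported in this thread, not a confirmed list of affected releases.

```python
from importlib import metadata

# PyPI distribution names of the packages mentioned in this thread.
PACKAGES = ["torch", "gpytorch", "pyro-ppl", "botorch", "ax-platform"]


def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '1.12.1+cu113'."""
    public = version.split("+", 1)[0]  # drop the local tag, e.g. '+cu113'
    parts = public.split(".")
    return int(parts[0]), int(parts[1])


def is_reported_failing_torch(version: str) -> bool:
    """True for the torch 1.12 series, the series the GPU failure was reported on here."""
    return major_minor(version) == (1, 12)


def report_versions() -> dict:
    """Map each package name to its installed version, or None if not installed."""
    found = {}
    for name in PACKAGES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    versions = report_versions()
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    torch_version = versions["torch"]
    if torch_version and is_reported_failing_torch(torch_version):
        print("torch 1.12.x detected; this thread reports that pinning torch==1.11 and gpytorch==1.8.1 works on GPU.")
```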

Balandat commented 2 years ago

> I can run the colab notebook if I pin torch==1.11 & gpytorch==1.8.1.

Hmm, interesting. That suggests it might be some kind of compatibility issue between Pyro and PyTorch 1.12, since the failure happens with gpytorch 1.8.1 and pytorch 1.12.1+cu113.

winf-hsos commented 2 years ago

> As a workaround for now, I can run the colab notebook if I pin torch==1.11 & gpytorch==1.8.1.

Nice catch, that's interesting. And the workaround is very helpful for us! Looking forward to learning the root cause here.

I can confirm the workaround works. Just ran a successful optimization with the aforementioned notebook.

lena-kashtelyan commented 2 years ago

Hi @winf-hsos, Ax 0.2.7 is now out and should address this issue! Please reopen this and let us know if this remains present for you : )

winf-hsos commented 2 years ago

Thanks, that's great @lena-kashtelyan, will try it and let you know ✊