getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

[Bug] Running Several Keops Compilations at Once #168

Closed wjmaddox closed 3 years ago

wjmaddox commented 3 years ago

Hi, I'm having trouble running several KeOps operations in parallel on an 8-GPU server (I'm trying to use the separate GPUs to fit gpytorch models simultaneously). Everything works fine unless two scripts are compiling KeOps code at the same time, at which point I start getting strange errors that look related to files being overwritten.

On the surface, I wouldn't have expected this to be caused by multi-processing, but the script works fine unless two or more instances are running at once. Is there an easy fix that allows this kind of multi-processing? (I'd love to be able to launch these on a larger server, but will probably hit more issues there.)

Example script

import argparse
import torch

from gpytorch import settings

from botorch.models import SingleTaskGP
from gpytorch.kernels.keops import MaternKernel
from botorch.optim.fit import fit_gpytorch_torch
from gpytorch.mlls import ExactMarginalLogLikelihood

def main(device, dim):
    train_x = torch.randn(10000, dim, device = device)
    train_y = torch.norm(train_x, dim=-1)
    model = SingleTaskGP(train_x, train_y.view(-1,1), covar_module = MaternKernel())
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_torch(mll)

    with settings.fast_pred_samples(True), torch.no_grad():
        test_x = torch.randn(1000, dim, device = device)
        pred_post = model.posterior(test_x)
        res = pred_post.rsample(torch.Size((256,))).norm()

if __name__ == "__main__":
   parser = argparse.ArgumentParser()
   parser.add_argument("--device", type=int, default=0)
   parser.add_argument("--dim", type=int, default=10)
   args = parser.parse_args()
   main(args.device, args.dim)

Example commands

python mwe.py --device=0 --dim=10 &
python mwe.py --device=3 --dim=23

Error Message

Traceback (most recent call last):
  File "mwe.py", line 28, in <module>
    main(args.device, args.dim)
  File "mwe.py", line 20, in main
    pred_post = model.posterior(test_x)
  File "/home/wesley_m/botorch/botorch/models/gpytorch.py", line 325, in posterior
    mvn = self(X)
  File "/home/wesley_m/gpytorch/gpytorch/models/exact_gp.py", line 319, in __call__
    predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
  File "/home/wesley_m/gpytorch/gpytorch/models/exact_prediction_strategies.py", line 262, in exact_prediction
    self.exact_predictive_mean(test_mean, test_train_covar),
  File "/home/wesley_m/gpytorch/gpytorch/models/exact_prediction_strategies.py", line 280, in exact_predictive_mean
    res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
  File "/home/wesley_m/gpytorch/gpytorch/utils/memoize.py", line 59, in g
    return _add_to_cache(self, cache_name, method(self, *args, **kwargs), *args, kwargs_pkl=kwargs_pkl)
  File "/home/wesley_m/gpytorch/gpytorch/models/exact_prediction_strategies.py", line 229, in mean_cache
    mean_cache = train_train_covar.evaluate_kernel().inv_matmul(train_labels_offset).squeeze(-1)
  File "/home/wesley_m/gpytorch/gpytorch/lazy/lazy_tensor.py", line 1172, in inv_matmul
    return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
  File "/home/wesley_m/gpytorch/gpytorch/functions/_inv_matmul.py", line 53, in forward
    solves = _solve(lazy_tsr, right_tensor)
  File "/home/wesley_m/gpytorch/gpytorch/functions/_inv_matmul.py", line 21, in _solve
    return lazy_tsr._solve(rhs, preconditioner)
  File "/home/wesley_m/gpytorch/gpytorch/lazy/lazy_tensor.py", line 661, in _solve
    preconditioner=preconditioner,
  File "/home/wesley_m/gpytorch/gpytorch/utils/linear_cg.py", line 174, in linear_cg
    residual = rhs - matmul_closure(initial_guess)
  File "/home/wesley_m/gpytorch/gpytorch/lazy/added_diag_lazy_tensor.py", line 57, in _matmul
    return torch.addcmul(self._lazy_tensor._matmul(rhs), self._diag_tensor._diag.unsqueeze(-1), rhs)
  File "/home/wesley_m/gpytorch/gpytorch/lazy/keops_lazy_tensor.py", line 30, in _matmul
    return self.covar_mat @ rhs.contiguous()
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 2200, in __matmul__
    Kv = Kv.sum(Kv.dim() - 2, **kwargs)  # Matrix-vector or Matrix-matrix product
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 1773, in sum
    return self.reduction("Sum", axis=axis, **kwargs)
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 744, in reduction
    return res()
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 929, in __call__
    return self.callfun(*args, *self.variables, **self.kwargs)
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/torch/generic/generic_red.py", line 579, in __call__
    *args
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/torch/generic/generic_red.py", line 48, in forward
    formula, aliases, dtype, "torch", optional_flags, include_dirs
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/common/keops_io.py", line 48, in __init__
    self._safe_compile()
  File "/home/wesley_m/miniconda3/lib/python3.7/site-packages/pykeops/common/utils.py", line 80, in wrapper_filelock
    os.remove(os.path.join(bf, "pykeops_build2.lock"))
FileNotFoundError: [Errno 2] No such file or directory: '/home/wesley_m/.cache/pykeops-1.5-cpython-37//build-bde5ec9e23/pykeops_build2.lock'
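
The last frame shows os.remove failing on pykeops_build2.lock, which would be consistent with another process sharing the same build folder having already deleted that file. The snippet below is only a minimal standard-library sketch of that failure mode, not the pykeops code: remove_lock is a hypothetical helper, and the two sequential calls stand in for two processes that each believe they own the lock file.

```python
import os
import tempfile


def remove_lock(path):
    """Try to delete a lock file; return True only if this call removed it."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Someone else already deleted the file. An unguarded os.remove
        # would raise here, as in the traceback above.
        return False


with tempfile.TemporaryDirectory() as build_dir:
    # Hypothetical stand-in for the shared build folder and its lock file.
    lock = os.path.join(build_dir, "pykeops_build2.lock")
    open(lock, "w").close()

    first = remove_lock(lock)   # plays the role of the first process
    second = remove_lock(lock)  # plays the role of the second process
```

Under this reading, the second caller is the one that hits FileNotFoundError; whether that is exactly what happens inside pykeops 1.5 is an assumption.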

System profile:

cc @activatedgeek

activatedgeek commented 3 years ago

I found a naive trick to make multiple parallel KeOps compilations run. The idea is to simply change the compilation folder in which KeOps operates.

import pykeops
import tempfile
with tempfile.TemporaryDirectory() as dirname:
    pykeops.set_bin_folder(dirname)

    # Run code that triggers compilation.
    main()

The context manager deletes the temporary folder on exit. The downside is that we no longer benefit from the existing cache, so compilation always happens from scratch. Note that pykeops.set_bin_folder needs to be called before any KeOps compilation is triggered.
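
A possible variation on this trick, sketched here under assumptions rather than tested against KeOps: give each GPU index its own persistent build folder, so that concurrent runs on different devices do not share lock files but repeated runs on the same device can still reuse compiled code. The helper names per_device_build_dir and use_per_device_build_dir, and the ~/.cache/pykeops-per-device location, are hypothetical; only pykeops.set_bin_folder comes from the comment above.

```python
import os
import tempfile


def per_device_build_dir(device, root=None):
    """Return (and create if needed) a build folder reserved for one GPU index."""
    if root is None:
        # Hypothetical default location, chosen for this sketch only.
        root = os.path.join(os.path.expanduser("~"), ".cache", "pykeops-per-device")
    path = os.path.join(root, f"device-{device}")
    os.makedirs(path, exist_ok=True)
    return path


def use_per_device_build_dir(device, root=None):
    """Point KeOps at the per-device folder; call before any compilation."""
    import pykeops  # imported lazily so the path helper works without pykeops

    path = per_device_build_dir(device, root)
    pykeops.set_bin_folder(path)
    return path


if __name__ == "__main__":
    # Demo of the path layout only, using a throwaway root directory.
    with tempfile.TemporaryDirectory() as demo_root:
        print(per_device_build_dir(0, demo_root))
        print(per_device_build_dir(3, demo_root))
```

In the example script, this would mean calling use_per_device_build_dir(args.device) at the top of the __main__ block, before main(...). Two processes launched on the same device would still share a folder, so this does not help in that case.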

wjmaddox commented 3 years ago

Closing due to this resolution.

bcharlier commented 3 years ago

Thanks everyone for pointing this out and for the trick. The next KeOps version should have much shorter compilation times, which should make the lack of a cache folder painless in many cases.