FalkonML / falkon

Large-scale, multi-GPU capable, kernel solver
https://falkonml.github.io/falkon/
MIT License

Segmentation fault #68

Open ahabedsoltan opened 11 months ago

ahabedsoltan commented 11 months ago

Hello,

I've been using FALKON, and it functions well on a GPU with a small number of centers. However, when I increase the number of centers to around 64,000, I encounter a "Segmentation fault" error, causing the program to terminate.

Here's the sequence of events during the run:

falkon starts...
MainProcess.MainThread::[Calcuating Preconditioner of size 128000]
Preconditioner will run on 1 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 80.324s
--MainProcess.MainThread::[Cholesky 1]
Using parallel POTRF
--MainProcess.MainThread::[Cholesky 1] complete in 47.152s
--MainProcess.MainThread::[Copy triangular]
--MainProcess.MainThread::[Copy triangular] complete in 17.284s
--MainProcess.MainThread::[LAUUM(CUDA)]
--MainProcess.MainThread::[LAUUM(CUDA)] complete in 56.486s
--MainProcess.MainThread::[Cholesky 2]
Segmentation fault

I'm curious about why this issue arises with a large number of centers. Previously, I've successfully used FALKON with up to 256,000 centers. It seems that with the current updated version, there are issues at this scale. Your assistance in resolving this matter would be greatly appreciated.
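For anyone hitting a similar crash: a segmentation fault inside a native extension normally kills the interpreter without a Python traceback. One general way to see which Python call was active is the standard-library `faulthandler` module. The sketch below is not specific to Falkon; the log file name `falkon_crash.log` is a placeholder chosen for illustration.

```python
import faulthandler

# Hypothetical log file name; keep the file object alive for the whole run,
# because faulthandler writes to its file descriptor when the crash happens.
crash_log = open("falkon_crash.log", "w")

# On a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL) the Python
# traceback of every thread is written to crash_log before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)

# ... run the failing code here, for example the model.fit(...) call ...
```

Setting the environment variable `PYTHONFAULTHANDLER=1` or running `python -X faulthandler script.py` enables the same handler without code changes, writing to stderr instead of a file.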

parthe commented 11 months ago

We have tried setting the following options, but the segfault persists: never_store_kernel=True, chol_force_ooc=True, no_single_kernel=False

parthe commented 11 months ago

Here is a minimal working example that reproduces the error reported by @ahabedsoltan:

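import falkon
import torch

# n training points, N test points, M centers, d dimensions, bw kernel bandwidth
n, N, M, d, bw = 200_000, 1000, 64_000, 1, 1.

# Accuracy (%) computed from the argmax of the targets and of the predictions
accufun = lambda yt, yh: 100 * (yt.argmax(dim=1) == yh.argmax(dim=1)).sum() / yh.shape[0]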

options = falkon.FalkonOptions(debug=True,
    never_store_kernel=True,
    chol_force_ooc=True,
    no_single_kernel=False)
kernel_fn_flk = falkon.kernels.LaplacianKernel(sigma=bw, opt=options)
model = falkon.Falkon(kernel=kernel_fn_flk, penalty=1e-6, M=M, options=options,
                      error_every=1, error_fn=accufun, maxiter=1)

X = torch.randn(n, d)
Y = torch.randn(n, d)
x = torch.randn(N, d)
y = torch.randn(N, d)
model.fit(X, Y, Xts=x, Yts=y)

Giodiro commented 11 months ago

Hi! I think it was a bug in a small helper function; it should be fixed on master now. Are you comfortable trying it out from master, or would you prefer that I release a new version?

ahabedsoltan commented 11 months ago

Thank you. Could you please create a pre-built wheel for it? Each time I try to install it using the command 'pip install git+https://github.com/falkonml/falkon.git', the installation fails.

parthe commented 11 months ago

Reinstalling falkon as follows solved the issue. @Giodiro Thanks for the quick bug-fix!

pip uninstall falkon
pip install --no-build-isolation git+https://github.com/FalkonML/falkon.git
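
After reinstalling, it can help to confirm which build is actually being picked up before re-running the reproduction. A minimal sketch using only the standard library follows; `installed_version` is a helper name introduced here for illustration, and the version string of a git install is whatever the repository declares at that commit.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions visible to the current interpreter.
print("falkon:", installed_version("falkon"))
print("torch:", installed_version("torch"))
```

If `falkon` prints `None`, the install went into a different environment than the one running the script.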

ahabedsoltan commented 10 months ago

Thank you, it resolved the issue.