Open ahabedsoltan opened 11 months ago
We have tried setting the following options, but the segfault persists:
never_store_kernel=True
chol_force_ooc=True
no_single_kernel=False
Here is a minimal working example that reproduces the error reported by @ahabedsoltan:
import falkon
import torch

# n: training points, N: test points, M: Nystrom centers, d: dimension, bw: kernel bandwidth
n, N, M, d, bw = 200_000, 1000, 64_000, 1, 1.0

# Accuracy in percent: share of rows where the argmax of targets and predictions agree
accufun = lambda yt, yh: 100 * (yt.argmax(dim=1) == yh.argmax(dim=1)).sum() / yh.shape[0]

options = falkon.FalkonOptions(debug=True,
                               never_store_kernel=True,
                               chol_force_ooc=True,
                               no_single_kernel=False)
kernel_fn_flk = falkon.kernels.LaplacianKernel(sigma=bw, opt=options)
model = falkon.Falkon(kernel=kernel_fn_flk, penalty=1e-6, M=M, options=options,
                      error_every=1, error_fn=accufun, maxiter=1)

# Random train (X, Y) and test (x, y) data
X = torch.randn(n, d)
Y = torch.randn(n, d)
x = torch.randn(N, d)
y = torch.randn(N, d)
model.fit(X, Y, Xts=x, Yts=y)
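When chasing a crash like this one, it may help to turn on Python's standard `faulthandler` module before calling `model.fit`, so the interpreter dumps the Python-level stack when the process receives a fatal signal such as SIGSEGV. The sketch below is only a debugging aid: the helper name `enable_crash_traces` is hypothetical and not part of Falkon, and the dump shows where Python was when native code crashed, not the native frames themselves.

```python
import faulthandler
import sys


def enable_crash_traces(stream=sys.stderr):
    """Dump Python tracebacks to `stream` on fatal signals (e.g. SIGSEGV).

    `stream` must be a real file object (it needs a file descriptor) and
    must stay open for as long as the handler is enabled.
    """
    if not faulthandler.is_enabled():
        # all_threads=True also prints the stacks of other Python threads.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traces()` once at the top of the reproduction script, before the model is built, should be enough. Running the script as `python -X faulthandler script.py` has a similar effect without code changes.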
Hi! I think it was a bug in a small helper function, and it should now be fixed on master. Are you comfortable trying it out from master, or would you prefer that I release a new version?
Thank you. Could you please create a pre-built wheel for it? Each time I try to install it using the command 'pip install git+https://github.com/falkonml/falkon.git', the installation fails.
Reinstalling falkon as follows solved the issue. @Giodiro Thanks for the quick bug-fix!
pip uninstall falkon
pip install --no-build-isolation git+https://github.com/FalkonML/falkon.git
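After reinstalling, a quick way to see which build Python actually picks up is to query the installed package metadata. This is a minimal sketch using the standard library; it assumes the distribution is registered under the name `falkon` (as the `pip uninstall falkon` command suggests), and the helper `installed_version` is hypothetical. Note that a source install from master may report the same version string as the last release, so this confirms that the package is installed rather than that the fix is present.

```python
from importlib import metadata


def installed_version(package="falkon"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("falkon version:", installed_version())
```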
Thank you, it resolved the issue.
Hello,
I've been using FALKON, and it functions well on a GPU with a small number of centers. However, when I increase the number of centers to around 64,000, I encounter a "Segmentation fault" error, causing the program to terminate.
Here's the sequence of events during the run:
falkon starts...
MainProcess.MainThread::[Calcuating Preconditioner of size 128000]
Preconditioner will run on 1 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 80.324s
--MainProcess.MainThread::[Cholesky 1]
Using parallel POTRF
--MainProcess.MainThread::[Cholesky 1] complete in 47.152s
--MainProcess.MainThread::[Copy triangular]
--MainProcess.MainThread::[Copy triangular] complete in 17.284s
--MainProcess.MainThread::[LAUUM(CUDA)]
--MainProcess.MainThread::[LAUUM(CUDA)] complete in 56.486s
--MainProcess.MainThread::[Cholesky 2]
Segmentation fault
I'm curious about why this issue arises with a large number of centers. Previously, I've successfully used FALKON with up to 256,000 centers. It seems that with the current updated version, there are issues at this scale. Your assistance in resolving this matter would be greatly appreciated.
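For a sense of scale, the arithmetic below estimates how much memory a dense M x M matrix over the centers would need. It is a rough sketch that assumes the preconditioner operates on one dense square matrix of size M (which the "Preconditioner of size 128000" log line suggests) and ignores any extra workspace; the helper `dense_square_gib` is hypothetical. It does not diagnose the crash, it only illustrates why runs at this size depend on out-of-core settings such as `chol_force_ooc`.

```python
def dense_square_gib(m, bytes_per_element):
    """Size in GiB of a dense m x m matrix with the given element width."""
    return m * m * bytes_per_element / 2**30


for m in (64_000, 128_000, 256_000):
    print(f"M={m:>7,}: float32 {dense_square_gib(m, 4):7.1f} GiB, "
          f"float64 {dense_square_gib(m, 8):7.1f} GiB")
```

At M = 64,000 a single-precision matrix already takes about 15 GiB, and the figure grows fourfold each time M doubles.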