KindXiaoming / pykan

Kolmogorov Arnold Networks

When running on Apple GPU (MPS), the loss is always nan. #199

Open CaSiOFT opened 4 months ago

CaSiOFT commented 4 months ago

I am using an M1 Pro Mac with the latest official PyTorch build, which supports MPS. Previously, when running other models on the GPU, I did not encounter similar issues. When I run the introductory example from the official documentation, the result is normal on the CPU, but on MPS the loss is always nan.

import kan
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))
# device = torch.device("cpu")

# create a KAN: 2D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = kan.KAN(width=[2, 5, 1], grid=5, k=3, seed=0, device=device)

# create dataset f(x,y) = exp(sin(pi*x)+y^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = kan.create_dataset(f, n_var=2, device=device)
print(dataset['train_input'].shape, dataset['train_label'].shape)
# plot KAN at initialization
model(dataset['train_input'])
model.plot(beta=100)
# train the model
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device)
model.plot()

The above is my code. When running it, exceptions sometimes occur during training, and plotting the model structure can also fail with ValueError: alpha (nan) is outside 0-1 range. Even when there is no error, the plotted graph differs significantly from the CPU run (the lines are noticeably thinner, and the function curves inside the nodes fluctuate abnormally). [attached plot: output]

I found a similar report among the existing issues: "I don't know why but if use MPS(Apple SIlicon) to loss is nan."

model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device.type);
train loss: nan | test loss: nan | reg: nan : 100%|█████████████████| 20/20 [00:03<00:00,  5.11it/s]

Originally posted by @brainer3220 in https://github.com/KindXiaoming/pykan/issues/98#issuecomment-2097514857
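
A quick way to narrow this down is to check whether a single forward pass already produces nan on MPS before any training; a minimal sketch, continuing the script above:

out = model(dataset['train_input'])
# if this prints True, the nan appears in the forward pass itself, before any optimization
print(torch.isnan(out).any().item())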

CaSiOFT commented 4 months ago

I ran the test using the same code on another CUDA machine, and the results were completely normal.

Stealeristaken commented 4 months ago

I ran the test using the same code on another CUDA machine, and the results were completely normal.

Well, that's weird.

daguo7 commented 4 months ago

I have the same problem. Did you solve it, bro?

CaSiOFT commented 4 months ago

I have the same problem. Did you solve it, bro?

No, I don't have any ideas on how to handle this issue. Are you experiencing the exact same problem?

daguo7 commented 4 months ago

Yes. When I run the last part of hellokan, the train loss and test loss are nan, and the symbolic formula gives the same result.

Stealeristaken commented 4 months ago

I have the same problem; I mentioned it in #179.

CaSiOFT commented 4 months ago

I have the same problem; I mentioned it in #179.

Are you using MPS or CUDA?

Stealeristaken commented 4 months ago

I have the same problem; I mentioned it in #179.

Are you using MPS or CUDA?

I'm using plain Apple silicon, so I cannot use CUDA.

daguo7 commented 4 months ago

I'm in the same situation as you.

Stealeristaken commented 4 months ago

Can you guys try pruning your model with threshold=2e-1?
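
A minimal sketch of that suggestion, assuming the KAN.prune call from the hellokan example accepts a threshold argument and returns the pruned model:

# prune weak nodes; a larger threshold prunes more aggressively
model = model.prune(threshold=2e-1)
model(dataset['train_input'])  # forward pass so activations are cached again for plotting
model.plot()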

daguo7 commented 4 months ago

Bro, I tried, and the problem seems to have been solved. Thank you.

Stealeristaken commented 4 months ago

The problem is caused by the loss function, as the author mentioned, so a lower threshold gives better results but worse pruning.

wkqian06 commented 4 months ago

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not running on the CPU. I have added an alternative method to calculate the coefficients in my branch, enabled with the following setting.

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.
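
The general idea (replacing torch.linalg.lstsq with an SVD-based pseudo-inverse solve) might look roughly like the sketch below; the function name and tolerance are illustrative, not the actual code in that branch:

import torch

def lstsq_via_svd(A, y, rcond=1e-8):
    # solve min_x ||A x - y|| with an SVD pseudo-inverse instead of torch.linalg.lstsq
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    S_inv = torch.where(S > rcond * S.max(), 1.0 / S, torch.zeros_like(S))
    return Vh.transpose(-2, -1) @ (S_inv.unsqueeze(-1) * (U.transpose(-2, -1) @ y))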

brent-halen commented 4 months ago

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not running on the CPU. I have added an alternative method to calculate the coefficients in my branch, enabled with the following setting.

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.

I tried running this on a couple of the examples in my CUDA setup, but I got the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 model.train(dataset, opt="LBFGS", steps=20);

Cell In[7], line 888, in KAN.train(self, dataset, opt, steps, log, lamb, lamb_l1, lamb_entropy, lamb_coef, lamb_coefdiff, update_grid, grid_update_num, loss_fn, lr, stop_grid_update_step, batch, small_mag_threshold, small_reg_factor, metrics, sglr_avoid, save_fig, in_vars, out_vars, beta, save_fig_freq, img_folder, device)
    885 test_id = np.random.choice(dataset['test_input'].shape[0], batch_size_test, replace=False)
    887 if _ % grid_update_freq == 0 and _ < stop_grid_update_step and update_grid:
--> 888     self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
    890 if opt == "LBFGS":
    891     optimizer.step(closure)

Cell In[7], line 233, in KAN.update_grid_from_samples(self, x)
    210 '''
    211 update grid from samples
    212 
   (...)
    230 tensor([0.0128, 1.0064, 2.0000, 2.9937, 3.9873, 4.9809])
    231 '''
    232 for l in range(self.depth):
--> 233     self.forward(x)
    234     self.act_fun[l].update_grid_from_samples(self.acts[l])

Cell In[7], line 301, in KAN.forward(self, x)
    297 self.acts.append(x)  # acts shape: (batch, width[l])
    299 for l in range(self.depth):
--> 301     x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
    303     if self.symbolic_enabled == True:
    304         x_symbolic, postacts_symbolic = self.symbolic_fun[l](x)

File /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

Cell In[6], line 167, in KANLayer.forward(self, x)
    165 batch = x.shape[0]
    166 # x: shape (batch, in_dim) => shape (size, batch) (size = out_dim * in_dim)
--> 167 x = torch.einsum('ij,k->ikj', x, torch.ones(self.out_dim, device=self.device)).reshape(batch, self.size).permute(1, 0)
    168 preacts = x.permute(1, 0).clone().reshape(batch, self.out_dim, self.in_dim)
    169 base = self.base_fun(x).permute(1, 0)  # shape (batch, size)

File /usr/local/lib/python3.11/dist-packages/torch/functional.py:385, in einsum(*args)
    380     return einsum(equation, *_operands)
    382 if len(operands) <= 2 or not opt_einsum.enabled:
    383     # the path for contracting 0 or 1 time(s) is already optimized
    384     # or the user has disabled using opt_einsum
--> 385     return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
    387 path = None
    388 if opt_einsum.is_available():

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

wkqian06 commented 4 months ago

I tried running this on a couple of the examples in my CUDA setup, but I got the following error:

You may want to pass a device parameter to train; the default device for train is 'cpu': model.train(dataset, opt="LBFGS", steps=20, device='cuda').

CaSiOFT commented 4 months ago

Can you guys try pruning your model with threshold=2e-1?

I tried, but there are still problems. Can you share the code? Thanks.

CaSiOFT commented 4 months ago

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not running on the CPU. I have added an alternative method to calculate the coefficients in my branch, enabled with the following setting.

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately it seems that the MPS version of PyTorch does not implement this operation. I followed the instructions to set the environment variables, but it didn't work. But thank you for sharing.
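
One pitfall with that fallback variable is that it has to be in the environment before torch is first imported, otherwise it is ignored; a minimal sketch:

import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # must be set before `import torch`

import torch
import kan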

Stealeristaken commented 4 months ago

There is something wrong with torch.linalg.lstsq in spline.py; it can produce nan when we are not running on the CPU. I have added an alternative method to calculate the coefficients in my branch, enabled with the following setting.

model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set,
            coef_method='svd',
            )

Hope it works on Apple GPU.

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately it seems that the MPS version of PyTorch does not implement this operation. I followed the instructions to set the environment variables, but it didn't work. But thank you for sharing.

Maybe updating your torch version could help; try pip install -U torch and run again.

Stealeristaken commented 4 months ago

Can you guys try pruning your model with threshold=2e-1?

I tried, but there are still problems. Can you share the code? Thanks.

I don't have specific code. I believe this is a problem with the loss function outputs, so pruning with bigger thresholds could help; try more specific values. But it's so weird that it only happens on MPS devices. It scratches my brain, tbh.

wkqian06 commented 4 months ago

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately it seems that the MPS version of PyTorch does not implement this operation. I followed the instructions to set the environment variables, but it didn't work. But thank you for sharing.

It's weird though. I used torch.linalg.svd instead of torch.linalg.lstsq in my code. Not sure why this happens in your case.

I have made some updates in my branch, avoiding any nan, inf, and -inf in coef results, which, at least in my case, works and avoids nan in the loss. This time, just use the default settings for training. model.train(dataset, opt="LBFGS", steps=20)
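
A guard along those lines could be as simple as the following sketch (illustrative, not necessarily the exact code in the branch):

import torch

def sanitize_coef(coef: torch.Tensor) -> torch.Tensor:
    # replace nan / inf / -inf produced by the coefficient fit with finite values
    return torch.nan_to_num(coef, nan=0.0, posinf=0.0, neginf=0.0)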

CaSiOFT commented 4 months ago

Can you guys try pruning your model with threshold=2e-1?

I tried, but there are still problems. Can you share the code? Thanks.

I don't have specific code. I believe this is a problem with the loss function outputs, so pruning with bigger thresholds could help; try more specific values. But it's so weird that it only happens on MPS devices. It scratches my brain, tbh.

I updated to PyTorch 2.3.0, but unfortunately it didn't improve. As for pruning, in fact none of the units in the model produced usable results, so pruning had no effect either. I can only assume there is a problem with PyTorch's MPS implementation, but I have never encountered platform-specific issues before, so I am very confused.

CaSiOFT commented 4 months ago

I tried your code, but the following error occurred: NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. Unfortunately it seems that the MPS version of PyTorch does not implement this operation. I followed the instructions to set the environment variables, but it didn't work. But thank you for sharing.

It's weird though. I used torch.linalg.svd instead of torch.linalg.lstsq in my code. Not sure why this happens in your case.

I have made some updates in my branch, avoiding any nan, inf, and -inf in coef results, which, at least in my case, works and avoids nan in the loss. This time, just use the default settings for training. model.train(dataset, opt="LBFGS", steps=20)

I reset PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate position, and it took effect. But unfortunately, it didn't help much. I tried your latest code. If coef_method='lstsq' is set, the training result is still nan. If coef_method='svd' is set, it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). Surprisingly, when set to the CPU, both coef_method options work properly; SVD performs slightly better.

wkqian06 commented 4 months ago

I reset PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate position, and it took effect. But unfortunately, it didn't help much. I tried your latest code. If coef_method='lstsq' is set, the training result is still nan. If coef_method='svd' is set, it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). Surprisingly, when set to the CPU, both coef_method options work properly; SVD performs slightly better.

KAN.grid or KANLayer.coef may be the two main reasons why the loss is nan during training. I found that KAN.grid can become nan during backpropagation even though it is not trainable; deleting the initialization self.grid = torch.nn.Parameter(...) might help. As for KANLayer.coef, sometimes the initialization itself is nan, which makes the subsequent training fail. One possible experiment is to replace the initialization self.coef = torch.nn.Parameter(curve2coef(...)) with torch.ones of the same size and monitor coef during training to check whether it ever becomes nan.
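
A small helper for that kind of monitoring might look like the sketch below; it assumes the KANLayer objects live in model.act_fun and expose grid and coef attributes, as in the traceback above:

import torch

def report_nonfinite(model):
    # flag any layer whose grid or spline coefficients contain nan or inf
    for i, layer in enumerate(model.act_fun):
        for name in ("grid", "coef"):
            t = getattr(layer, name, None)
            if t is not None and not torch.isfinite(t).all():
                print(f"layer {i}: {name} contains nan/inf")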

CaSiOFT commented 4 months ago

I reset PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate position, and it took effect. But unfortunately, it didn't help much. I tried your latest code. If coef_method='lstsq' is set, the training result is still nan. If coef_method='svd' is set, it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). Surprisingly, when set to the CPU, both coef_method options work properly; SVD performs slightly better.

KAN.grid or KANLayer.coef may be the two main reasons why the loss is nan during training. I found that KAN.grid can become nan during backpropagation even though it is not trainable; deleting the initialization self.grid = torch.nn.Parameter(...) might help. As for KANLayer.coef, sometimes the initialization itself is nan, which makes the subsequent training fail. One possible experiment is to replace the initialization self.coef = torch.nn.Parameter(curve2coef(...)) with torch.ones of the same size and monitor coef during training to check whether it ever becomes nan.

As for the error, I found that it consistently occurs during what should be a perfectly ordinary matrix multiplication in one of the loops of the SVD calculation. A quick search turned up many similar issues, such as https://github.com/pytorch/pytorch/issues/113586 and https://github.com/pytorch/pytorch/issues/96153. It seems to be purely an implementation issue with MPS.

If the variables computed around that point are deep-copied, bypassing the MPS error, the failure moves into the SVD estimator instead: torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 7). The reason is that the parameters passed to the function are all nan. There is no such problem when using the CPU.

Initializing with a matrix of all ones did not help either.

My interim conclusion is that PyTorch's MPS backend is quite unstable, and all sorts of problems can occur.
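
As a blunt workaround for the unstable kernels, the offending linear-algebra call can be run on the CPU and the result moved back to MPS; a sketch:

import torch

def svd_on_cpu(A: torch.Tensor):
    # compute the SVD on the CPU to dodge the MPS kernel, then move the factors back
    U, S, Vh = torch.linalg.svd(A.cpu(), full_matrices=False)
    return U.to(A.device), S.to(A.device), Vh.to(A.device)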

link24tech commented 4 months ago

Can you guys try pruning your model with threshold=2e-1?

It works for me on my Mac setup.

palemoons commented 3 months ago

Same problem here: nothing wrong on my Arch Linux PC, but I got nan results on a Mac (using the CPU).

For me, changing the optimizer from LBFGS to Adam, as the author mentioned in issue #89, just works.

model.train(dataset, opt="Adam", steps=50);