Open CaSiOFT opened 6 months ago
I ran the test using the same code on another CUDA machine, and the results were completely normal.
Well, that's weird.
I have the same problem. Did you solve it, bro?
No, I don't have any ideas on how to handle this issue. Are you experiencing the exact same problem?
Yes. When I run the last part of hellokan, the train loss and test loss are nan, and the symbolic formula gives the same result.
I have the same problem; I mentioned it in #179.
Are you using mps? Or cuda?
I'm using plain Apple Silicon, so I can't use CUDA.
Same here.
Can you guys try pruning your model with threshold=2e-1?
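For anyone unsure what that looks like, here is a minimal sketch (assuming pykan's prune method, which takes a threshold argument; not necessarily the exact call used here):
# prune weak nodes with a larger-than-default threshold
model = model.prune(threshold=2e-1)
model.plot()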
Bro, I tried it. The problem seems to be solved. Thank you.
The problem is caused by the loss function, as the author mentioned, so a lower threshold gives better results but a worse prune.
There is something wrong with torch.linalg.lstsq in spline.py: it can produce nan when we are not on the CPU. I have added an alternative method to calculate the coefficients in my branch, enabled with the following setting:
model = KAN(width=[2,5,1], grid=5, k=3, seed=0, device=device_set, coef_method='svd')
Hope it works on the Apple GPU.
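For context, the idea of replacing the least-squares call with an SVD-based solve looks roughly like this (a sketch under my own assumptions, not the exact code in the branch):
import torch

def lstsq_via_svd(A, b, rcond=1e-8):
    # A: (..., m, n), b: (..., m, k) -> least-squares solution x of shape (..., n, k).
    # Uses the SVD pseudo-inverse instead of torch.linalg.lstsq, since the lstsq
    # kernel is the op that misbehaves when not running on the CPU.
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    # Zero out tiny singular values to keep the solve numerically stable.
    S_inv = torch.where(S > rcond * S.max(), 1.0 / S, torch.zeros_like(S))
    return Vh.transpose(-2, -1) @ (S_inv.unsqueeze(-1) * (U.transpose(-2, -1) @ b))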
I tried running this on a couple of the examples in my CUDA setup, but I got the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[9], line 1
----> 1 model.train(dataset, opt="LBFGS", steps=20);
Cell In[7], line 888, in KAN.train(self, dataset, opt, steps, log, lamb, lamb_l1, lamb_entropy, lamb_coef, lamb_coefdiff, update_grid, grid_update_num, loss_fn, lr, stop_grid_update_step, batch, small_mag_threshold, small_reg_factor, metrics, sglr_avoid, save_fig, in_vars, out_vars, beta, save_fig_freq, img_folder, device)
885 test_id = np.random.choice(dataset['test_input'].shape[0], batch_size_test, replace=False)
887 if _ % grid_update_freq == 0 and _ < stop_grid_update_step and update_grid:
--> 888 self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
890 if opt == "LBFGS":
891 optimizer.step(closure)
Cell In[7], line 233, in KAN.update_grid_from_samples(self, x)
210 '''
211 update grid from samples
212
(...)
230 tensor([0.0128, 1.0064, 2.0000, 2.9937, 3.9873, 4.9809])
231 '''
232 for l in range(self.depth):
--> 233 self.forward(x)
234 self.act_fun[l].update_grid_from_samples(self.acts[l])
Cell In[7], line 301, in KAN.forward(self, x)
297 self.acts.append(x) # acts shape: (batch, width[l])
299 for l in range(self.depth):
--> 301 x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
303 if self.symbolic_enabled == True:
304 x_symbolic, postacts_symbolic = self.symbolic_fun[l](x)
File /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
Cell In[6], line 167, in KANLayer.forward(self, x)
165 batch = x.shape[0]
166 # x: shape (batch, in_dim) => shape (size, batch) (size = out_dim * in_dim)
--> 167 x = torch.einsum('ij,k->ikj', x, torch.ones(self.out_dim, device=self.device)).reshape(batch, self.size).permute(1, 0)
168 preacts = x.permute(1, 0).clone().reshape(batch, self.out_dim, self.in_dim)
169 base = self.base_fun(x).permute(1, 0) # shape (batch, size)
File /usr/local/lib/python3.11/dist-packages/torch/functional.py:385, in einsum(*args)
380 return einsum(equation, *_operands)
382 if len(operands) <= 2 or not opt_einsum.enabled:
383 # the path for contracting 0 or 1 time(s) is already optimized
384 # or the user has disabled using opt_einsum
--> 385 return _VF.einsum(equation, operands) # type: ignore[attr-defined]
387 path = None
388 if opt_einsum.is_available():
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
You may want to pass a device parameter to train; the default device for train is 'cpu'.
model.train(dataset, opt="LBFGS", steps=20, device='cuda')
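A slightly fuller sketch of keeping everything on one device (this assumes create_dataset accepts a device argument, which may depend on the pykan version):
import torch
from kan import KAN, create_dataset

device = 'cuda' if torch.cuda.is_available() else 'cpu'
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2, device=device)          # data created on the target device
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0, device=device)
model.train(dataset, opt="LBFGS", steps=20, device=device)   # pass the same device to train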
Can you guys try pruning your model with threshold=2e-1?
I tried, but there are still problems. Can you share the code? Thanks.
I tried the code in your branch, but the following error occurred:
NotImplementedError: The operator 'aten::linalg_lstsq.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
Unfortunately, it seems that the MPS build of PyTorch does not implement this operation. I followed the instructions and set the environment variable, but it didn't help. Thank you for sharing, though.
Maybe updating your torch version could help; try pip install -U torch and run again.
I don't have specific code to share, but I do believe this is a problem with the loss function outputs, so pruning with bigger thresholds could help; try more specific values. It is so weird that this only happens on MPS devices. It scratches my brain, to be honest.
It's weird though. I used torch.linalg.svd instead of torch.linalg.lstsq in my code. Not sure why this happens in your case.
I have made some updates in my branch, avoiding any nan, inf, and -inf in the coef results, which, at least in my case, works and avoids nan in the loss. This time, just use the default settings for training:
model.train(dataset, opt="LBFGS", steps=20)
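For reference, the kind of clean-up described above can be done with torch.nan_to_num (a sketch; the actual change in the branch may differ):
import torch

def sanitize_coef(coef):
    # Replace nan / +inf / -inf in the fitted spline coefficients with zeros
    # before they are used downstream.
    return torch.nan_to_num(coef, nan=0.0, posinf=0.0, neginf=0.0)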
I updated to PyTorch 2.3.0, but unfortunately it didn't improve things. As for pruning, in fact none of the units in the model produced qualifying results, so pruning had no effect either. I can only assume there is a problem with the MPS backend in PyTorch, but I have never encountered platform issues before, so I am very confused.
I set PYTORCH_ENABLE_MPS_FALLBACK=1 in the appropriate position this time, and it took effect, but unfortunately it didn't help much. I tried your latest code: with coef_method='lstsq', the training result is still nan; with coef_method='svd', it directly reports an error: [MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (10). It's remarkable that when set to CPU, both coef_method options work properly, with SVD performing slightly better.
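For anyone else trying the fallback, the "appropriate position" presumably means setting the variable before torch is first imported, e.g.:
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"   # set before torch is imported, or it may be ignored
import torch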
KAN.grid or KANLayer.coef may be the two main reasons why the loss is nan during training. I found that KAN.grid can become nan during backpropagation even though it is not trainable. Deleting the initialization self.grid = torch.nn.Parameter(...) might help. As for KANLayer.coef, sometimes the initialization can itself be nan, leading to failure in the subsequent training. One possible experiment is to replace the initialization self.coef = torch.nn.Parameter(curve2coef(...)) with torch.ones of the same size and monitor the learning of coef to check whether it ever becomes nan.
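A hypothetical helper for that kind of monitoring (not part of pykan) could look like:
import torch

def report_nonfinite(model):
    # print any parameter (e.g. grid or coef) that has picked up nan/inf values
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite values in {name}")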
As for the error, I found that it consistently occurs during what should be a completely normal matrix multiplication in one of the loops of the SVD calculation. A quick search turned up many similar issues, such as https://github.com/pytorch/pytorch/issues/113586 and https://github.com/pytorch/pytorch/issues/96153. It seems to be purely an implementation issue with MPS.
If I make a deep copy of the variables computed around that point, bypassing the MPS error, it instead fails in svdestimator with torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 7). The reason is that the parameters passed to the function are all nan. There is no such problem when using the CPU.
Initializing with a matrix full of ones did not improve things either.
My interim conclusion is that PyTorch's MPS backend is very unstable, and any problem could potentially occur.
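Given that both coef_method options work on the CPU, the pragmatic workaround for now seems to be pinning everything to the CPU, roughly:
from kan import KAN

device = 'cpu'   # avoid 'mps' until the underlying PyTorch kernels behave
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0, device=device)
model.train(dataset, opt="LBFGS", steps=20, device=device)   # dataset as created earlier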
Can you guys try pruning your model with threshold=2e-1?
It works for me on my Mac setup.
I am using an M1 Pro Mac and have installed the latest official PyTorch build with MPS support. Previously, when running other models on the GPU, I did not encounter similar issues. When I run the example from the beginning of the official documentation, the result is normal on the CPU; on MPS, the result is always nan.
The above is my code. While running it, exceptions may occur during training, and there is also some probability of an exception when plotting the model structure; the error message is ValueError: alpha (nan) is outside 0-1 range. Even when there is no error, the plotted graph differs significantly from the CPU one (the lines in the image are noticeably thinner, and the function curves inside the nodes fluctuate abnormally). I found a similar issue in the issue tracker: I don't know why, but if I use MPS (Apple Silicon) the loss is nan.
Originally posted by @brainer3220 in https://github.com/KindXiaoming/pykan/issues/98#issuecomment-2097514857