KindXiaoming / pykan

Kolmogorov Arnold Networks

The loss did not decrease when training on the observations #225

Open wkqian06 opened 5 months ago

wkqian06 commented 5 months ago

I trained the following model on inputs of shape [1600, 4] obtained from observations:

from kan import KAN

# Build an initial KAN and align its grid with the training samples.
model = KAN(width=[x.shape[1], 4, 4, 1], grid=3, k=3, seed=0, device=device_set)
model.update_grid_from_samples(dataset['train_input'])

grids = [3, 5, 10, 20, 50]
train_rmse = []
test_rmse = []

# Grid refinement: initialize each finer-grid KAN from the previous model, then train.
for i in range(len(grids)):
    model = KAN(width=[x.shape[1], 4, 4, 1], grid=grids[i], k=3, seed=0, device=device_set,
                # coef_method='svd',
                ).initialize_from_another_model(model, dataset['train_input'])
    results = model.train(dataset, opt="LBFGS", steps=20, stop_grid_update_step=30, lamb=0.1, lr=0.1)
    train_rmse.append(results['train_loss'][-1].item())
    test_rmse.append(results['test_loss'][-1].item())

But there was no change in the loss (seed=0, lamb=0.1, lr=0.1):

train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 8.97e+00 : 100%|██| 20/20 [00:15<00:00,  1.30it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 7.38e+00 : 100%|██| 20/20 [00:15<00:00,  1.29it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 7.38e+00 : 100%|██| 20/20 [00:16<00:00,  1.22it/s]
train loss: 3.54e+00 | test loss: 3.62e+00 | reg: 5.51e+00 : 100%|██| 20/20 [00:18<00:00,  1.11it/s]
train loss: 3.55e+00 | test loss: 3.63e+00 | reg: 7.13e+00 : 100%|██| 20/20 [00:23<00:00,  1.16s/it]

Normalization, a different optimizer, and different arguments such as seed, lr, and lamb did not help. E.g., seed=1253, lamb=1, lr=1:

train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 6.94e+00 : 100%|██| 20/20 [00:14<00:00,  1.36it/s]
train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 6.62e+00 : 100%|██| 20/20 [00:14<00:00,  1.36it/s]
train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 6.53e+00 : 100%|██| 20/20 [00:15<00:00,  1.28it/s]
train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 6.75e+00 : 100%|██| 20/20 [00:18<00:00,  1.09it/s]
train loss: 3.61e+00 | test loss: 3.68e+00 | reg: 7.13e+00 : 100%|██| 20/20 [00:24<00:00,  1.21s/it]

seed=1253, lamb=1, lr=0.01

train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 8.21e+00 : 100%|██| 20/20 [00:15<00:00,  1.31it/s]
train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 7.70e+00 : 100%|██| 20/20 [00:15<00:00,  1.28it/s]
train loss: 3.61e+00 | test loss: 3.69e+00 | reg: 7.63e+00 : 100%|██| 20/20 [00:16<00:00,  1.19it/s]
train loss: 3.61e+00 | test loss: 3.68e+00 | reg: 7.86e+00 : 100%|██| 20/20 [00:18<00:00,  1.06it/s]
train loss: 3.60e+00 | test loss: 3.68e+00 | reg: 8.16e+00 : 100%|██| 20/20 [00:23<00:00,  1.20s/it]

seed=1253, lamb=0, lr=0.01

train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 2.49e+01 : 100%|██| 20/20 [00:08<00:00,  2.36it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 2.49e+01 : 100%|██| 20/20 [00:08<00:00,  2.43it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 2.49e+01 : 100%|██| 20/20 [00:04<00:00,  4.16it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 2.46e+01 : 100%|██| 20/20 [00:06<00:00,  2.88it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 2.46e+01 : 100%|██| 20/20 [00:06<00:00,  3.03it/s]

seed=1253, lamb=0, lr=0.01, width=[4,2,2,1]

train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 1.65e+01 : 100%|██| 20/20 [00:05<00:00,  3.91it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 1.65e+01 : 100%|██| 20/20 [00:03<00:00,  5.90it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 1.65e+01 : 100%|██| 20/20 [00:03<00:00,  5.78it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 1.34e+01 : 100%|██| 20/20 [00:09<00:00,  2.03it/s]
train loss: 3.54e+00 | test loss: 3.61e+00 | reg: 1.71e+01 : 100%|██| 20/20 [00:11<00:00,  1.70it/s]

I can provide more information if needed. I'm not sure whether this is due to the model settings, noise in the observations, or some other reason, or whether this is simply the best loss the model can achieve.

KindXiaoming commented 5 months ago

Hi, a few observations look suspicious: (1) the train loss staying the same across different grid sizes is a bit suspicious; (2) some activation functions look quite oscillatory in the final plot.

I don't really have any good advice off the top of my head, but it might be worth trying other models (e.g., MLPs) to benchmark the complexity of your dataset.
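
For reference, a minimal sketch of such an MLP baseline in PyTorch. The hidden widths, optimizer, and step count below are illustrative assumptions, not taken from the thread; it reuses the pykan-style dataset dict and assumes its tensors already live on device_set.

import torch
import torch.nn as nn

# Hypothetical MLP baseline for the [1600, 4] -> [1600, 1] regression task.
mlp = nn.Sequential(
    nn.Linear(dataset['train_input'].shape[1], 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
).to(device_set)

optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(2000):
    optimizer.zero_grad()
    pred = mlp(dataset['train_input'])            # assumes labels have shape [1600, 1]
    loss = loss_fn(pred, dataset['train_label'])
    loss.backward()
    optimizer.step()

with torch.no_grad():
    test_rmse_mlp = loss_fn(mlp(dataset['test_input']), dataset['test_label']).sqrt()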

wkqian06 commented 5 months ago

Hi, thanks for sharing your thoughts. I'll look into benchmarks for this dataset.

wkqian06 commented 5 months ago

I don't really have any good advice off the top of my head, but it might be worth trying other models (e.g., MLPs) to benchmark the complexity of your dataset.

I just tried an MLP on the dataset, and its MSE loss was much smaller than the loss from KAN, which is acceptable. What is strange is that the KAN training process seems to fail on my dataset, especially when comparing the predictions against the true labels.

[image: prediction]

KindXiaoming commented 5 months ago

Interesting! I just realized that you set lamb=0.1, which might be too high; please try lamb=0.0 and see what happens.
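
For reference, a minimal sketch of that change, keeping the other arguments from the original training call:

# Same call as in the first post, with the sparsity regularization disabled.
results = model.train(dataset, opt="LBFGS", steps=20,
                      stop_grid_update_step=30, lamb=0.0, lr=0.1)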

wkqian06 commented 5 months ago

Interesting! I just realized that you set lamb=0.1, which might be too high; please try lamb=0.0 and see what happens.

It still doesn't work. As shown at the beginning, changing the arguments did not improve training; the whole training process seems to fail. I don't know why the predicted labels always saturate as the true labels increase. The following plot is under the lamb=0.0 setting; for the MLP, the plot is a nice diagonal line.

[image: prediction2]

KindXiaoming commented 5 months ago

Very interesting. A quick observation is that the true labels span a very wide range; this could cause KAN to fail, because KANs use uniform grids by default. To test this hypothesis, you could try training on only the examples with small labels and see if this helps.

You could also try KAN(..., grid_eps=0.02, ...), which switches KAN to adaptive grids based on the sample distribution.
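
For reference, a minimal sketch of that suggestion applied to the constructor from the first post (all other settings unchanged):

# grid_eps blends the default uniform grid with a grid placed according to
# the sample distribution (adaptive grid).
model = KAN(width=[x.shape[1], 4, 4, 1], grid=3, k=3, seed=0,
            device=device_set, grid_eps=0.02)
model.update_grid_from_samples(dataset['train_input'])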

Yagnik12599 commented 5 months ago

Hi! Were you able to make it work? I'm currently facing a similar issue, and grid_eps didn't work either.

wkqian06 commented 5 months ago

You could also try KAN(..., grid_eps=0.02, ...), which switches KAN to adaptive grids based on the sample distribution.

Hi! Were you able to make it work? I'm currently facing a similar issue, and grid_eps didn't work either.

Sorry for the late update; grid_eps did not work in my case. Something interesting is that although the original and normalized datasets did not work, the min-max scaled dataset seemed to work to some extent. I'm confused as to why min-max scaling was the best strategy here, especially given that the adaptive grids did not help.

[image: grid_eps_minamx]
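
For reference, a minimal sketch of min-max scaling the pykan dataset dict to [0, 1] using training-set statistics; whether the labels were scaled as well is an assumption, since the thread does not say.

# Compute min/max on the training split only, then apply to both splits.
x_min = dataset['train_input'].min(dim=0, keepdim=True).values
x_max = dataset['train_input'].max(dim=0, keepdim=True).values
y_min = dataset['train_label'].min()
y_max = dataset['train_label'].max()
for split in ['train', 'test']:
    dataset[f'{split}_input'] = (dataset[f'{split}_input'] - x_min) / (x_max - x_min)
    dataset[f'{split}_label'] = (dataset[f'{split}_label'] - y_min) / (y_max - y_min)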

Yagnik12599 commented 5 months ago

That does sound interesting! However, in my case neither MinMax nor MaxAbsScaler worked. Did you do any other preprocessing apart from MinMax scaling? It is strange, because in many other cases KAN did seem to predict as well as, if not better than, an MLP, but in some instances it just doesn't improve much (the training loss not changing being the issue).

wkqian06 commented 5 months ago

That does sound interesting! However, in my case neither MinMax nor MaxAbsScaler worked. Did you do any other preprocessing apart from MinMax scaling? It is strange, because in many other cases KAN did seem to predict as well as, if not better than, an MLP, but in some instances it just doesn't improve much (the training loss not changing being the issue).

No, I only did MinMax scaling. A weighted loss works for this imbalanced dataset and reduces the MAE. But even though the final plots looked better, I still wouldn't call it a success, because the pattern for observations below 20 is still strange.

[image: scatterplot]

I wonder why training stalls so quickly in my case (the training loss stops changing), ending up with poor performance.
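
For reference, a minimal sketch of a sample-weighted MSE of the kind mentioned above. The weighting scheme (inverse frequency of histogram bins over the labels) is an assumption; the thread does not describe the exact weights used.

import torch

def inverse_frequency_weights(labels, n_bins=20):
    # Per-sample weights proportional to the inverse frequency of each label's
    # histogram bin, so that rare (e.g. large) labels contribute more to the loss.
    lo, hi = labels.min().item(), labels.max().item()
    counts = torch.histc(labels, bins=n_bins, min=lo, max=hi)
    edges = torch.linspace(lo, hi, n_bins + 1, device=labels.device)[1:-1]
    bin_idx = torch.bucketize(labels, edges)
    weights = 1.0 / (counts[bin_idx] + 1e-6)
    return weights / weights.mean()

def weighted_mse(pred, target, weights):
    return (weights * (pred - target) ** 2).mean()

If the installed pykan version exposes a loss_fn argument on model.train, a closure such as lambda p, t: weighted_mse(p, t, w_train) could be passed in (assuming full-batch training so the precomputed weights line up with the batch).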