KindXiaoming / pykan

Kolmogorov Arnold Networks
MIT License

Let's face real world dataset #63

Open yuhai-china opened 3 months ago

yuhai-china commented 3 months ago

So far, all the examples I have found are generated by some known function, and I want to test KAN on real-world data. I chose the well-known Boston house prices dataset from https://www.kaggle.com/datasets/vikrishnan/boston-house-prices

Here is my test code; the test loss is very bad. Maybe my settings are wrong. Please let me know if anybody can get it to work successfully.

from kan import KAN
import torch
import pandas as pd
from sklearn import preprocessing

# Standardize the feature columns before training
scaler = preprocessing.StandardScaler()

def create_boston_house_data(train_num=450):
    # Load the dataset (whitespace-delimited, no header row)
    column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    data = pd.read_csv('./housing.csv', header=None, delimiter=r"\s+", names=column_names)
    print(data.head(5))
    # Shuffle the rows
    data = data.sample(frac=1).reset_index(drop=True)
    # Every column except the target MEDV is a feature
    column_sels = column_names[:-1]
    x = pd.DataFrame(data=scaler.fit_transform(data[column_sels]), columns=column_sels)
    print(x)
    # First train_num rows for training, the rest for testing (no overlapping row)
    x_train = x.iloc[:train_num]
    y_train = data['MEDV'].iloc[:train_num]
    x_test = x.iloc[train_num:]
    y_test = data['MEDV'].iloc[train_num:]

    dataset = {}
    dataset['train_input'] = torch.from_numpy(x_train.values).float()
    dataset['test_input'] = torch.from_numpy(x_test.values).float()

    dataset['train_label'] = torch.from_numpy(y_train.values).float()
    dataset['test_label'] = torch.from_numpy(y_test.values).float()

    return dataset

dataset = create_boston_house_data()
model = KAN(width=[13,13,13,6,6,2,1], grid=10, k=3, seed=0)
model.train(dataset, opt="Adam", steps=250, lamb=0.001, lamb_entropy=2.);
print(model)

output

train loss: 9.34e+00 | test loss: 7.98e+00 | reg: 1.32e+03 : 100%|█| 250/250 [00:45<00:00,  5.50it/s
KAN(
  (biases): ModuleList(
    (0-1): 2 x Linear(in_features=13, out_features=1, bias=False)
    (2-3): 2 x Linear(in_features=6, out_features=1, bias=False)
    (4): Linear(in_features=2, out_features=1, bias=False)
    (5): Linear(in_features=1, out_features=1, bias=False)
  )
  (act_fun): ModuleList(
    (0-5): 6 x KANLayer(
      (base_fun): SiLU()
    )
  )
  (base_fun): SiLU()
  (symbolic_fun): ModuleList(
    (0-5): 6 x Symbolic_KANLayer()
  )
)
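
One possible contributor to the loss above, offered as an assumption rather than a confirmed diagnosis: the labels built in create_boston_house_data are 1-D tensors of shape (N,), while a network whose last layer has width 1 returns predictions of shape (N, 1). If the loss is an elementwise mean squared error, PyTorch broadcasts the two into an (N, N) matrix instead of comparing matching rows. The sketch below uses made-up tensors and a hand-written MSE to show the effect; whether the default loss in your pykan version behaves exactly this way is something to check.

```python
import torch

torch.manual_seed(0)
n = 8
pred = torch.randn(n, 1)             # stand-in for model output, shape (N, 1)
label_1d = pred.squeeze(1).clone()   # "perfect" labels, but shape (N,)

# Elementwise MSE: (N, 1) - (N,) broadcasts to (N, N)
diff_bad = pred - label_1d
loss_bad = torch.mean(diff_bad ** 2)

# Give the labels an explicit column dimension so the rows line up
label_2d = label_1d.unsqueeze(1)     # shape (N, 1)
diff_ok = pred - label_2d
loss_ok = torch.mean(diff_ok ** 2)

print(diff_bad.shape, loss_bad.item())
print(diff_ok.shape, loss_ok.item())
```

In the dataset dict this would amount to ending the two label lines with .float().unsqueeze(1).
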
guyko81 commented 3 months ago

Your model is too complicated; there are gradient issues if the equation is too large. Try something smaller: fewer input variables and a smaller function. Look at the result and get an understanding of how this model can help you. In my understanding, this package is not meant to create a state-of-the-art model out of the box, but to let you write a model that is small yet still gives a good prediction for your problem. And since it's small, if you're lucky, you can replace the splines with functions, and then you have a closed-form (but definitely not always intuitive) equation that can make predictions without a black-box gradient boosting or MLP model. Maybe it's state of the art in the sense that it generalizes beyond the training data distribution, because it's an equation. But the whole point is that you have to put your own hard work and understanding into the problem; it is not just a drop-in replacement for any machine learning model. At least I use it this way :) Good luck! :)

yuhai-china commented 3 months ago

It looks like KAN is just a toy for very simple problems. I recommend Jerome Friedman's rule ensembles, which can solve complex problems and also give good explanations: https://christophm.github.io/interpretable-ml-book/rulefit.html

guyko81 commented 3 months ago

Although I'm happy with the link you gave, I'm sure pykan is not just a toy. I'm not the author, but your conclusion was based on my personal understanding, so I feel responsible for your negative opinion, which feels bad.

yuhai-china commented 3 months ago

I do hope KAN turns out to be a powerful new tool, and I will take this weekend to study its code. Thank you!

KindXiaoming commented 3 months ago

Hi, I agree with @guyko81; that's basically what I wanted to say. From my experience (which is also quite limited, tbh), it is best to start from a model as small and simple as possible, even just width=[13,1,1], grid=3, and then gradually expand it until it works. Your 6-layer KAN can be unstable to train with my implementation (I don't include extra tricks like normalization), and grid=10 seems too large (it can potentially cause optimization problems). You may also try lamb=0 first; if that works, you can then try increasing lamb for better interpretability. It would also be good to run other methods first (linear regression or an MLP) to get a sense of how hard the dataset is; otherwise, if KAN fails, we don't know whether it's because of KAN or because the dataset is too hard (too little data, too noisy, etc.).
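
The baseline-first advice can be sketched as follows. The data here is synthetic (13 random features with a noisy linear target) because housing.csv is not part of this thread, and names such as make_fake_housing are hypothetical. The snippet fits ordinary least squares with torch.linalg.lstsq and reports train/test RMSE, which gives a reference number to compare any KAN against.

```python
import torch

def make_fake_housing(n=506, d=13, seed=0):
    # Hypothetical stand-in for the real data: noisy linear target
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, d, generator=g)
    w = torch.randn(d, 1, generator=g)
    y = x @ w + 22.0 + 0.5 * torch.randn(n, 1, generator=g)
    return x, y  # y already has shape (n, 1)

def rmse(pred, target):
    return torch.sqrt(torch.mean((pred - target) ** 2)).item()

def add_bias(a):
    # Append a column of ones so the linear model has an intercept
    return torch.cat([a, torch.ones(a.shape[0], 1)], dim=1)

x, y = make_fake_housing()
train_num = 450
x_train, x_test = x[:train_num], x[train_num:]
y_train, y_test = y[:train_num], y[train_num:]

# Standardize with statistics from the training rows only
mean, std = x_train.mean(dim=0), x_train.std(dim=0)
x_train, x_test = (x_train - mean) / std, (x_test - mean) / std

# Linear regression baseline via least squares
coef = torch.linalg.lstsq(add_bias(x_train), y_train).solution
baseline_train = rmse(add_bias(x_train) @ coef, y_train)
baseline_test = rmse(add_bias(x_test) @ coef, y_test)

# Predicting the training mean everywhere is an even simpler reference
mean_test = rmse(y_train.mean().expand_as(y_test), y_test)

print(f"linear train RMSE {baseline_train:.3f}, test RMSE {baseline_test:.3f}")
print(f"mean-only test RMSE {mean_test:.3f}")

# Same dict layout as in the original post, with (N, 1) labels
dataset = {
    "train_input": x_train, "train_label": y_train,
    "test_input": x_test, "test_label": y_test,
}
```

A dict built this way can then be passed to a small KAN such as KAN(width=[13,1,1], grid=3) trained with lamb=0, as suggested above. If the KAN cannot get near the linear baseline, that hints the setup deserves a second look before the dataset is blamed.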

RubensZimbres commented 1 month ago

Same situation, for the Kaggle dataset: binary classification https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The issue I face is that even with random variables, R2 is extremely high. Do KANs overfit everything, even random data? Or am I missing something?

Here's the code

from kan import KAN
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# Load the data and drop the index column added by reset_index()
df = pd.read_csv('creditcard.csv').reset_index()
df = df.iloc[:, 1:]

# Shuffle and keep 30,000 rows
df = df.dropna().sample(frac=1).iloc[:30000]

# Overwrite the first three features with uniform random noise
df['V1'] = np.random.random(df.shape[0])
df['V2'] = np.random.random(df.shape[0])
df['V3'] = np.random.random(df.shape[0])

dataset = {}
# Columns 1-3 are V1, V2, V3 (column 0 is Time)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, [1, 2, 3]], np.array(df.Class), test_size=0.2, random_state=42, shuffle=True)
# Inputs as float32, labels as int64 class indices for CrossEntropyLoss
dataset['train_input'] = torch.from_numpy(np.array(X_train)).float()
dataset['test_input'] = torch.from_numpy(np.array(X_test)).float()
dataset['train_label'] = torch.from_numpy(np.array(y_train)).long()
dataset['test_label'] = torch.from_numpy(np.array(y_test)).long()

model = KAN(width=[3, 1, 5], grid=5, k=2)  # 3 inputs, 1 hidden node, 5 outputs

def train_acc():
    return torch.mean((torch.argmax(model(dataset['train_input']), dim=1) == dataset['train_label']).float())

def test_acc():
    return torch.mean((torch.argmax(model(dataset['test_input']), dim=1) == dataset['test_label']).float())

results = model.train(dataset, opt="Adam", steps=50, metrics=(train_acc, test_acc), loss_fn=torch.nn.CrossEntropyLoss());

out

Thanks in advance
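
One likely explanation, stated as an assumption about this run: the credit card fraud dataset is extremely imbalanced (roughly 0.17% of the transactions are labelled as fraud), so a classifier that always predicts class 0 already scores above 99% accuracy, whatever its inputs are. The sketch below uses synthetic labels at a similar positive rate, not the real CSV, and compares plain accuracy with per-class recall for such a constant predictor.

```python
import torch

torch.manual_seed(0)
n = 30000
# Synthetic labels with about 0.17% positives (assumed rate, not read from the CSV)
labels = (torch.rand(n) < 0.0017).long()

# A "model" that ignores its inputs and always predicts class 0
preds = torch.zeros(n, dtype=torch.long)

accuracy = (preds == labels).float().mean().item()

def recall(preds, labels, cls):
    # Fraction of true members of `cls` that were predicted as `cls`
    mask = labels == cls
    return (preds[mask] == cls).float().mean().item()

recall_0 = recall(preds, labels, 0)
recall_1 = recall(preds, labels, 1)
balanced_accuracy = 0.5 * (recall_0 + recall_1)

print(f"accuracy          {accuracy:.4f}")
print(f"recall (class 0)  {recall_0:.4f}")
print(f"recall (class 1)  {recall_1:.4f}")
print(f"balanced accuracy {balanced_accuracy:.4f}")
```

If the accuracy reported by train_acc and test_acc is close to the share of class 0 in the data, the model has probably learned the class prior rather than anything about the random inputs; a per-class metric makes that visible.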

seyidcemkarakas commented 2 days ago

Can I ask you something? How can we predict on OOT data using a trained KAN model?
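
A minimal sketch of one way to do this, assuming OOT means out-of-time data (rows collected after the training period). Elsewhere in this thread the model is called directly on a tensor, as in model(dataset['train_input']), so prediction on new rows is the same forward call, provided the new rows go through the same preprocessing as the training data. The torch.nn.Linear below is only a stand-in so the snippet runs without pykan; with a trained KAN you would use that object instead.

```python
import torch

torch.manual_seed(0)

# Stand-in for a trained model (hypothetical); replace with your trained KAN
model = torch.nn.Linear(3, 2)

# Statistics computed once on the training inputs and kept for later use
x_train = torch.randn(100, 3) * 5.0 + 10.0
train_mean, train_std = x_train.mean(dim=0), x_train.std(dim=0)

def predict(model, x_new):
    # Apply the training-time scaling, then run a forward pass without gradients
    x_scaled = (x_new - train_mean) / train_std
    model.eval()
    with torch.no_grad():
        logits = model(x_scaled)
    return torch.argmax(logits, dim=1)  # predicted class index per row

x_oot = torch.randn(7, 3) * 5.0 + 10.0  # made-up "later" rows
pred_classes = predict(model, x_oot)
print(pred_classes)
```

The important design choice is that train_mean and train_std come from the training period only; recomputing them on the OOT rows would hide any drift between the two periods.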