PetraVidnerova / rbf_for_tf2

RBF Layer for tf.keras using Tensorflow 2.0 (work in progress!)
MIT License

The calculation of the width parameter? #2

Open 1165048017 opened 3 years ago

1165048017 commented 3 years ago

Hi, I am confused about this implementation:

def call(self, x):
    C = tf.expand_dims(self.centers, -1)   # (units, input_dim) -> (units, input_dim, 1)
    H = tf.transpose(C - tf.transpose(x))  # matrix of differences, (batch, input_dim, units)
    return tf.exp(-self.betas * tf.math.reduce_sum(H**2, axis=1))  # (batch, units)

I learned about RBF neural networks from this video. There is a width parameter: [image] and then [image]

I haven't seen the use of a width parameter in your code, so I modified the code like this:

def call(self, x):
    # K is the Keras backend (tf.keras.backend)
    C = K.expand_dims(self.centers)                       # (units, input_dim, 1)
    XC = K.transpose(K.transpose(x) - C)                  # differences, (batch, input_dim, units)
    D = K.expand_dims(K.sqrt(K.mean(XC**2, axis=0)), 0)   # width estimate from the batch
    H = XC / D                                            # differences normalized by the width
    return K.exp(-self.betas * K.sum(H**2, axis=1))

The full code with experiments is here.

My code overfits easily, but yours does not. Am I wrong?

Thank you for solving my problem~~~ ^_^

PetraVidnerova commented 3 years ago

Hi, I just use beta instead of width, i.e. instead of h = exp(-(x-c)^2 / r) I use h = exp(-beta * (x-c)^2), so it should be beta = 1/r.
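
A quick numerical sanity check of this equivalence (a minimal sketch; the values of x, c and r are arbitrary):

import numpy as np

# The two parameterizations agree when beta = 1/r.
x, c, r = 1.5, 0.5, 2.0
beta = 1.0 / r
assert np.isclose(np.exp(-(x - c)**2 / r), np.exp(-beta * (x - c)**2))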

PetraVidnerova commented 3 years ago

I see now that you modified the code; I will have a closer look at that later. Overfitting may be connected to too small a width: with narrow Gaussians you overfit more easily than with wider Gaussians.

1165048017 commented 3 years ago

Hi, I just use beta instead of width, i.e. instead of h = exp(-(x-c)^2 / r) I use h = exp(-beta * (x-c)^2), so it should be beta = 1/r.

@PetraVidnerova Thanks for your reply~~~ You mean that the width is learned directly! But looking at the theory, the width has its own equation. There are two ways, as implemented in this code using numpy. So I wonder whether it is possible to add these two ways to the Keras implementation, or whether it is unnecessary because the network learns them anyway.

PetraVidnerova commented 3 years ago

Both widths and centers are tuned during learning in my implementation, since here the whole network is trained using backpropagation (there are many ways of training an RBF network).
In the simple example test.py, the value 2.0 is provided, which means the widths are all initialized uniformly to this value. But you can write your own initializer and initialize the widths as you wish: instead of beta=2.0 in the constructor, use beta=MyBetaInitializer().

If you do not want them to change during learning, change trainable=True to trainable=False in the build method of RBFLayer.
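
A minimal sketch of such a custom initializer, assuming the layer passes it on to add_weight when creating the betas (MyBetaInitializer and its value argument are illustrative names, not part of the repository):

import tensorflow as tf

class MyBetaInitializer(tf.keras.initializers.Initializer):
    """Illustrative initializer: fills all betas with one fixed value."""
    def __init__(self, value=2.0):
        self.value = value

    def __call__(self, shape, dtype=None):
        # Same shape the layer requests for its betas weight.
        return tf.fill(shape, tf.cast(self.value, dtype or tf.float32))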

1165048017 commented 3 years ago

Both widths and centers are tuned during learning in my implementation, since here the whole network is trained using backpropagation (there are many ways of training an RBF network). In the simple example test.py, the value 2.0 is provided, which means the widths are all initialized uniformly to this value. But you can write your own initializer and initialize the widths as you wish: instead of beta=2.0 in the constructor, use beta=MyBetaInitializer().

If you do not want them to change during learning, change trainable=True to trainable=False in the build method of RBFLayer.

I reviewed the theory again. The width equation is derived from the center parameter, but in reality there should be a width coefficient df, like this: [image] So the learnable parameters in the RBF layer may be the centers and the width coefficient. The problem is that in your code the forward inference is: [image] Referring to the theory, it should be: [image] These two equations give different derivatives with respect to the center parameter: [image] This difference will lead to different updates for the center parameter during learning~~~ My opinion is that the parameter settings in your code can stay the same as before, but the inference function should be changed: [image]
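
The formula images are not preserved here; a hedged reconstruction of the proposed inference, read off the modified call() above (per-dimension normalization over the n training points), would be:

h_j(x) = \exp\!\left(-\beta_j \sum_{d} \frac{(x_d - c_{jd})^2}{\frac{1}{n}\sum_{i=1}^{n} (x_{id} - c_{jd})^2}\right)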

PetraVidnerova commented 3 years ago

That is a great comment! You are perfectly right.

Basically, there are several approaches to RBF network training. The basic one is to 1. set centers, 2. set widths, 3. calculate output weights, one by one.

The training in my code uses backpropagation and tunes all network params simultaneously. Backprop needs random initialization; however, the centers and widths are not initialized randomly but to meaningful values. I use random or k-means initialization for the centers, and for the widths just uniform values.

Your approach to widths suggests that they depend on the values of the centers, so the change of the centers during learning should also influence the values of the widths. If the centers were constant, we could hide the term (sqrt(1/n sum ...)) in beta, but since we are changing them, we should add this term to the computation of h.
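
A hedged sketch of what such a forward pass could look like, recomputing the width term from the current centers so that its gradient also flows into them (an illustration of the idea, not the repository's implementation; the width estimate below is one arbitrary choice):

import tensorflow as tf

def call(self, x):
    C = tf.expand_dims(self.centers, -1)   # (units, input_dim, 1)
    H = tf.transpose(C - tf.transpose(x))  # differences, (batch, input_dim, units)
    sq = tf.math.reduce_sum(H**2, axis=1)  # squared distances, (batch, units)
    # Width recomputed from the current (trainable) centers on every pass,
    # e.g. the mean squared distance of each center to all centers,
    # so changing the centers also changes the effective widths.
    diff = tf.expand_dims(self.centers, 0) - tf.expand_dims(self.centers, 1)
    width = tf.math.reduce_mean(tf.math.reduce_sum(diff**2, axis=-1), axis=1)  # (units,)
    return tf.exp(-self.betas * sq / width)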

1165048017 commented 3 years ago

That is a great comment! You are perfectly right.

Basically, there are several approaches to RBF network training. The basic one is to 1. set centers, 2. set widths, 3. calculate output weights, one by one.

The training in my code uses backpropagation and tunes all network params simultaneously. Backprop needs random initialization; however, the centers and widths are not initialized randomly but to meaningful values. I use random or k-means initialization for the centers, and for the widths just uniform values.

Your approach to widths suggests that they depend on the values of the centers, so the change of the centers during learning should also influence the values of the widths. If the centers were constant, we could hide the term (sqrt(1/n sum ...)) in beta, but since we are changing them, we should add this term to the computation of h.

Yes, k-means is only used to initialize the center parameter, which then changes during training. And if I specify the term (sqrt(1/n sum ...)) explicitly and modify the code like this: [image] I get a strange result: [image]

But your code gives the right result: [image]

I get a smaller training loss than your implementation (1.5925e-04 vs 6.8916e-04), but worse prediction results~~~~ T_T It is really strange. Is my modification wrong, or do I misunderstand the theory? The attached code is my experiment with your data. I would appreciate it if you had time to help me with that. Thank you very much.

rbf_for_tf2.zip

PetraVidnerova commented 3 years ago

Could you try a smaller number of hidden units?

1165048017 commented 3 years ago

Could you try a smaller number of hidden units?

At first, I thought it was overfitting, so I changed the hidden units from 10 to 5. But the loss was still small and the result was still bad: [image]

Then I found that it does not look like the model is overfitting, because the test data is the same as the training data in your code. I just wonder why the modified code gets a smaller loss than yours but an even worse prediction result. In my mind, a smaller MSE loss means the model fits the training data better, but the experiment does not seem to support that.

I checked that problem by adding the evaluation function: [image]
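
Presumably something along these lines (a hypothetical reconstruction of the screenshot, since the image is not preserved; model, X and y stand for the compiled network and the 50 training samples):

# Train, then evaluate on the very same data; this produces the two
# log lines shown below for each implementation.
model.compile(optimizer="adam", loss="mse", metrics=["mean_squared_error"])
model.fit(X, y, epochs=2000)
model.evaluate(X, y)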

The last epoch and the evaluation output look like this. Your code:

Epoch 2000/2000
50/50 [==============================] - 0s 40us/sample - loss: 5.0368e-04 - mean_squared_error: 5.0368e-04
50/50 [==============================] - 0s 339us/sample - loss: 5.0352e-04 - mean_squared_error: 5.0352e-04

My code:

Epoch 2000/2000
50/50 [==============================] - 0s 40us/sample - loss: 3.3930e-04 - mean_squared_error: 3.3930e-04
50/50 [==============================] - 0s 379us/sample - loss: 0.1876 - mean_squared_error: 0.1876

What a strange result from my code~~ T_T

PetraVidnerova commented 3 years ago

MSE is just the squared difference from the training data, so in this simple example it is really strange that you get a smaller MSE. It looks like something is going wrong. I will have to take a closer look at that. Thanks for your effort.
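
For reference, the loss being compared here is simply

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2

so a smaller training MSE should mean a better fit to the training points.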

PetraVidnerova commented 3 years ago

And we should set some points apart as a test set, since in this simple example only the training set is used, which is not a correct approach.
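
For example, a minimal hold-out split (one option, using scikit-learn; X and y are the dataset arrays):

from sklearn.model_selection import train_test_split

# Hold out 20% of the points so evaluation is not done on the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)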

1165048017 commented 3 years ago

In my first code, I made two simple datasets. The training set is:

import numpy as np

def test_data2(sample_number=1000):
    # Uniform points in the unit square, labeled by which coordinate is larger.
    all_data = np.random.rand(sample_number * 2, 2)
    data1 = all_data[all_data[..., 0] > all_data[..., 1]]
    data2 = all_data[all_data[..., 0] <= all_data[..., 1]]
    y1 = np.zeros((data1.shape[0], 1))  # class 0
    y2 = np.ones((data2.shape[0], 1))   # class 1

    train_data = np.vstack((data1, data2))
    train_label = np.vstack((y1, y2))

    shuffle_idx = np.arange(sample_number * 2)
    np.random.shuffle(shuffle_idx)

    train_data = train_data[shuffle_idx]
    train_label = train_label[shuffle_idx]
    return train_data, train_label

[image: plot of the generated training data]

The test set is:

## create dataset (two Gaussian blobs; used as the test set)
def test_data1(sample_number=1000):
    mean0 = [2, 7]
    cov = np.mat([[1, 0], [0, 2]])
    data1 = np.random.multivariate_normal(mean0, cov, sample_number)

    mean1 = [8, 3]
    cov = np.mat([[1, 0], [0, 2]])
    data2 = np.random.multivariate_normal(mean1, cov, sample_number)

    y1 = np.zeros((sample_number, 1))  # class 0
    y2 = np.ones((sample_number, 1))   # class 1

    train_data = np.vstack((data1, data2))
    train_label = np.vstack((y1, y2))

    shuffle_idx = np.arange(sample_number * 2)
    np.random.shuffle(shuffle_idx)

    train_data = train_data[shuffle_idx]
    train_label = train_label[shuffle_idx]
    return train_data, train_label

[image: plot of the generated test data]

The predicted result seems fine: [image]

But when I changed the test data to a line:

x1 = np.linspace(-2, 12, 1000)
x2 = np.linspace(-2, 12, 1000)
test_x = np.vstack((x1, x2)).T  # 1000 points along the line x1 == x2

[image: plot of the line test points]

On this line dataset, your code gets a reasonable result: [image]