deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Different metric values for training with train_batch vs atomic functions #271

Closed georgemavrakis-wings closed 3 years ago

georgemavrakis-wings commented 3 years ago

Good afternoon,

I have encountered different loss and metric values when training a network with the train_batch function versus with the atomic functions.

More specifically, here is my code in the two versions, after the network is built (in Python):

1st version (atomic functions - https://github.com/deephealthproject/eddl/blob/master/examples/nn/1_mnist/10_mnist_rnn_func.cpp):

for epoch in range(1, 3):
    eddl.set_mode(net=network_obj, mode=1)
    eddl.reset_loss(network_obj)

    num_of_batches = Xtrain.shape[0] // batch_size
    xbatch = Tensor([batch_size, Xtrain.shape[1], Xtrain.shape[2]])
    ybatch = Tensor([batch_size, ytrain.shape[1]])

    for b in range(1, num_of_batches + 1):
        eddl.next_batch([Xtrain, ytrain], [xbatch, ybatch])
        eddl.zeroGrads(network_obj)
        eddl.forward(network_obj, [xbatch])
        eddl.backward(network_obj, [ybatch])
        eddl.update(network_obj)

        eddl.print_loss(network_obj, batch=b)
        print('\n')

[screenshot: loss/metric output of the atomic-functions version]

2nd version (train_batch):

for epoch in range(1, 3):
    eddl.set_mode(net=network_obj, mode=1)
    eddl.reset_loss(network_obj)

    num_of_batches = Xtrain.shape[0] // batch_size
    xbatch = Tensor([batch_size, Xtrain.shape[1], Xtrain.shape[2]])
    ybatch = Tensor([batch_size, ytrain.shape[1]])

    for b in range(1, num_of_batches + 1):
        eddl.next_batch([Xtrain, ytrain], [xbatch, ybatch])
        eddl.train_batch(network_obj, [xbatch], [ybatch])

        eddl.print_loss(network_obj, batch=b)
        print('\n')

[screenshot: loss/metric output of the train_batch version]

I am not sure what I am doing wrong.

Thank you.

RParedesPalacios commented 3 years ago

Ok let me check it!

RParedesPalacios commented 3 years ago

On my side, using the C++ version:

train_batch was not ready for RNNs. It is now fixed in the develop branch, so please wait for the master release and the Python binding. In any case, the atomic-functions version works on my side. In particular, I am running this example:

https://github.com/deephealthproject/eddl/blob/develop/examples/nn/1_mnist/10_mnist_rnn_func.cpp

and I get this:

Epoch 1 Batch 599 share_27softmax3 ( loss[softmax_cross_entropy]=0.9510 metric[categorical_accuracy]=0.6688 ) --
Epoch 2 Batch 599 share_27softmax3 ( loss[softmax_cross_entropy]=0.1661 metric[categorical_accuracy]=0.9486 ) --

Although train_batch still doesn't work in your version, the atomic functions should. Perhaps you don't get an appropriate result with the atomic functions in your example because of some optimiser setup, learning rate, etc.
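For reference, a minimal sketch of where the optimiser and learning rate are set, assuming the network_obj, loss and metric names used in this thread (the 0.0001 value is purely illustrative, not a recommendation):

# Hypothetical build call with a smaller Adam learning rate (illustrative only)
eddl.build(
    net=network_obj,
    o=eddl.adam(0.0001),
    lo=["soft_cross_entropy"],
    me=["categorical_accuracy"],
    cs=eddl.CS_GPU())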

RParedesPalacios commented 3 years ago

Also, I think you are using an old version, since ybatch should be ybatch = Tensor([batch_size, 1, ytrain.shape[1]])

In the new recurrent version we also have to specify the output sequence length of ybatch, in this case "1" because there is a single output for the whole input sequence, but in general it could be a sequence as well.
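For reference, a minimal sketch of the shape change being described, assuming the same Xtrain, ytrain and batch_size names from the snippets above:

# Input batch: [batch_size, sequence_length, features] (unchanged)
xbatch = Tensor([batch_size, Xtrain.shape[1], Xtrain.shape[2]])

# Old target shape: [batch_size, num_classes]
# ybatch = Tensor([batch_size, ytrain.shape[1]])

# New recurrent target shape: [batch_size, output_sequence_length, num_classes],
# with output_sequence_length = 1 when there is a single label per input sequence
ybatch = Tensor([batch_size, 1, ytrain.shape[1]])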

In any case please check:

https://github.com/deephealthproject/eddl/blob/develop/examples/nn/1_mnist/10_mnist_rnn_func.cpp

to see how it is done in the new C++ version.

georgemavrakis-wings commented 3 years ago

Hello,

I did not use a recurrent network in this case; I used a CNN. All the code is identical in the two versions.

In any case, I will check further. Thank you.

RParedesPalacios commented 3 years ago

Hello, understood. Since you pointed to this example:

https://github.com/deephealthproject/eddl/blob/master/examples/nn/1_mnist/10_mnist_rnn_func.cpp

I thought that you were using an RNN, which is why I mentioned the problem with train_batch...

I am not sure what could be happening, but if you want you can post the whole code here so I can check it.

Also, please check that you have updated to the latest version.

georgemavrakis-wings commented 3 years ago

Good morning,

since you pointed out to this example: https://github.com/deephealthproject/eddl/blob/master/examples/nn/1_mnist/10_mnist_rnn_func.cpp

You are right, my mistake; I did not explain the purpose of the link. I referenced it only to show the atomic functions used for training.

And check that you have updated to the last version.

I use pyeddl version 0.13.0

Not sure what could happen but If you want you can write here the whole code to let me check.

The full code is the following (I show an example with a smaller network):

import os
import numpy as np
from pyeddl import eddl
from pyeddl.tensor import Tensor

way = 'analytical'             # 'analytical' (atomic functions) or 'train_batch'

# Network:
X = np.load(os.path.join('.', 'Xtrain.npy'))
Y = np.load(os.path.join('.', 'Ytrain.npy'))

input = eddl.Input(shape=[X.shape[1], X.shape[2]], name='INPUT')
layer = input
parent_layer = eddl.Conv1D(parent=layer, filters=128, kernel_size=[5], padding="same")
parent_layer = eddl.Unsqueeze(parent=parent_layer, axis=2)
parent_layer = eddl.BatchNormalization(parent=parent_layer, affine=True, momentum=0.1)
parent_layer = eddl.Squeeze(parent=parent_layer)     # return to 3D
parent_layer = eddl.Selu(parent=parent_layer)
parent_layer = eddl.GlobalAveragePool1D(parent=parent_layer)
parent_layer = eddl.Squeeze(parent=parent_layer)
parent_layer = eddl.Dense(parent=parent_layer, ndim=2)
out = eddl.Activation(parent=parent_layer, activation='softmax')

net = eddl.Model(in_=[input], out=[out])

eddl.build(
        net=net,
        o=eddl.adam(0.001),
        lo=["soft_cross_entropy"],
        me=["categorical_accuracy"],
        cs=eddl.CS_GPU())

X_train = Tensor.fromarray(X)
y_train = Tensor.fromarray(Y)

# Training:

for epoch in range(1, 2 + 1):

    print('*'*30 + '\nEpoch {}\n'.format(epoch) + '*'*30)
    eddl.set_mode(net=net, mode=1)

    eddl.reset_loss(net)

    num_of_batches = X.shape[0] // 16
    xbatch = Tensor([16, X_train.shape[1], X_train.shape[2]])
    ybatch = Tensor([16, y_train.shape[1]])

    for b in range(1, num_of_batches + 1):

        eddl.next_batch([X_train, y_train], [xbatch, ybatch])

        if way == 'analytical':
            eddl.zeroGrads(net)
            eddl.forward(net, [xbatch])
            eddl.backward(net, [ybatch])
            eddl.update(net)
        else:
            eddl.train_batch(net, [xbatch], [ybatch])

        eddl.print_loss(net, batch=b)
        print('\n')

The above code gives the following metric results:

[screenshot: metric results for both versions]

RParedesPalacios commented 3 years ago

Hi, please do two consecutive runs with the same setup, for instance both with train_batch. I am not sure if randomisation is affecting the results.

georgemavrakis-wings commented 3 years ago

Hello,

I ran two consecutive runs. The results are practically identical:

Run 1 (atomic functions):
Epoch 1: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.6554 metric[categorical_accuracy]=0.6143 )
Epoch 2: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.5615 metric[categorical_accuracy]=0.6856 )

Run 2 (atomic functions):
Epoch 1: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.6554 metric[categorical_accuracy]=0.6143 )
Epoch 2: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.5615 metric[categorical_accuracy]=0.6861 )

Run 1 (train_batch):
Epoch 1: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.6615 metric[categorical_accuracy]=0.5690 )
Epoch 2: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.5177 metric[categorical_accuracy]=0.7621 )

Run 2 (train_batch):
Epoch 1: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.6615 metric[categorical_accuracy]=0.5690 )
Epoch 2: Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.5177 metric[categorical_accuracy]=0.7621 )

georgemavrakis-wings commented 3 years ago

By the way, I also ran for 10 epochs. The last-epoch result from train_batch is:

Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.2202 metric[categorical_accuracy]=0.9184 )

and from the analytical (atomic) functions:

Batch 134 softmax2 ( loss[softmax_cross_entropy]=0.2626 metric[categorical_accuracy]=0.8923 )

These are much closer than in the two-epoch run. So maybe convergence with the analytical functions is just slower?

RParedesPalacios commented 3 years ago

Ok, thanks. Clearly the differences are not "critical", but I am not sure why they are not almost the same... we have to check it.

@salvacarrion any thoughts, please?

salvacarrion commented 3 years ago

At first sight, no idea. I need to run some tests

chavicoski commented 3 years ago

Hello, I have found what causes the discrepancy. The forward function used in the analytical mode always performs the forward pass in inference mode, so the outputs of the forward are not correct and the training is not equivalent. I fixed it on the develop branch, so it will be available in the next release. Thank you!!
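To illustrate why training vs inference mode matters for the network above (which contains BatchNormalization), here is a small NumPy sketch, not EDDL code, of how a BatchNorm forward differs between training mode (batch statistics) and inference mode (running statistics); the running statistics below are made-up values:

import numpy as np

x = np.random.randn(16, 128).astype(np.float32)    # one batch of activations

# Training mode: normalise with this batch's own statistics
mu_b, var_b = x.mean(axis=0), x.var(axis=0)
y_train = (x - mu_b) / np.sqrt(var_b + 1e-5)

# Inference mode: normalise with (made-up) running statistics instead
running_mu = np.zeros(128, dtype=np.float32)
running_var = np.ones(128, dtype=np.float32)
y_infer = (x - running_mu) / np.sqrt(running_var + 1e-5)

# Different forward outputs lead to different gradients, hence the
# different loss/metric curves between the two training loops.
print(np.abs(y_train - y_infer).mean())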

georgemavrakis-wings commented 3 years ago

Hello, @chavicoski thank you!