deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/

Different results of recurrent GRU network during different executions #294

Closed georgemavrakis-wings closed 3 years ago

georgemavrakis-wings commented 3 years ago

Good afternoon,

I have implemented a basic GRU -> Dense -> Softmax architecture. I split the dataset into training, validation and test sets. During training, the model's weights are saved based on the model's performance on the validation set. When I measure the performance on the test set, the results differ between runs of the whole experiment. For example, the confusion matrix for run 1 is [[0 334], [0 74]], for run 2 [[12 322], [4 70]], and for run 3 [[94 240], [21 53]].

I cannot figure out the cause of the problem.

In all experiments, the data fed to the network is the same. Pyeddl version: 1.0.0. For training and inference, train_batch and eval_batch are used, roughly as in the sketch below.
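For reference, the setup described reads roughly like the following pyeddl sketch. All sizes, hyper-parameters and array contents are assumptions, since the original code is not included in this issue.

```python
# Hypothetical sketch of the GRU -> Dense -> Softmax setup described above.
# Shapes, hyper-parameters and the random batch are assumptions.
import numpy as np
from pyeddl import eddl
from pyeddl.tensor import Tensor

n_features, n_classes = 16, 2                    # assumed input/output sizes
in_ = eddl.Input([n_features])
layer = eddl.GRU(in_, 64)                        # recurrent GRU layer
out = eddl.Softmax(eddl.Dense(layer, n_classes))
net = eddl.Model([in_], [out])

eddl.build(
    net,
    eddl.adam(1e-4),                             # learning rate under discussion
    ["soft_cross_entropy"],
    ["categorical_accuracy"],
    eddl.CS_CPU(),
)

# One training step and one evaluation step, as in the issue.
# x has shape (batch, time, features); y is one-hot (batch, classes).
x = Tensor.fromarray(np.random.rand(32, 10, n_features).astype(np.float32))
y = Tensor.fromarray(np.eye(n_classes, dtype=np.float32)[np.random.randint(0, n_classes, 32)])
eddl.train_batch(net, [x], [y])
eddl.eval_batch(net, [x], [y])
```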

Thank you in advance.

jonandergomez commented 3 years ago

Dear George, good afternoon too,

The behavior you are describing can be due to several factors: the learning rate, for instance. Additionally, the net is initialized randomly at every run, which can explain the differences in the results. In any case, the results are really bad, even bizarre, because misclassifying by assigning all the samples to one class, the class with more samples in training, usually leads to confusion matrices that are the opposite. Let me ask some questions:

1. If you are working with a data transformation to feed the net, can we consider the data anonymized, so that we at UPV can perform some tests with the same data, both with similar topologies and with different ones?

2. Could you create a new instance of your topology and save it to a file just after initialization (see the sketch after this list)? If so, could you execute several runs with different learning rates? I expect that if you use the same data (with no shuffling for training) you will get the same results in every run with the same learning rate. If so, we can discard a possible memory leak. Additionally, with this experiment we will analyze the impact of the learning rate on the combination of dataset + net topology.

3. Can you try the same network topology with the same data using another topology?
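A sketch of what question 2 asks for, under the same assumed topology as in the earlier snippet (the file name is arbitrary):

```python
# Sketch for question 2: pin the initialization across runs (pyeddl).
from pyeddl import eddl

# Build the assumed tiny topology from the earlier sketch.
in_ = eddl.Input([16])
out = eddl.Softmax(eddl.Dense(eddl.GRU(in_, 64), 2))
net = eddl.Model([in_], [out])
eddl.build(net, eddl.adam(1e-4), ["soft_cross_entropy"],
           ["categorical_accuracy"], eddl.CS_CPU())

# Once, right after eddl.build: save the freshly initialized weights.
eddl.save(net, "initial_weights.bin")

# At the start of every subsequent run, after rebuilding the same topology,
# reload them so all runs start from an identical initialization.
eddl.load(net, "initial_weights.bin")
```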

My apologies for asking, but we need your collaboration.

Regards,

Jon

jonandergomez commented 3 years ago

Sorry, in the last question of my previous entry in this thread I meant to say: 3. Can you try the same network topology with the same data using another toolkit?

Regards,

Jon

georgemavrakis-wings commented 3 years ago

Dear Jon, Sorry for my delayed response.

Considering the 1st question, I am providing the transformed dataset that is fed to the network as a Python dictionary; the 'X' key contains the data and the 'y' key the labels.

Considering the 2nd question, all runs were executed with the same number of epochs, batch size and learning rate. I have also executed different runs with different learning rates:

- 1e-3: run 1: [[129 23], [30 10]], run 2: [[0 152], [0 40]], run 3: [[0 152], [0 40]]
- 1e-4: run 1: [[50 102], [12 28]], run 2: [[84 68], [14 26]], run 3: [[162 172], [26 48]]
- 1e-5: run 1: [[58 94], [21 19]], run 2: [[13 139], [3 37]], run 3: [[58 94], [17 23]]

Considering the 3rd question, I have run the same experiments with PyTorch. There, the results between different runs are pretty much the same as each other. For lr = 1e-4, the confusion matrices are: run 1: [[104 54], [19 21]], run 2: [[106 52], [19 21]].
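A minimal PyTorch counterpart of the topology under discussion might look like the sketch below; layer sizes and names are assumptions, not the actual experiment code.

```python
# Hypothetical PyTorch equivalent of the GRU -> Dense -> Softmax topology.
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_features=16, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, time, features)
        _, h = self.gru(x)            # h: (1, batch, hidden), last hidden state
        return self.fc(h.squeeze(0))  # logits; CrossEntropyLoss applies softmax

model = GRUClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 10, 16)           # dummy batch: (batch, time, features)
y = torch.randint(0, 2, (32,))        # dummy integer class labels
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```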

Best regards, George

data.tar.gz

jonandergomez commented 3 years ago

Thanks a lot! I will test it as soon as possible and give you an answer.

Jon

jonandergomez commented 3 years ago

Dear George,

We finally managed to get to the bottom of this issue. At the following link you can find a more detailed answer, which we hope will be useful to other people.

https://github.com/deephealthproject/eddl_issues/tree/main/issue_294

Basically, we could not reproduce the behavior you reported in this issue. In the repository you will find Python code with a very basic example to test it.

We observed that it is necessary to use the Adam optimizer for our basic topology to learn anything. However, as you will see in the provided log files, this naive topology overfits, so no satisfactory results in terms of accuracy can be achieved with the sample data. Nevertheless, we can close this issue in the sense that the reported results show that both the GRU and the LSTM layer types perform as expected.
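For context, the optimizer is selected at build time in pyeddl; a sketch with assumed sizes and learning rates follows.

```python
# Sketch of swapping optimizers at build time in pyeddl; the topology and
# hyper-parameters are assumptions. Per the comment above, Adam was needed
# for the basic topology to learn anything at all.
from pyeddl import eddl

in_ = eddl.Input([16])
out = eddl.Softmax(eddl.Dense(eddl.GRU(in_, 64), 2))
net = eddl.Model([in_], [out])

opt = eddl.adam(1e-4)          # the optimizer that allowed learning here
# opt = eddl.rmsprop(1e-4)     # alternative optimizer, for comparison
# opt = eddl.sgd(1e-2, 0.9)    # plain SGD with momentum, for comparison
eddl.build(net, opt, ["soft_cross_entropy"],
           ["categorical_accuracy"], eddl.CS_CPU())
```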

If you need any further assistance reviewing your code, we can arrange a meeting.

I will close this issue thread as soon as you confirm the problem is fixed.

Regards,

Jon

jonandergomez commented 3 years ago

Hi again,

We have to continue reviewing this issue because today, while repeating the experiments, we found discrepancies larger than expected. We do not observe differences as big as the ones shown by George, but the differences we are observing between different runs, when using the same initialization and not shuffling the training samples, cannot be accepted.

Sorry, we will be back with news as soon as possible.

Regards,

Jon

georgemavrakis-wings commented 3 years ago

Good morning,

Thank you a lot for the effort. I will check the code in eddl_issues and adapt it to match the one I used to run the experiments, and I will report any findings of note.

Best, George

georgemavrakis-wings commented 3 years ago

Hello,

I have managed to get relatively stable performance across experiments. However, the confusion matrix I get for learning rate = 1e-4 is really bad and quite different from PyTorch's. For instance, the confusion matrix from EDDL is [[331 3], [74 1]], and from PyTorch [[268 66], [53 22]].

When using a higher learning rate (1e-2), the performance gets better.
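For completeness, a confusion matrix of the kind quoted above can be computed in a few lines of NumPy; the rows-are-true / columns-are-predicted layout is an assumption about the convention used in this thread.

```python
# Minimal confusion-matrix helper for comparing runs (NumPy only).
# Rows index the true class, columns the predicted class -- an assumed
# convention matching the 2x2 matrices quoted in this thread.
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy usage with dummy labels and predictions.
print(confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1]))
```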

jonandergomez commented 3 years ago

Hi again,

I just completed the repository https://github.com/deephealthproject/eddl_issues/tree/main/issue_294 with interesting results.

A lot of runs with different configurations have been carried out, and the discrepancies appear when using the Adam and RMSprop optimizers. All details, code and graphical results are available in the repository.

At this moment we can close this issue; it can be reopened if required. I would ask you to run similar experiments with PyTorch if you can.

Regards,

Jon