butterfliesss / SDT


Problem of batch_size #3

Open NLPAlchemist opened 11 months ago

NLPAlchemist commented 11 months ago

Why does training only work with batch_size=1? As soon as it is greater than 1, the following happens:

```
epoch: 1, train_loss: nan, train_acc: 10.51, train_fscore: 5.0, valid_loss: nan, valid_acc: 6.78, valid_fscore: 0.86, test_loss: nan, test_acc: 8.87, test_fscore: 1.45, time: 1.87 sec
epoch: 2, train_loss: nan, train_acc: 8.92, train_fscore: 1.46, valid_loss: nan, valid_acc: 6.78, valid_fscore: 0.86, test_loss: nan, test_acc: 8.87, test_fscore: 1.45, time: 0.57 sec
epoch: 3, train_loss: nan, train_acc: 8.92, train_fscore: 1.46, valid_loss: nan, valid_acc: 6.78, valid_fscore: 0.86, test_loss: nan, test_acc: 8.87, test_fscore: 1.45, time: 0.6 sec
epoch: 4, train_loss: nan, train_acc: 8.92, train_fscore: 1.46, valid_loss: nan, valid_acc: 6.78, valid_fscore: 0.86, test_loss: nan, test_acc: 8.87, test_fscore: 1.45, time: 0.55 sec
epoch: 5, train_loss: nan, train_acc: 8.92, train_fscore: 1.46, valid_loss: nan, valid_acc: 6.78, valid_fscore: 0.86, test_loss: nan, test_acc: 8.87, test_fscore: 1.45, time: 0.62 sec
```
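
In case it helps with debugging, here is a generic PyTorch sketch (not from this repo; `model`, `loader`, `criterion`, and `optimizer` are placeholders) for catching the first batch where the loss goes non-finite:

```python
import torch

# Generic NaN diagnostic, not part of the SDT code: the arguments below are
# placeholders for the repo's actual objects.
torch.autograd.set_detect_anomaly(True)  # report the op that produced NaN/Inf during backward

def train_one_epoch(model, loader, criterion, optimizer):
    for step, (features, labels) in enumerate(loader):
        optimizer.zero_grad()
        logits = model(features)
        loss = criterion(logits, labels)
        if not torch.isfinite(loss):
            # Stop at the first bad batch so it can be inspected.
            print(f"non-finite loss at step {step}: {loss.item()}")
            break
        loss.backward()
        optimizer.step()
```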

gcp666 commented 10 months ago

I have the same problem. Have you solved it?

shenyujie1125 commented 10 months ago

I just downloaded the code and dataset, ran the code the way the README describes, and saw that the parameters were exactly the same as the parameters in your paper, but I just can't train: the training metrics don't change. Why is this? Did you upload the wrong version? I also changed the batch size, and the problem still exists. [screenshot]

18438622356 commented 10 months ago

I also have the same problem. How to solve it?

butterfliesss commented 10 months ago

Our code uses PyTorch 1.4.0 and runs without any issues. Due to the small size of the dataset, we did not set a random seed and instead chose to run the code multiple times. Additionally, to ensure a fair comparison, we used the same data partitioning as other comparison methods such as DialogueRNN and MMGCN, and did not use a validation set in practice.

[screenshot]
butterfliesss commented 10 months ago

@shenyujie1125 If the environment and hyperparameters are consistent with ours, you can run it several times, trying different random seeds.
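
Since the released code reportedly does not fix a seed, anyone who wants comparable runs can pin one with standard PyTorch/NumPy calls; a minimal sketch (not part of the SDT code):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix the common sources of randomness so separate runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade a little speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```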

shenyujie1125 commented 10 months ago

Then I'm wondering: is it a problem with the PyTorch version? I don't think so, but I just downloaded the code and ran it directly, so why is there such a different result? I ran it again a minute ago, and the result is still like this. [screenshot]

shenyujie1125 commented 10 months ago

@butterfliesss I'm going to try your environment right away, using PyTorch 1.4.0.

shenyujie1125 commented 10 months ago

@butterfliesss Thanks for helping me with my questions.

butterfliesss commented 10 months ago

@shenyujie1125 You're welcome. Setting up a new environment identical to ours should solve your problem.
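
A quick way to confirm the environment matches before training (plain PyTorch introspection, nothing repo-specific):

```python
import torch

# We used PyTorch 1.4.0; print what the current environment actually provides.
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```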

NLPAlchemist commented 10 months ago

Is it possible that the self-distillation module is the problem? When I removed that module, the model ran normally, without the earlier gradient explosion or the corresponding batch_size issue.

shenyujie1125 commented 10 months ago

@NLPAlchemist You have the same problem as me. Since I only have a 3090 graphics card, I can't install PyTorch 1.4 because of the CUDA version, though I didn't think the torch version would have much impact. I just debugged the code and found that there is indeed a gradient explosion problem, which makes the output of the model all NaN. [screenshot] These are the parameters: [screenshot] I'm going to use my Windows computer to see if it's a PyTorch version issue.
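
If exploding gradients are the culprit, one common workaround (a generic sketch, not the authors' fix; `loss`, `model`, and `optimizer` come from whatever training loop is in use) is to clip the gradient norm before the optimizer step:

```python
import math
import torch

def clipped_step(loss, model, optimizer, max_norm=5.0):
    # Backward + optimizer step with gradient-norm clipping; skip the update
    # entirely if the gradients have already blown up.
    optimizer.zero_grad()
    loss.backward()
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))
    if math.isfinite(grad_norm):
        optimizer.step()
    else:
        print("skipping update: non-finite gradient norm", grad_norm)
    return grad_norm
```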

NLPAlchemist commented 10 months ago

OK, if it is the PyTorch version, please leave a reply. Thank you.

butterfliesss commented 10 months ago

I am confident that there is no problem with using our environment, as I was able to reproduce the code last night after reconfiguring the environment. Additionally, introducing self-distillation indeed causes some degree of instability in model training.
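
For reference, a frequent source of NaN in distillation-style losses is taking the log of a softmax that has underflowed; below is a minimal sketch of a KL soft-label loss written with `log_softmax` (an illustration of the general technique, not the SDT implementation):

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Work in log-space with log_softmax to avoid log(0) -> -inf -> NaN,
    # which softmax().log() can produce for very confident predictions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```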

shenyujie1125 commented 10 months ago

@butterfliesss You're right. When I deployed the PyTorch 1.4.0 + CUDA 10.1 environment on my Windows laptop, the problem above did disappear, which shocked me; it turned out to be the torch version after all. I have never encountered this before. Thank you very much for your help; the problem is now solved. [screenshot] [screenshot]

shenyujie1125 commented 10 months ago

@NLPAlchemist The problem is solved.

1260629937 commented 9 months ago

How do I train? When I run bash exec iemocap.sh, the generated sdt iemocap.txt file only contains the parameters.