FangyunWei / SLRT


Fail to reproduce SingleStream-SLT #31

Closed. ZechengLi19 closed this issue 10 months ago.

ZechengLi19 commented 11 months ago

Thanks for your awesome work! I tried to reproduce SingleStream-SLT, but I found that running the code at different times resulted in different performance. I ran it (train.log) three times on the PHOENIX-2014T dataset and got three different results:

| Run | WER | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE |
|-----|-------|--------|--------|--------|--------|-------|
| 1   | 25.55 | 46.40  | 33.77  | 26.34  | 21.54  | 44.89 |
| 2   | 25.99 | 37.25  | 23.97  | 17.39  | 13.63  | 35.73 |
| 3   | 24.70 | 53.50  | 40.81  | 32.86  | 27.36  | 51.89 |

It looks like the set_seed function is not working as expected. Have you observed this before?

ChenYutongTHU commented 11 months ago

Hi, thanks. Yes, I did observe the randomness.

The reason might be that some torch operations, e.g. torch.nn.CTCLoss, involve non-deterministic behavior. Please refer to https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms.

The randomness leads to some jitter in performance. You can run the training multiple times and use the best result.
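For anyone hitting the same issue, here is a minimal sketch of what a stricter seeding routine could look like. This is an illustration rather than the repo's actual set_seed; it assumes PyTorch >= 1.11 (for the warn_only flag) and that CUBLAS_WORKSPACE_CONFIG is set before the first cuBLAS call:

```python
import os
import random

import numpy as np
import torch


def set_seed_strict(seed: int = 42) -> None:
    """Seed all RNGs and request deterministic kernels where available."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Required by cuBLAS when deterministic algorithms are enabled
    # (must be set before the first cuBLAS call).
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Prefer deterministic cuDNN kernels and disable auto-tuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # warn_only=True keeps training running even for ops that have no
    # deterministic implementation (e.g. the CUDA backward of CTCLoss),
    # which would otherwise raise a RuntimeError.
    torch.use_deterministic_algorithms(True, warn_only=True)
```

Note that the CUDA backward of torch.nn.CTCLoss has no deterministic implementation, so even with these settings some run-to-run jitter remains; this only narrows the gap.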

ZechengLi19 commented 11 months ago


Thanks for your quick reply! However, is it normal for the performance to differ so much each time?

ChenYutongTHU commented 10 months ago

Hello. I also observed the random fluctuation. This might be due to the relatively small size of the training data. In fact, other work using the PHOENIX dataset, e.g. Sign Language Transformers, has similar problems. The suggestion from the dataset creators is to run the model multiple times and choose the run with the highest validation score.
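A hypothetical sketch of that selection loop is below. The `train_and_evaluate` callable is a placeholder for whatever entry point the repo exposes, assumed here to return the dev-set BLEU-4 of the best checkpoint for a given seed:

```python
from typing import Callable, Iterable, Optional, Tuple


def select_best_run(
    train_and_evaluate: Callable[[int], float],  # placeholder entry point
    seeds: Iterable[int] = (0, 1, 2),
) -> Tuple[Optional[int], float]:
    """Train once per seed and keep the seed with the highest dev BLEU-4."""
    best_seed, best_bleu4 = None, float("-inf")
    for seed in seeds:
        dev_bleu4 = train_and_evaluate(seed)
        if dev_bleu4 > best_bleu4:
            best_seed, best_bleu4 = seed, dev_bleu4
    return best_seed, best_bleu4
```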


ZechengLi19 commented 10 months ago


Got it! Thanks for your reply. It is really helpful to me!