That is correct.
The network is initialized randomly. With each run, the random initialization is different and the network converges to a different minimum / convergence point. Each minimum achieves different scores on the dev and test sets.
Fixing the random initialization in order to get fixed (deterministic) scores is bad science and should be avoided. With a fixed random seed it will not be possible to compare two approaches or two configurations, as you cannot be sure whether one approach is actually better than the other or whether one approach simply had a better (luckier) random initialization.
What you should always do with neural networks is to train them multiple times, e.g. 10 times, and to average the scores (see the sketch below).
I discussed this issue in detail in my two publications.
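To make the averaging concrete, here is a minimal sketch (the F1 values are invented; in practice you would fill the list with the test scores of your own training runs):

```python
import numpy as np

# Test-set F1 scores of e.g. 10 training runs of the *same* configuration,
# each starting from a different random initialization (values invented here)
test_scores = [0.901, 0.893, 0.897, 0.905, 0.889,
               0.899, 0.902, 0.894, 0.898, 0.900]

# Report the mean and the spread instead of a single (lucky or unlucky) run
print("F1: %.4f +/- %.4f" % (np.mean(test_scores), np.std(test_scores)))
```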
Sorry, I think I was not clear enough. I was referring to prediction after training: once a model is trained, I use it to predict on my test set, and the test-set performance varies every time I run prediction with the same model. I didn't see this with the previous emnlp2017-bilstm-cnn-crf architecture.
I don't know if there is randomness in ELMo's BiLM initialization for generating test set embeddings, but even if I create the embeddings beforehand using Create_ELMo_Cache.py, there is still randomness.
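Just to make clear what I mean by precomputing: conceptually it is something like the following (a simplified sketch of the idea, not the actual Create_ELMo_Cache.py code; the file name and details are made up):

```python
import pickle
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # loads the default pre-trained ELMo model

# Compute the ELMo embeddings for every test sentence once ...
sentences = [["I", "like", "Berlin", "."],
             ["New", "York", "is", "a", "city", "."]]
cache = {tuple(tokens): elmo.embed_sentence(tokens) for tokens in sentences}

# ... and store them, so inference should only ever read from this cache
with open("elmo_cache.pkl", "wb") as f:
    pickle.dump(cache, f)
```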
Ah okay, got you.
See this readme from AllenNLP about "Notes on statefulness and non-determinism": https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#notes-on-statefulness-and-non-determinism
The implementation of AllenNLP for computing ELMo embeddings is sadly non-deterministic, i.e., each time you compute the embeddings, they can change slightly, and with that the performance of the system might change.
I have not yet evaluated how bad this is, i.e. what the side effects of the non-deterministic ELMo embeddings are (a quick check is sketched at the end of this comment).
How big are the differences in F1-score on your test data when you load & execute the same model multiple times? And how big is the test set?
I would be really interested to figure out what the consequences of the non-determinism of ELMo are.
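One quick way to check it would be to embed the identical sentence twice and compare the vectors, roughly like this (a sketch; I have not measured how large the differences typically are):

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()
tokens = ["This", "is", "a", "test", "sentence", "."]

# Embed exactly the same sentence twice with the same ElmoEmbedder instance
emb1 = elmo.embed_sentence(tokens)
emb2 = elmo.embed_sentence(tokens)

# With a fully deterministic computation this difference would be exactly 0
print("Max absolute difference:", np.abs(emb1 - emb2).max())
```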
2641 sentences which include 69831 tokens, and F1 varies within 2% (absolute).
But even if the computation of the ELMo embeddings is non-deterministic, why is there still randomness when I precompute the embeddings using Create_ELMo_Cache.py? (i.e. same model, same precomputed embeddings)
A 2% absolute difference is quite large; I didn't expect that.
I'm not sure where the non-determinism comes from. When all test sentences are in the ELMo cache, the embeddings should always be the same at inference time, i.e., there should be no room for non-determinism.
I will check this and come back to you, hopefully with an answer.
I found and fixed the bug. However, the fix only applies to newly trained models; existing models are still affected, i.e., their performance when re-evaluated will differ from the originally reported performance.
The issue was with the word embeddings. The EmbeddingLookup.py class loads the pre-trained word embeddings and adds a new special token 'UNKNOWN_TOKEN' to the vocabulary. For this token, a random word embedding is generated.
When you reload the model, a new random word embedding is generated for UNKNOWN_TOKEN, which can change the behavior of the model, especially when you have a lot of unknown tokens.
I changed the code so that the same (randomly initialized, but fixed) embedding is always generated for UNKNOWN_TOKEN (see the sketch below).
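The idea behind the fix is to derive the UNKNOWN_TOKEN vector from a fixed seed, so that every reload of the model produces the identical vector. A minimal sketch (not the exact code in EmbeddingLookup.py; dimension, seed and value range are just examples):

```python
import numpy as np

EMBEDDING_DIM = 300  # must match the dimensionality of the pre-trained embeddings

# A dedicated, seeded RandomState guarantees that the "random" vector for
# UNKNOWN_TOKEN is identical every time the embeddings are (re)loaded
rng = np.random.RandomState(seed=42)
unknown_vector = rng.uniform(-0.25, 0.25, EMBEDDING_DIM)
```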
Thanks!
Looks like each time I run it, the result F1 changes.