UKPLab / emnlp2017-bilstm-cnn-crf

BiLSTM-CNN-CRF architecture for sequence tagging
Apache License 2.0

Performance difference #19

Open sa-j opened 6 years ago

sa-j commented 6 years ago

Hi there,

Thank you for uploading your implementation of the NER tagger! Can you please tell me with which settings it is possible to replicate the performance of glample's NER tagger on the German CoNLL data while using the original embeddings? In 100 epochs, the highest value I get is around 71% (with the Theano backend, BiLSTM-CNN-CRF v1.2.2) or 70% (with the TensorFlow 1.8 backend, BiLSTM-CNN-CRF v2.2.0) while using the original configuration

params = {'classifier': 'CRF', 'LSTM-Size': [64], 'dropout': (0.5), 'charEmbeddings': 'LSTM', 'charEmbeddingsSize': 30, 'maxCharLength': 50, 'optimizer': 'sgd', 'earlyStopping': 30}

and additionally using IOB tagging. Do you know how to solve this issue?

Thanks!

nreimers commented 6 years ago

SGD is extremely sensitive to the learning rate, and you must carefully select / tune it to achieve good performance.

Adam is in most cases a much better optimizer and it is also much less dependent on the setting of its hyperparameters.

With the settings in Train_NER_German.py you achieve a performance on CoNLL 2003 German of about Dev: 81.15%, Test: 77.70%.
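For reference, switching the optimizer to Adam in the same parameter style looks roughly like the sketch below. It uses only the parameter keys already shown above; the concrete values are illustrative assumptions, not the verbatim contents of Train_NER_German.py.

```python
# Illustrative only -- the values are plausible assumptions, not the
# repository's exact defaults from Train_NER_German.py.
params = {'classifier': 'CRF',
          'LSTM-Size': [100, 100],   # assumed sizes
          'dropout': (0.25, 0.25),   # assumed dropout
          'charEmbeddings': 'LSTM',
          'charEmbeddingsSize': 30,
          'maxCharLength': 50,
          'optimizer': 'adam',       # the key change: Adam instead of SGD
          'earlyStopping': 5}        # assumed patience
```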

sa-j commented 6 years ago

I have tried the settings in Train_NER_German.py (apart from 'LSTM-Size': [64], due to the original embeddings), but I can't reach the performance you just reported. I get at most 0.7837 on dev and 0.7999 on test. What could be the reason for that?

nreimers commented 6 years ago

0.799 F1 score on the test set sounds really good.

But as noted / explained in my papers: https://arxiv.org/abs/1707.09861 https://arxiv.org/abs/1803.09578

The performance of deep learning models can vary heavily depending on the random seed. Some random initializations lead to good models, others to worse-performing models.

For the CoNLL 2003 NER-De dataset, the difference in test score between runs can be as large as 3.36 percentage points F1, solely from choosing a different random seed.

This issue is not specific to the implementation here; it affects any deep learning system: some minima generalize well to new data, while others do not.

sa-j commented 6 years ago

I have read your paper and I definitely see the point. But how come the NER tagger (Lample et al.) can reach an overall performance of 80.5% on CoNLL, while your model does not reach that value even after many runs? There might be a difference between the two implementations, even if a minor one. What do you think?

nreimers commented 6 years ago

Hi @sa-j The best performance Lample et al. report in their paper is 78.76% for CoNLL 2003 NER-De (test set), while you achieve 79.99% F1-score (test set). The pretrained model achieved 77.70%, but I ran it with only one random seed, so it might be a good or a bad random seed.

There are several differences between the code here and the code of Lample et al.:

1) Pretrained embeddings: I use pretrained embeddings I created for GermEval with the skip-gram model from word2vec, while Lample used skip-n-gram embeddings, which take word order into account. Different pretrained embeddings can make a difference of several percentage points F1-score.

2) Optimizer: Lample's code uses SGD, an optimizer that is much harder to use because the learning rate must be selected carefully. Training is also much slower with SGD. I recommend Adam, which is easier to use and faster; however, it often finds minima that are a bit worse than those found by SGD. A good combination can be to start with Adam and then use SGD for the final epochs (see the sketch after this list).

3) Hyperparameters: The parameters in Train_NER_German.py were not really optimized; I just used some sensible values without aiming to squeeze out the last 0.x percentage points.

4) Different versions of the dataset: Due to copyright issues, getting the CoNLL 2003 NER dataset is not easy, and many people use some unofficial copy for their experiments. I'm aware of at least two versions: one with the original annotations and one with updated annotations (where it is not clear where this version came from). So when comparing results on CoNLL 2003 NER-De, it is important to check which version was used and whether the versions are really comparable.

5) Different evaluation methods: Classifiers can produce invalid BIO tags, for example an I-PER tag without a preceding B-PER tag. Some evaluation methods (for example the CoNLL 2000 evaluation script) tolerate these invalid BIO tags, so if the correct tag is B-PER and the predicted tag is I-PER, it still counts as a match. Other implementations of the F1 measure require a valid BIO encoding; there, the I-PER might be converted to an O tag and would then not match the gold B-PER tag. Sometimes invalid BIO encodings lead to unpredictable behavior in evaluation scripts, for example with the sequence B-PER I-LOC I-LOC: what should the valid BIO encoding for this sequence be?
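Regarding the optimizer combination in point 2, a minimal, self-contained sketch of the "Adam first, SGD for the final epochs" idea could look like the following. This is not code from this repository; the toy model, data, learning rates, and epoch counts are all placeholder assumptions.

```python
# Sketch: train with Adam first, then fine-tune with plain SGD for the last
# epochs. Everything here (model, data, learning rates, epochs) is a
# placeholder, not the repository's training code.
import numpy as np
from tensorflow.keras import Sequential, layers
from tensorflow.keras.optimizers import Adam, SGD

x = np.random.rand(256, 10).astype('float32')
y = np.random.randint(0, 2, size=(256, 1))

model = Sequential([layers.Input(shape=(10,)),
                    layers.Dense(16, activation='relu'),
                    layers.Dense(1, activation='sigmoid')])

# Phase 1: Adam converges quickly and is relatively insensitive to the
# learning rate choice.
model.compile(optimizer=Adam(learning_rate=1e-3), loss='binary_crossentropy')
model.fit(x, y, epochs=20, verbose=0)

# Phase 2: recompile with plain SGD and a small, hand-picked learning rate for
# a few final epochs; the learned weights are kept, only the optimizer state
# is reset.
model.compile(optimizer=SGD(learning_rate=0.01), loss='binary_crossentropy')
model.fit(x, y, epochs=5, verbose=0)
```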

sa-j commented 6 years ago

Thank you for your detailed answer! I see the importance of all the points, especially point 5. As far as I can tell from your code, you do not use the CoNLL 2000 evaluation script directly. That could be part of the reason, just as you explained.

nreimers commented 6 years ago

That's correct, I don't use the CoNLL 2000 eval script, as it is rather slow and would also require Perl to be installed.

I tested my implementation, and it produced the same results as the CoNLL 2000 eval script when the BIO encoding is valid.

For invalid BIO encodings, the provided code uses two post-editing strategies to ensure a valid encoding: set invalid tags to O (i.e. O I-PER => O O) or set invalid tags to B (i.e. O I-PER => O B-PER).
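As an illustration of what these two strategies do to the entity spans that are ultimately compared during evaluation, here is a small self-contained sketch. It is not the repository's actual code; the function name and its repair argument are made up for this example.

```python
def extract_spans(tags, repair):
    """Extract (start, end, type) entity spans from a BIO tag sequence.
    repair='B': treat an invalid I-X (no open X chunk) as if it were B-X.
    repair='O': treat an invalid I-X as O, i.e. drop it.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ['O']):       # sentinel 'O' closes open chunks
        prefix, _, label = tag.partition('-')
        inside = prefix == 'I' and start is not None and label == etype
        if start is not None and not inside:           # close the currently open chunk
            spans.append((start, i, etype))
            start, etype = None, None
        if prefix == 'B':
            start, etype = i, label
        elif prefix == 'I' and not inside and repair == 'B':
            start, etype = i, label                    # invalid I-X starts a new chunk
        # with repair == 'O', the invalid I-X is simply ignored
    return set(spans)

gold = ['B-PER', 'I-PER', 'O']
pred = ['I-PER', 'I-PER', 'O']            # prediction starts with an invalid I-PER

print(extract_spans(gold, repair='B'))    # {(0, 2, 'PER')}
print(extract_spans(pred, repair='B'))    # {(0, 2, 'PER')}  -> counted as a match
print(extract_spans(pred, repair='O'))    # set()            -> counted as an error
```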

I'm not sure how the CoNLL 2000 script deals with invalid encodings.

Zawan-uts commented 5 years ago

Which evaluation script do you use in this and the ELMo implementation? Does that script ignore invalid tags?

nreimers commented 5 years ago

If all tags are valid, then the evaluation script (implemented in Python) produces the same scores as the CoNLL 2003 perl script.

In the experiments, there are two methods for dealing with invalid BIO tags: set them to O (I-PER I-PER => O O) or start a new tag with B (I-PER I-PER => B-PER I-PER). Both methods ensure that there are no invalid tags.

If invalid tags are passed to the evaluation script (without the described fix), they are considered an error. This is different from the CoNLL 2003 perl script, where invalid tags are not counted as an error.