dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[BERT] BERT for Named Entity Recognition #593

Open bikestra opened 5 years ago

bikestra commented 5 years ago

Since gluon-nlp already has very good tools for BERT, and basic data processing for named entity recognition is ready from https://github.com/dmlc/gluon-nlp/pull/466, I wanted to build upon these and implement BERT for NER. I have started implementing this for a personal project, but please let me know if you have concerns or suggestions.

eric-haibin-lin commented 5 years ago

Thanks! That will be great. I am evaluating the BERT model I trained from scratch. Do you plan to reproduce the results reported by BERT on the CoNLL 2003 dataset? If so, that will be very helpful for evaluating my model on the NER task :)

bikestra commented 5 years ago

Yes, I will mainly aim to reproduce the BERT results on CoNLL 2003 English. I may try a couple of other datasets like OntoNotes or CoNLL 2002 Spanish if this goes well!

fierceX commented 5 years ago

I also implemented an NER project on top of the pre-trained model, with reference to https://github.com/kyzhouhzau/BERT-NER; the code was originally used for couplet generation. If there are any problems, we can discuss them together.

bikestra commented 5 years ago

I am reusing the BIO/BIOES data processing code (utils_func) from https://github.com/dmlc/gluon-nlp/pull/466 ; should we move this to gluonnlp.data? I guess code under /scripts is not supposed to be shared across scripts? Also, it is unclear whether this code should go under /scripts/bert or /scripts/named_entity_recognition. Any suggestions?

szha commented 5 years ago

@bikestra right. If you find the utility useful, feel free to propose it as a utility function in gluonnlp.data. Alternatively, there are already a number of CoNLL datasets in gluonnlp; maybe we can expose a base class for CoNLL formats too. In case you're wondering, we didn't offer CoNLL 2003 due to licensing.
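
For illustration, such a base class could look roughly like this (a sketch only; the class name and arguments are hypothetical, not the actual gluonnlp API):

class CoNLLFormatReader:
    """Hypothetical base reader for CoNLL-style files: one token per line,
    whitespace-separated columns, and blank lines separating sentences."""

    def __init__(self, filename, field_indices=None):
        self._sentences = self._read(filename, field_indices)

    @staticmethod
    def _read(filename, field_indices):
        sentences, current = [], []
        with open(filename, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:  # a blank line ends the current sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                cols = line.split()
                if field_indices is not None:
                    cols = [cols[i] for i in field_indices]  # keep only requested columns
                current.append(cols)
        if current:  # flush the last sentence if the file has no trailing blank line
            sentences.append(current)
        return sentences

    def __len__(self):
        return len(self._sentences)

    def __getitem__(self, idx):
        return self._sentences[idx]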

szha commented 5 years ago

@bikestra do you intend to introduce the hybridizable CRF too?

bikestra commented 5 years ago

@szha I was planning to start with a simple softmax loss and replicate the BERT paper, but it sounds interesting to try.

parry2403 commented 5 years ago

@bikestra @fierceX I was planning to start on the same task. Do you have an initial version of the code committed?

szha commented 5 years ago

@bikestra @fierceX @parry2403 feel free to open a feature branch for collaboration if that's helpful to you.

fierceX commented 5 years ago

@bikestra Can you commit an initial version?

bikestra commented 5 years ago

@fierceX @parry2403 I have an initial version of the code, but it is very rough and does not yet achieve good F1 scores. Let me do a minimal exploration and cleanup, and commit an initial version by tomorrow. Would that be OK?

parry2403 commented 5 years ago

@bikestra Sure. I have seen BERT forums where reproducing the CoNLL scores has been an issue, since the authors used document context.

bikestra commented 5 years ago

@fierceX were you able to reproduce the results from https://github.com/kyzhouhzau/BERT-NER and replicate them with gluon-nlp? I am getting F1 scores only around 89. I am using bert_12_768_12 instead of bert_24_1024_16, but according to the paper that should make only a small difference... I will share what I have tomorrow anyway, but I was curious about your experience.

fierceX commented 5 years ago

@bikestra I haven't measured an F1 score yet, so let's take a look at your implementation.

bikestra commented 5 years ago

I get an F1 score around 0.91 with smaller learning rates, so it might just be a hyperparameter tuning problem. I will explore a bit.

parry2403 commented 5 years ago

This might be close to the upper bound: based on a lot of discussion around other frameworks, it seems they got 92.2 with document context rather than sentence context.

bikestra commented 5 years ago

I tried a lot of tricks but could not improve beyond a test F1 of 91.7. The package @fierceX mentioned does much worse than this, since their dev F1 score (dev is much easier than test, usually about 3 points higher) is only 93: https://github.com/kyzhouhzau/BERT-NER/issues/2

Is anyone aware of a codebase that has successfully replicated the BERT paper's results on CoNLL 2003 English?

szha commented 5 years ago

The BERT base model's fine-tuning result on CoNLL 2003 is 92.4 test F1, so you're close now. They mentioned in the paper that they did this without using the surrounding context, and that scoring is based only on the first subtoken of each word, with all later subtokens omitted. Is that what you did too?
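
For concreteness, that alignment looks roughly like this (a sketch; wordpiece_tokenize and NULL_TAG are illustrative names, not the exact script code):

NULL_TAG = 'X'  # placeholder tag for non-first subtokens; masked out of loss and scoring

def align_labels(words, tags, wordpiece_tokenize):
    # assign each word's tag to its first subtoken only; later subtokens get
    # NULL_TAG and a 0 flag so they can be ignored in loss and prediction
    subtokens, subtoken_tags, flags = [], [], []
    for word, tag in zip(words, tags):
        pieces = wordpiece_tokenize(word)  # e.g. 'REOFFER' -> ['RE', '##OFF', '##ER']
        subtokens.extend(pieces)
        subtoken_tags.extend([tag] + [NULL_TAG] * (len(pieces) - 1))
        flags.extend([1] + [0] * (len(pieces) - 1))
    return subtokens, subtoken_tags, flags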

bikestra commented 5 years ago

@szha Yes, 91.7 is close, but I think 0.7 points is a meaningful gap, since non-contextual word embeddings can get pretty close to this. And yes, I am only predicting on the first subtoken of each word. I actually tried different strategies (predicting every subtoken with IOBES tags, then putting a CRF on top) but none of them helped.

@parry2403 kindly shared the link to the relevant conversation, which claims the BERT authors did use document context: https://github.com/allenai/allennlp/pull/2067#issuecomment-443961816
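
For anyone who wants to experiment with document context, the idea is roughly to merge consecutive sentences from the same document into one example, up to the model's token budget (a sketch only; labels would be concatenated the same way):

def concat_to_documents(sentences, max_len):
    # greedily merge consecutive sentences (lists of tokens) into
    # document-context examples of at most max_len tokens each;
    # a single sentence longer than max_len passes through as-is
    examples, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            examples.append(current)
            current = []
        current = current + sent
    if current:
        examples.append(current)
    return examples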

bikestra commented 5 years ago

With BERT large, I could get dev F1 96.0 and test F1 91.9. This would not be a state-of-the-art NER tagger (you can certainly do better with ELMo), and it takes tens of epochs, which is unusually slow, but it is better than non-contextual word embeddings, and I couldn't find anyone who got better than this with BERT. Would this still be a good addition to gluon-nlp? If so, I could do a clean-up and update the pull request.

szha commented 5 years ago

@bikestra yes, let's add it as is. Once it gets more eyes, maybe someone in the community who works on NER can offer some hints. I actually heard a similar story about BERT's sequence tagging performance from @jdchoi77. We discussed that it might be due to the BPE tokenization, because sequence tagging for Chinese, where the tokenization is usually already at the character level, seems to work well. If that's indeed the case, we could potentially use the subword sampling offered by sentencepiece to learn more robust subword representations, or use character-level embeddings similar to ELMo.
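
For reference, sentencepiece exposes that sampling directly; roughly (the model path here is a placeholder for a trained sentencepiece model):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm.model')  # placeholder path

# deterministic segmentation
print(sp.EncodeAsPieces('New York is large.'))

# subword regularization: sample among candidate segmentations so the model
# sees several subword decompositions of the same word during training
for _ in range(3):
    print(sp.SampleEncodeAsPieces('New York is large.', -1, 0.1))  # nbest_size=-1, alpha=0.1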

bikestra commented 5 years ago

Sure, I will do the cleanup and push. FWIW, I could get dev F1 96.1 and test F1 92.2, so this is comparable to the reported ELMo numbers. Of course, this is somewhat overfitting to the test F1 score...

jdchoi77 commented 5 years ago

@szha, @bikestra, the CoNLL'03 model is probably not the most useful one in practice. We should release one trained on OntoNotes instead, which gives more entity types and more diversity in genres.

bikestra commented 5 years ago

@jdchoi77 I agree OntoNotes is much more interesting than CoNLL 2003 English, and I can certainly provide numbers on it. But they wouldn't be useful for benchmarking purposes, since the ELMo/BERT papers and their implementations don't report OntoNotes results.

jdchoi77 commented 5 years ago

@bikestra For benchmarking purposes, CoNLL'03 is indeed better. We are going to provide scores on OntoNotes for POS, NER, dependency parsing, and semantic parsing, so hopefully people will use it for new benchmarks.

hankcs commented 5 years ago

> Sure, I will do the cleanup and push. FWIW, I could get dev F1 96.1 and test F1 92.2, so this is comparable to the reported ELMo numbers. Of course, this is somewhat overfitting to the test F1 score...

@bikestra Could you share your hyperparameter settings for test F1 92.2? I saw your PR; are the hyperparameters there the ones you used to produce 92.2? It's pretty close to 92.8. Actually, Facebook uses a smaller learning rate for the LM during fine-tuning; I'd like to try that on top of your work.
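
One way to try that in Gluon is per-parameter lr_mult; roughly (a sketch; net.bert and net.tag_classifier are hypothetical names for the pre-trained body and the task head):

# give the pre-trained BERT body a smaller effective learning rate than
# the freshly initialized tagging head (hypothetical attribute names)
for _, param in net.bert.collect_params().items():
    param.lr_mult = 0.1  # body trains at 10% of the base learning rate
for _, param in net.tag_classifier.collect_params().items():
    param.lr_mult = 1.0  # head uses the full base learning rate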

hankcs commented 5 years ago

The default hyperparameters result in quite a low score: dev F1 0.934, test F1 0.896.

bikestra commented 5 years ago

Hi @hankcs, sorry for the late reply. I refactored the code a bit, and after that I was only able to achieve 91.8 and couldn't reproduce 92.2, which is why my reply was delayed. I am playing around with hyperparameters but haven't had much success yet. I will need to do more debugging and navigate old commits to figure out why 92.2 is not reproducible. To make sure others can also take a look, I pushed the current state of the code, which achieves 91.8: check out the pull request or the master branch of https://github.com/bikestra/gluon-nlp . Below is the command I used to get the 91.8 test F1.

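# Assumption: DATA_PATH points at the directory containing the CoNLL 2003 train/dev/test files (it is not set in this snippet).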
GPU=0
OPTIMIZER=bertadam
BATCH=8
CONCAT_MAX_LEN=0
DROP=0.1
LR=1e-5
EPOCH=50
expname=lr${LR}_d${DROP}_b${BATCH}_o${OPTIMIZER}_reale${EPOCH}_c${CONCAT_MAX_LEN}
python3.6 ./train_bert_ner.py --train-path ${DATA_PATH}/train.txt --dev-path ${DATA_PATH}/dev.txt --test-path ${DATA_PATH}/test.txt --gpu ${GPU} --learning-rate ${LR} --dropout-prob ${DROP} --num-epochs ${EPOCH} --batch-size ${BATCH} --optimizer ${OPTIMIZER} --save-checkpoint-prefix ./models/${expname} --bert-model bert_24_1024_16 --seed 13531 |& tee log_${expname}.txt
bikestra commented 5 years ago

By running it for 100 epochs with the same command above, I was able to get a test F1 of 92.1. So I think this is a matter of hyperparameter optimization / random error, and the code itself is (mostly) correct.

ikuyamada commented 5 years ago

@bikestra I could not find the train_bert_ner.py used in the comment above. Is it the same as the train_ner.py contained in the repository?

szha commented 5 years ago

@ikuyamada this script will be included as part of the 0.7.0 release coming soon. The doc can be found at: http://gluon-nlp.mxnet.io/master/model_zoo/bert/index.html#bert-for-named-entity-recognition

If you'd like to try it out now, you can install the master version of GluonNLP by following the guide here: http://gluon-nlp.mxnet.io/master/install.html#install-from-master-branch

ikuyamada commented 5 years ago

@szha Thank you for your prompt reply! I will try it out!

zmd971202 commented 5 years ago

@bikestra Hi, could you tell me how you got 91.7 with bert-base and what tricks you used? I am trying to reproduce the bert-base result but can only get 91.3.

ghaddarAbs commented 4 years ago

In order to reproduce the CoNLL scores reported in the BERT paper (92.4 for bert-base and 92.8 for bert-large), one trick is to apply a truecaser to article titles (all-upper-case sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can be done simply with the following method.

# https://github.com/daltonfury42/truecase
# pip install truecase
import re

import truecase

# original tokens:
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    # (word, index) pairs for the purely alphabetic tokens
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase when every alphabetic token is upper case (i.e. a title line)
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization ...
        # skip this sentence if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw

    return tokens

# truecased tokens:
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']

Also, I found it useful to use a very small learning rate (5e-6), a large batch size (128), and a high number of epochs (>40).

With these configurations and preprocessing, I was able to reach 92.8 with bert-large.