Closed · kermitt2 closed this 6 years ago
Thanks a lot for this work and for making it available!

I used ELMo contextualized embeddings in my Keras framework (DeLFT) and could reproduce the excellent results on the CoNLL 2003 NER task, actually slightly better than what you reported in your NAACL 2018 paper (92.47 averaged over 10 training runs, using the 5.5B ELMo model, warm-up, and concatenation with GloVe embeddings in a Lample 2016 BiLSTM-CRF architecture).

However, when using ELMo embeddings on the Ontonotes CoNLL-2012 NER dataset, I see a large drop of 5.0 f-score points compared to GloVe only. The drop is the same whether I use ELMo only or ELMo concatenated with GloVe.

Here is the evaluation with GloVe, without ELMo:

And here are the results with ELMo:

I see that the drop always concerns the named-entity classes related in some way to numbers (ORDINAL -65, CARDINAL -58, QUANTITY -53, DATE -18, etc.), while recognition of all the other classes actually improves with ELMo.

Two questions:

1. What could cause this behavior (apart from an implementation error on my side)? Did you observe something similar?
2. Are you using any special normalization of numbers on the corpus before training the biLM? I am using the default tokenization of Ontonotes/CoNLL-2012; should I perhaps use a different tokenization?
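For reference, here is a minimal sketch of the kind of setup described above: computing ELMo representations for pre-tokenized sentences and concatenating them with GloVe vectors per token before a BiLSTM-CRF tagger. This is not DeLFT's actual code; it uses the AllenNLP `Elmo` module, the GloVe lookup is a hypothetical stand-in, and the 5.5B model URLs are the ones AllenNLP published at the time and may have moved since.

```python
# Sketch (not DeLFT's actual code): ELMo representations for pre-tokenized
# sentences, concatenated with GloVe vectors, as input to a BiLSTM-CRF tagger.
import numpy as np
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# 5.5B model files as published by AllenNLP; adjust the URLs if they have moved.
OPTIONS = ("https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/"
           "2x4096_512_2048cnn_2xhighway_5.5B/"
           "elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json")
WEIGHTS = ("https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/"
           "2x4096_512_2048cnn_2xhighway_5.5B/"
           "elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5")

elmo = Elmo(OPTIONS, WEIGHTS, num_output_representations=1, dropout=0.0)

# Tokens are passed to ELMo verbatim as strings; its character CNN sees the
# raw surface forms, including digits ("3rd", "2012", ...).
sentences = [["He", "finished", "3rd", "in", "the", "race", "in", "2012", "."]]
character_ids = batch_to_ids(sentences)
elmo_out = elmo(character_ids)["elmo_representations"][0]  # (1, seq_len, 1024)

# Hypothetical GloVe lookup: replace with a real embedding table.
glove = {}  # token -> 300-d vector

def glove_vector(token):
    return glove.get(token.lower(), np.zeros(300, dtype=np.float32))

glove_out = torch.tensor(
    np.stack([[glove_vector(t) for t in s] for s in sentences]))

# Concatenate per token: (1, seq_len, 1024 + 300), fed to the BiLSTM-CRF.
features = torch.cat([elmo_out, glove_out], dim=-1)
```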
Thanks for posting this with the excellent details! This is quite interesting. I haven't noticed any strange effects with numbers, but I haven't looked at them in detail. We don't do any special tokenization or normalization of numbers when training the model; they are treated the same as all other tokens. When using datasets like Ontonotes, we also just use the existing, provided tokenization.

Hi @kermitt2 -- just to follow up: we aren't able to reproduce these results on our end, and we see improved performance with ELMo for all entity types in this dataset (including ORDINAL, etc.). Perhaps it's something particular to how you handle numbers vs. strings in your pre-processing pipeline?

Hello @matt-peters, sorry for the late reply. The follow-up is super useful; I will revisit and double-check my pre-processing, given that the issue is evidently on my side. The original ELMo model gave me similar results. Many thanks!

Just to close the loop on this: we saw a 0.882 development-set F1 using the 5.5B ELMo model on this dataset (we haven't checked the test-set performance, but it should be similar).

Hey, can you tell me where I can download the pretrained ELMo NER Ontonotes model? Thanks.
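One concrete way the "numbers vs. strings" hypothesis above can manifest: a pipeline that normalizes digits for the GloVe lookup table (a common trick) and accidentally feeds those normalized tokens to ELMo as well, which would hurt exactly the number-like entity classes (ORDINAL, CARDINAL, QUANTITY, DATE). A hypothetical sanity check, with all function names invented for illustration:

```python
# Hypothetical sanity check: ELMo's character CNN must see raw surface forms,
# so digit normalization applied for a word-embedding lookup (e.g. GloVe)
# must not leak into the tokens handed to ELMo.
def normalize_for_glove(token):
    # Common lookup-table trick: map every digit to "0" ("2012" -> "0000").
    return "".join("0" if ch.isdigit() else ch for ch in token)

def check_elmo_inputs(raw_sentences, elmo_sentences):
    for raw_sent, elmo_sent in zip(raw_sentences, elmo_sentences):
        for raw_tok, elmo_tok in zip(raw_sent, elmo_sent):
            assert isinstance(elmo_tok, str), (
                "ELMo expects string tokens, got %r" % type(elmo_tok))
            assert elmo_tok == raw_tok, (
                "token altered before ELMo: %r -> %r" % (raw_tok, elmo_tok))

raw = [["He", "finished", "3rd", "on", "May", "21", ",", "2012", "."]]
buggy = [[normalize_for_glove(t) for t in s] for s in raw]  # wrong ELMo input
check_elmo_inputs(raw, buggy)  # AssertionError: token altered: '3rd' -> '0rd'
```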