Named Entity Recognition - KeyError when using predict function on text with apostrophe

dipanjanS commented 4 years ago

Describe the bug

A trained NER model raises a KeyError when predict() is called on text that contains apostrophes.

To Reproduce

You can train the model as follows (the dataset is a standard one from Kaggle):

import pandas as pd

df = pd.read_csv('https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2008%20-%20Project%206%20-%20Build%20your%20NER%20Tagger/ner_dataset.csv.gz', compression='gzip', encoding='ISO-8859-1')

# Forward-fill so each row carries its sentence number (fillna(method='ffill') is deprecated in recent pandas)
df = df.ffill()

df['sentence_id'] = [item.split(':')[1].strip() for item in df['Sentence #'].values]
df['words'] = df['Word']
df['pos'] = df['POS']
df['labels'] = df['Tag']
df = df[['sentence_id', 'words', 'pos', 'labels']]

custom_labels = df.labels.unique().tolist()

from sklearn.model_selection import train_test_split
import numpy as np

dataset = df[['sentence_id', 'words', 'labels']]

X_train, X_test = train_test_split(dataset, test_size=0.25, random_state=42, shuffle=False)

import logging
from simpletransformers.ner import NERModel, NERArgs

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Configure the model
model_args = NERArgs()
model_args.train_batch_size = 16
model_args.evaluate_during_training = True

model = NERModel(
    "bert", "bert-base-cased", args=model_args, labels=custom_labels
)

X_training, X_eval = train_test_split(X_train, test_size=0.1, random_state=42, shuffle=False)

model.train_model(X_training, eval_data=X_eval)

result, model_outputs, preds_list = model.eval_model(X_eval)

# eval_model also works on the test split, even though that data contains apostrophes
result, model_outputs, preds_list = model.eval_model(X_test)

Screenshots

In case the screenshot is not clear, the same output is reproduced below (using the model trained above to predict):

# no error when text has no apostrophe
predictions = model.predict(["A U.S. Congressional investigation into Hurricane Katrina blames failures at all levels of government for the suffering and loss of life that resulted from last Augusts storm ."], split_on_space=True)

INFO:simpletransformers.ner.ner_model: Converting to features started.
100%
1/1 [00:00<00:00, 22.71it/s]

Running Prediction: 100%
1/1 [00:00<00:00, 24.67it/s]

# error with text having apostrophe
predictions = model.predict(["A U.S. Congressional investigation into Hurricane Katrina blames failures at all levels of government for the suffering and loss of life that resulted from last August's storm ."], split_on_space=True)
INFO:simpletransformers.ner.ner_model: Converting to features started.
100%
1/1 [00:00<00:00, 5.49it/s]

Running Prediction: 100%
1/1 [00:00<00:00, 19.82it/s]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-89-22279e4aa616> in <module>()
----> 1 predictions = model.predict(["A U.S. Congressional investigation into Hurricane Katrina blames failures at all levels of government for the suffering and loss of life that resulted from last August's storm ."], split_on_space=True)

/usr/local/lib/python3.6/dist-packages/simpletransformers/ner/ner_model.py in predict(self, to_predict, split_on_space)
    884                 if out_label_ids[i, j] != pad_token_label_id:
    885                     out_label_list[i].append(label_map[out_label_ids[i][j]])
--> 886                     preds_list[i].append(label_map[preds[i][j]])
    887 
    888         if split_on_space:

KeyError: 16
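
The failing line indexes label_map with a predicted class id, so a KeyError: 16 means the id 16 was not a key in that mapping. The following is a minimal sketch of that failure mode, not the library's code: it assumes label_map is a plain dict from integer ids to label strings, and the 16-entry label list, safe_lookup and missing_ids are hypothetical names used only for illustration.

```python
# Hypothetical label list with 16 entries (ids 0..15). In the issue, the real
# list comes from custom_labels; the size here is only an assumption chosen
# to show how id 16 can fall outside the mapping.
labels = [f"LABEL_{i}" for i in range(16)]
label_map = {i: label for i, label in enumerate(labels)}


def missing_ids(label_map, pred_ids):
    """Return the sorted predicted ids that have no entry in label_map."""
    return sorted({p for p in pred_ids if p not in label_map})


def safe_lookup(label_map, pred_id, fallback="O"):
    """Return the label for pred_id, or fallback when the id is unknown."""
    return label_map.get(pred_id, fallback)


# Hypothetical predicted ids; 16 is not a key in label_map.
preds = [0, 3, 16, 16, 2]

failed_key = None
try:
    decoded = [label_map[p] for p in preds]
except KeyError as exc:
    # exc.args[0] is the key that was not found
    failed_key = exc.args[0]

print("failed key:", failed_key)
print("ids without a label:", missing_ids(label_map, preds))
```

A check like missing_ids can help tell apart a model whose output layer has more classes than the label list (for example, a checkpoint trained with a different label set) from an input-specific problem.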

Desktop

OS: Linux (Google Colab)

Additional context

I am not sure, but the apostrophe in the text appears to cause the error, and only when using predict(); eval_model() works on data that contains apostrophes. Some help would be appreciated.
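
As a quick sanity check on the apostrophe hypothesis, the sketch below compares the two inputs under plain whitespace splitting. It assumes that split_on_space=True amounts to str.split() on each sentence, and it uses a shortened tail of the sentences from the report. Under that assumption both inputs yield the same number of words, so any difference would have to arise later, for example in subword tokenization.

```python
# Shortened versions of the two sentences from the report (assumption: only
# the tail is needed to show the difference).
with_apostrophe = "loss of life that resulted from last August's storm ."
without_apostrophe = "loss of life that resulted from last Augusts storm ."

# Assumed behaviour of split_on_space=True: plain whitespace splitting.
words_a = with_apostrophe.split()
words_b = without_apostrophe.split()

# Collect the positions where the two word lists disagree.
differing = [(a, b) for a, b in zip(words_a, words_b) if a != b]

print(len(words_a), len(words_b))
print(differing)
```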

dipanjanS commented 4 years ago

I am facing the same issue when a comma (,) is in the text. Not sure if I am doing something fundamentally wrong.

dipanjanS commented 4 years ago

Switched to RoBERTa for now and it works. I will check back on BERT later; the error may also have been caused by leftover model weights on Colab. All good for now.

ThilinaRajapakse / simpletransformers, issue #719