allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Named Entity Recognition predicts too many "O" tags for some inputs #2487

Closed hanktopia closed 5 years ago

hanktopia commented 5 years ago

Describe the bug: Calling the predictor on input longer than a couple of sentences returns a tag of "O" for every word in the input. If only the first few sentences of the same sample are fed into the predictor, the tags identify entities as expected.

To Reproduce: The behavior can be seen on the demo page: https://demo.allennlp.org/named-entity-recognition

Copying this into the sentence field and hitting run shows no named entities: The Jayhawks' first coach was the inventor of the game of basketball, James Naismith. Naismith, ironically, is the only coach in Kansas basketball history with a losing record. The Kansas basketball program has produced many notable professional players, including Clyde Lovellette, Wilt Chamberlain, Jo Jo White, Danny Manning, Raef LaFrentz, Paul Pierce, Nick Collison, Kirk Hinrich, Mario Chalmers, Andrew Wiggins and Joel Embiid. Politician Bob Dole also played basketball at Kansas.[2] Former players that have gone on to be coaches include Phog Allen, Adolph Rupp, Dean Smith, Dutch Lonborg, and former assistants to go on to be notable coaches include John Calipari, Gregg Popovich, and Bill Self. Mark Turgeon, Jerod Haase, and Danny Manning are all former players and assistant coaches that became head coaches. Allen founded the National Association of Basketball Coaches and, with Lonborg, was an early proponent of the NCAA tournament.[3][4] Four different Jayhawk head coaches are in the Naismith Memorial Basketball Hall of Fame as coaches, Phog Allen, Larry Brown, Roy Williams, and current head coach Bill Self.

Copying this shortened version of the same text into the sentence field and hitting run shows named entities as expected: The Jayhawks' first coach was the inventor of the game of basketball, James Naismith. Naismith, ironically, is the only coach in Kansas basketball history with a losing record. The Kansas basketball program has produced many notable professional players, including Clyde Lovellette, Wilt Chamberlain, Jo Jo White, Danny Manning, Raef LaFrentz, Paul Pierce, Nick Collison, Kirk Hinrich, Mario Chalmers, Andrew Wiggins and Joel Embiid. Politician Bob Dole also played basketball at Kansas

I see the same behavior in my application, system details below

Expected behavior: Large blocks of text can be passed to the NER predictor and the entities are found. It shouldn't fail silently.
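One way to make this failure visible rather than silent is a small check on the returned tags. The following is only a sketch: the helper name `warn_if_no_entities` and the 50-token threshold are hypothetical choices for illustration, not part of AllenNLP.

```python
import warnings
from typing import List


def warn_if_no_entities(tags: List[str], min_tokens: int = 50) -> bool:
    """Warn and return True when a long input received only "O" tags.

    `min_tokens` is an arbitrary, hypothetical threshold: short inputs
    with no entities are common, long ones are more suspicious.
    """
    suspicious = len(tags) >= min_tokens and all(tag == "O" for tag in tags)
    if suspicious:
        warnings.warn(
            f"NER returned only 'O' tags for {len(tags)} tokens; "
            "this may be a silent failure."
        )
    return suspicious
```

The tag list would come from the predictor output; the check itself does not depend on AllenNLP.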

System: The application runs in a Docker container on Ubuntu. From the Dockerfile:

FROM python:3.6
# celery has an issue with 3.7 due to introduction of async keyword

RUN pip3 install numpy
RUN pip3 install https://download.pytorch.org/whl/cpu/torch-1.0.0-cp36-cp36m-linux_x86_64.whl
RUN pip3 install torchvision
RUN pip3 install allennlp

RUN pip3 install celery[redis]

python3 --version
Python 3.6.8

Python package versions:

pip3 list
Package                  Version    
------------------------ -----------
alabaster                0.7.12     
allennlp                 0.8.1      
amqp                     2.4.1      
asn1crypto               0.24.0     
atomicwrites             1.3.0      
attrs                    18.2.0     
aws-xray-sdk             0.95       
awscli                   1.16.96    
Babel                    2.6.0      
billiard                 3.5.0.5    
boto                     2.49.0     
boto3                    1.9.86     
botocore                 1.12.86    
celery                   4.2.1      
certifi                  2018.11.29 
cffi                     1.11.5     
chardet                  3.0.4      
Click                    7.0        
colorama                 0.3.9      
conllu                   0.11       
cookies                  2.2.1      
cryptography             2.5        
cycler                   0.10.0     
cymem                    2.0.2      
cytoolz                  0.9.0.1    
dill                     0.2.9      
docker                   3.7.0      
docker-pycreds           0.4.0      
docutils                 0.14       
ecdsa                    0.13       
editdistance             0.5.2      
en-core-web-sm           2.0.0      
flaky                    3.5.3      
Flask                    1.0.2      
Flask-Cors               3.0.7      
ftfy                     5.5.1      
future                   0.17.1     
gevent                   1.3.6      
greenlet                 0.4.15     
h5py                     2.9.0      
idna                     2.8        
imagesize                1.1.0      
itsdangerous             1.1.0      
Jinja2                   2.10       
jmespath                 0.9.3      
jsondiff                 1.1.1      
jsonnet                  0.10.0     
jsonpickle               1.1        
kiwisolver               1.0.1      
kombu                    4.2.2.post1
MarkupSafe               1.1.0      
matplotlib               2.2.3      
mock                     2.0.0      
more-itertools           5.0.0      
moto                     1.3.4      
msgpack                  0.5.6      
msgpack-numpy            0.4.3.2    
murmurhash               1.0.1      
nltk                     3.4        
numpy                    1.16.1     
numpydoc                 0.8.0      
overrides                1.9        
packaging                19.0       
parsimonious             0.8.0      
pbr                      5.1.2      
Pillow                   5.4.1      
pip                      18.1       
plac                     0.9.6      
pluggy                   0.8.1      
preshed                  2.0.1      
protobuf                 3.6.1      
py                       1.7.0      
pyaml                    18.11.0    
pyasn1                   0.4.5      
pycparser                2.19       
pycryptodome             3.7.3      
Pygments                 2.3.1      
pyparsing                2.3.1      
pytest                   4.2.0      
python-dateutil          2.7.5      
python-jose              2.0.2      
pytorch-pretrained-bert  0.3.0      
pytz                     2017.3     
PyYAML                   3.13       
redis                    3.1.0      
regex                    2018.1.10  
requests                 2.21.0     
responses                0.10.5     
rsa                      3.4.2      
s3transfer               0.1.13     
scikit-learn             0.20.2     
scipy                    1.2.0      
setuptools               40.6.3     
singledispatch           3.4.0.3    
six                      1.12.0     
snowballstemmer          1.2.1      
spacy                    2.0.18     
Sphinx                   1.8.4      
sphinxcontrib-websupport 1.1.0      
sqlparse                 0.2.4      
tensorboardX             1.2        
thinc                    6.12.1     
toolz                    0.9.0      
torch                    1.0.0      
torchvision              0.2.1      
tqdm                     4.30.0     
ujson                    1.35       
Unidecode                1.0.23     
urllib3                  1.24.1     
vine                     1.2.0      
wcwidth                  0.1.7      
websocket-client         0.54.0     
Werkzeug                 0.14.1     
wheel                    0.32.3     
wrapt                    1.11.1     
xmltodict                0.11.0  

AllenNLP initialization:

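from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

class AllenNlp:
    def __init__(self):
        # Load the pretrained fine-grained NER model and wrap it in a predictor.
        self.archive = load_archive('/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz')
        self.predictor = Predictor.from_archive(self.archive)

    def process(self, source):
        # Run NER over the raw input text and return the predictor's output.
        results = self.predictor.predict(sentence=source)
        return results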

Additional context: I'm actively developing and my Dockerfile doesn't pin version numbers, so I pull in and run new releases as they become available. Longer blocks of text worked a couple of days ago, but stopped working yesterday.

DeNeutoy commented 5 years ago

This is odd, because it isn't actually related to text length: if you take the first sentence and copy and paste it many times, the demo can still handle it. I wonder what's happening with that particular paragraph...

hanktopia commented 5 years ago

You're right, I assumed too much. If you remove the sentence starting with "Former players", it works more like you'd expect. So it's not a length problem, but it doesn't appear to be working correctly either. I don't have other definitive tests, but I'm seeing more missed entities than before overall.
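The manual test above (removing one sentence and re-running) can be automated as a leave-one-out search. This is a minimal sketch: `find_trigger_sentences` and `all_outside` are hypothetical helper names, and `predict_tags` is assumed to be any callable that maps a string to a list of tags.

```python
from typing import Callable, List


def all_outside(tags: List[str]) -> bool:
    """True when the tag list is non-empty and every tag is "O"."""
    return len(tags) > 0 and all(tag == "O" for tag in tags)


def find_trigger_sentences(
    sentences: List[str],
    predict_tags: Callable[[str], List[str]],
) -> List[int]:
    """Return indices of sentences whose removal stops the all-"O" output.

    Returns an empty list if the full text is not tagged all "O".
    """
    if not all_outside(predict_tags(" ".join(sentences))):
        return []
    triggers = []
    for i in range(len(sentences)):
        # Rebuild the text with sentence i left out.
        reduced = sentences[:i] + sentences[i + 1:]
        if not reduced:
            continue
        if not all_outside(predict_tags(" ".join(reduced))):
            triggers.append(i)
    return triggers
```

With the predictor from the issue, `predict_tags` could be something like `lambda text: predictor.predict(sentence=text)["tags"]`, assuming the output dictionary exposes a `"tags"` key. Sentence splitting is left to the caller.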