allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

entity recognition doesn't recognize locations #461

Closed by maayansharon10 1 year ago

maayansharon10 commented 1 year ago

Hi, thank you for this wonderful library! I'm trying to use 'en_core_sci_lg' for a simple entity recognition task. I'm not sure whether I'm missing something in the setup or this is a bug, and would appreciate the help. Below is the output of an example from the spaCy documentation.

when trying this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

the result is -

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY

but when trying the same code with en_core_sci_lg -

import spacy

nlp = spacy.load('en_core_sci_lg')
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

the result is -

Apple 0 5 ENTITY
U.K. 27 31 ENTITY
startup 32 39 ENTITY

Working on Google Colab, I installed the following:

! pip install spacy
! pip install scispacy
! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz

Thank you!

dakinggg commented 1 year ago

The scispaCy models are for identifying biomedical entities, so it is expected that some general entities will not be captured by them. If you are looking to identify more general entities such as money amounts, you'll be better off with the base spaCy models.
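
If both kinds of entities are needed from the same text, one possible approach (not something the scispaCy maintainers prescribe in this thread) is to run both pipelines and merge their spans. The sketch below is only an illustration. `merge_entities`, `overlaps`, and `extract` are hypothetical helper names, and the rule "keep the general-purpose span when two spans overlap" is an arbitrary assumption that you may want to reverse for biomedical text. The sample tuples are copied from the two outputs posted above, so the merge step can be tried without loading any model.

```python
from typing import List, Tuple

# (text, start_char, end_char, label): the same fields printed in the issue.
Span = Tuple[str, int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the character ranges of two spans intersect."""
    return a[1] < b[2] and b[1] < a[2]


def merge_entities(general: List[Span], biomedical: List[Span]) -> List[Span]:
    """Keep every general-purpose span, then add each biomedical span that
    does not overlap any of them. The result is sorted by start offset.
    Preferring the general span on overlap is an assumption of this sketch."""
    merged = list(general)
    for span in biomedical:
        if not any(overlaps(span, kept) for kept in general):
            merged.append(span)
    return sorted(merged, key=lambda s: (s[1], s[2]))


def extract(model_name: str, text: str) -> List[Span]:
    """Run one spaCy pipeline and return its entities as plain tuples.
    spaCy is imported lazily so the merge helper works without it."""
    import spacy

    nlp = spacy.load(model_name)
    return [(e.text, e.start_char, e.end_char, e.label_) for e in nlp(text).ents]


# Entities as reported in the issue for each model.
general = [("Apple", 0, 5, "ORG"), ("U.K.", 27, 31, "GPE"),
           ("$1 billion", 44, 54, "MONEY")]
biomedical = [("Apple", 0, 5, "ENTITY"), ("U.K.", 27, 31, "ENTITY"),
              ("startup", 32, 39, "ENTITY")]

for span in merge_entities(general, biomedical):
    print(*span)
```

With both models installed, the hard-coded lists could be replaced by `extract("en_core_web_sm", text)` and `extract("en_core_sci_lg", text)`. Loading two pipelines costs extra memory and time, so this only makes sense when both label sets are really required.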