Closed · ZanSara closed this 9 months ago
Hi @ZanSara, I know this is a fairly old issue, but I'm currently working on expanding the utility of the `EntityExtractor` in https://github.com/deepset-ai/haystack/issues/2969, and I think I have figured out how to mitigate some of the issues you identified above.
For example, Hugging Face provides an `aggregation_strategy` option for TokenClassification pipelines that prevents entities from being returned as sub-word tokens: e.g. `"entities": ["Theo", "##n Greyjoy"]` becomes `Theon Greyjoy` if we use `aggregation_strategy="first"`. I'll add this as an option to `EntityExtractor`, and I'm considering making the strategy `"first"` the default value.
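To illustrate what that aggregation does, here is a minimal, self-contained sketch in plain Python (not the actual Hugging Face implementation) of merging WordPiece sub-token predictions under a `"first"`-style strategy, where the label of the first sub-token is kept for the whole word:

```python
def merge_wordpieces(tokens, labels):
    """Merge WordPiece sub-tokens ("##..." pieces) back into whole words,
    keeping the label of each word's first sub-token ("first" strategy)."""
    entities = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and entities:
            # Continuation piece: glue it onto the previous word, keep that word's label.
            prev_word, prev_label = entities[-1]
            entities[-1] = (prev_word + token[2:], prev_label)
        else:
            # Start of a new word.
            entities.append((token, label))
    return entities

# The split entity from the issue: "Theon" tokenized as "Theo" + "##n".
tokens = ["Theo", "##n", "Greyjoy"]
labels = ["B-PER", "I-PER", "I-PER"]
print(merge_wordpieces(tokens, labels))
# → [('Theon', 'B-PER'), ('Greyjoy', 'I-PER')]
```

In the actual `transformers` pipeline this corresponds to something like `pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="first")`; note that the real pipeline additionally groups consecutive `B-`/`I-` tagged words into a single entity span ("Theon Greyjoy"), which this sketch does not do.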
Closing since the original issue is solved by setting an aggregation strategy to aggregate token predictions.
Currently, the `EntityExtractor` uses BERT by default (`dslim/bert-base-NER`). However, without fine-tuning, this model is very likely to encounter named entities that are out of its vocabulary. Out-of-vocabulary words pose an issue because they are split into sub-word pieces or single letters and are therefore hardly recognized as entities down the line.

Example output of an extractive QA pipeline on the GoT dataset, where you can see several failure modes:

The model is the default, `dslim/bert-base-NER`.

In addition, it seems the `EntityExtractor` can't use spaCy models. I tried to use `spacy/en_core_web_sm` but I faced this issue: