deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

`EntityExtractor` can't deal well with out-of-vocabulary words #1706

Closed ZanSara closed 9 months ago

ZanSara commented 3 years ago

Currently, the EntityExtractor uses BERT by default (dslim/bert-base-NER). However, without fine-tuning, this model is very likely to encounter named entities that are outside its vocabulary. Out-of-vocabulary words are a problem because they are split into subword tokens or single characters, which are then hardly recognized as entities down the line.

Example output of an extractive QA pipeline on the GoT dataset, where you can see several failure modes:

[
    {
        // This is probably the model's fault, not recognizing "Snow" as part of the entity
        "answer": "Jon Snow",
        "entities": [
            "Jon"
        ]
    },
    {
        // This is also a predictable failure mode of the model, that splits the entity into two (arguably correct even)
        // However, "Brienne" became "Brien": the tokenizer split it
        "answer": "Brienne of Tarth",
        "entities": [
            "Brien",
            "Tarth"
        ]
    },
    {
        // This is the tokenizer splitting "Cersei" and then recognizing only half of it
        "answer": "Cersei",
        "entities": [
            "Ce"
        ]
    },
    {
        // Here "Theon" is split, while "Greyjoy" is not split consistently
        "answer": "Theon Greyjoy or House Greyjoy",
        "entities": [
            "Theo",
            "##n Greyjoy",
            "House",
            "Grey",
            "##joy"
        ]
    }
]

The model is the default, dslim/bert-base-NER.
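The splitting seen above comes from BERT's WordPiece tokenizer, which greedily matches the longest prefix found in its vocabulary and marks continuation pieces with `##`. A minimal sketch of that matching algorithm, using a toy made-up vocabulary (not the real dslim/bert-base-NER vocabulary), shows how "Brienne" ends up as "Brien" + "##ne":

```python
def wordpiece_tokenize(word, vocab):
    """Split a word into subword tokens via greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary piece matches at all
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary that contains "Brien" but not the full name "Brienne":
vocab = {"Brien", "##ne", "Theo", "##n", "Grey", "##joy", "Jon", "Snow"}
print(wordpiece_tokenize("Brienne", vocab))  # ['Brien', '##ne']
print(wordpiece_tokenize("Theon", vocab))    # ['Theo', '##n']
```

Because the NER head then labels each subword token independently, a downstream consumer that only keeps tokens tagged as entities can easily end up with fragments like "Brien" or "##n Greyjoy".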

In addition, it seems the EntityExtractor can't use spaCy models. I tried to use spacy/en_core_web_sm but I ran into this issue:

404 Client Error: Not Found for url: https://huggingface.co/spacy/en_core_web_sm/resolve/main/config.json
Traceback (most recent call last):
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 484, in get_config_dict
    resolved_config_file = cached_path(
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/transformers/file_utils.py", line 1329, in cached_path
    output_path = get_from_cache(
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/transformers/file_utils.py", line 1500, in get_from_cache
    r.raise_for_status()
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/spacy/en_core_web_sm/resolve/main/config.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sara/work/haystack/example.py", line 10, in <module>
    ner = EntityExtractor("spacy/en_core_web_sm")
  File "/home/sara/work/haystack/haystack/nodes/extractor/entity.py", line 25, in __init__
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 412, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 446, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/sara/work/haystack/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 504, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 'spacy/en_core_web_sm'. Make sure that:

- 'spacy/en_core_web_sm' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'spacy/en_core_web_sm' is the correct path to a directory containing a config.json file

sjrl commented 2 years ago

Hi @ZanSara, I know this is a fairly old issue, but I'm currently working on expanding the utility of the EntityExtractor in https://github.com/deepset-ai/haystack/issues/2969, and I think I have figured out how to mitigate some of the issues you identified above.

For example, HuggingFace provides an aggregation strategy option for TokenClassification pipelines that prevents entities from being returned as split subword tokens. For instance, "entities": ["Theo", "##n Greyjoy"] would become "Theon Greyjoy" if we use aggregation_strategy="first". I'll add this as an option to EntityExtractor, and I'm considering making "first" the default value.
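The core idea of the "first" strategy can be sketched in a few lines: continuation subwords (prefixed with `##`) are glued back onto the preceding word, and the merged word keeps the label predicted for its first subword. This is a simplified illustration with made-up token data, not HuggingFace's actual implementation (which additionally groups adjacent B-/I- tokens into entity spans):

```python
def aggregate_first(token_predictions):
    """token_predictions: list of (token, label) pairs from a NER model.

    Merge '##'-prefixed continuation subwords into the preceding word,
    keeping the label of the word's first subword ("first" strategy).
    """
    words = []
    for token, label in token_predictions:
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words

predictions = [("Theo", "B-PER"), ("##n", "I-PER"),
               ("Grey", "I-PER"), ("##joy", "I-PER")]
print(aggregate_first(predictions))
# [('Theon', 'B-PER'), ('Greyjoy', 'I-PER')]
```

With this kind of merging, the fragments from the GoT example above ("Theo", "##n Greyjoy") come back as whole words instead of subword pieces.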

sjrl commented 9 months ago

Closing since the original issue is solved by setting an aggregation strategy to aggregate token predictions.