facebookresearch / GENRE

Autoregressive Entity Retrieval

End-to-end EL: Mentions in beginning of text not recognized #74

Closed hertelm closed 2 years ago

hertelm commented 2 years ago

I follow the example for end-to-end entity linking with constraints on mentions and candidates.

from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_fairseq as get_prefix_allowed_tokens_fn
from genre.fairseq_model import GENRE
from genre.trie import Trie

if __name__ == "__main__":
    model = GENRE.from_pretrained("models/fairseq_e2e_entity_linking_aidayago").eval()

    while True:
        text = input("> ")
        sentences = [text]

        prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(
            model,
            sentences,
            mention_trie=Trie([
                model.encode(" {}".format(e))[1:].tolist()
                for e in ["Einstein", "Nobel Prize"]
            ]),
            mention_to_candidates_dict={
                "Einstein": ["Albert Einstein", "Einstein (surname)"],
                "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
            }
        )

        result = model.sample(
            sentences,
            prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
        )

        for beam in result[0]:
            print(beam)

For the text "In 1921, Einstein received a Nobel Prize." I get the expected output:

> In 1921, Einstein received a Nobel Prize.
{'text': 'In 1921, { Einstein } [ Albert Einstein ] received a { Nobel Prize } [ Nobel Prize in Physics ].', 'score': tensor(-0.8925)}
{'text': 'In 1921, { Einstein } [ Einstein (surname) ] received a { Nobel Prize } [ Nobel Prize in Physics ].', 'score': tensor(-1.3275)}
{'text': 'In 1921, { Einstein } [ Albert Einstein ] received a Nobel Prize.', 'score': tensor(-1.4009)}
{'text': 'In 1921, Einstein received a { Nobel Prize } [ Nobel Prize in Physics ].', 'score': tensor(-1.8266)}
{'text': 'In 1921, Einstein received a Nobel Prize.', 'score': tensor(-3.4495)}

When the text is "Einstein received a Nobel Prize in 1921." or "Nobel Prize was given to Einstein in 1921.", the prediction takes very long (multiple minutes on a GPU).

The results are:

> Einstein received a Nobel Prize in 1921.
{'text': 'Einstein received a { Nobel Prize } [ Nobel Prize in Physics ] in 1921.', 'score': tensor(-0.7218)}
{'text': 'Einstein received a { Nobel Prize } [ Nobel Prize in Medicine ] in 1921.', 'score': tensor(-1.1903)}
{'text': 'Einstein received a Nobel Prize in 1921.', 'score': tensor(-2.1739)}
> Nobel Prize was given to Einstein in 1921.
{'text': 'Nobel Prize was given to { Einstein } [ Albert Einstein ] in 1921.', 'score': tensor(-0.8092)}
{'text': 'Nobel Prize was given to { Einstein } [ Einstein (surname) ] in 1921.', 'score': tensor(-1.5115)}
{'text': 'Nobel Prize was given to Einstein in 1921.', 'score': tensor(-2.1765)}

I observe that mentions at the beginning of the text are not recognized.

This can be solved by generating the mentions trie as follows, including the mention without preceding space:

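mentions = ["Einstein", "Nobel Prize"]
mention_trie = Trie()
for e in mentions:
    # Variant with a preceding space (mention inside the text).
    mention_trie.add(model.encode(" {}".format(e))[1:].tolist())
    # Variant without a preceding space (mention at the very start of the text).
    mention_trie.add(model.encode("{}".format(e))[1:].tolist())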
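Why the second variant matters can be illustrated with a toy example. The sketch below does not use GENRE's Trie or tokenizer: `TOY_VOCAB`, `toy_encode`, and `build_trie` are hypothetical stand-ins. It assumes only that, as in BPE vocabularies of the GPT-2/BART kind, a word gets a different token id depending on whether a space precedes it.

```python
# Toy illustration only: this is NOT GENRE's Trie or tokenizer.
# Assumption: a word with a preceding space and the same word without one
# map to different token ids.
TOY_VOCAB = {"Einstein": 1, " Einstein": 2, "Nobel": 3, " Nobel": 4, " Prize": 5}


def toy_encode(text):
    """Greedy longest-match tokenization over TOY_VOCAB (hypothetical)."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (piece for piece in TOY_VOCAB if text.startswith(piece, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError("no token for text at position {}".format(i))
        tokens.append(TOY_VOCAB[match])
        i += len(match)
    return tokens


def build_trie(sequences):
    """Nested-dict trie: each key is a token id, each value the sub-trie."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


mentions = ["Einstein", "Nobel Prize"]

# Only the space-prefixed variants, as in the original example.
trie_space_only = build_trie(toy_encode(" " + m) for m in mentions)
# Both variants, as in the modified trie construction.
trie_both = build_trie(
    toy_encode(prefix + m) for m in mentions for prefix in (" ", "")
)

# A mention at the very start of the text has no preceding space.
first_token = toy_encode("Einstein")[0]
print(first_token in trie_space_only)  # False: no mention can start here
print(first_token in trie_both)        # True
```

In this toy setting, the space-only trie has no branch for the first token of the text, so a constrained decoder would never be allowed to open a mention there.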

Now the mentions at the beginning are recognized (see beams 3 and 5 in the first output and beams 2 and 3 in the second output), although the spacing around the opening curly bracket looks wrong:

> Einstein received a Nobel Prize in 1921.
{'text': 'Einstein received a { Nobel Prize } [ Nobel Prize in Physics ] in 1921.', 'score': tensor(-0.7218)}
{'text': 'Einstein received a { Nobel Prize } [ Nobel Prize in Medicine ] in 1921.', 'score': tensor(-1.1903)}
{'text': ' {Einstein } [ Albert Einstein ] received a { Nobel Prize } [ Nobel Prize in Physics ] in 1921.', 'score': tensor(-1.4433)}
{'text': 'Einstein received a Nobel Prize in 1921.', 'score': tensor(-2.1739)}
{'text': ' {Einstein } [ Albert Einstein ] received a Nobel Prize in 1921.', 'score': tensor(-2.2591)}
> Nobel Prize was given to Einstein in 1921.
{'text': 'Nobel Prize was given to { Einstein } [ Albert Einstein ] in 1921.', 'score': tensor(-0.8092)}
{'text': ' {Nobel Prize } [ Nobel Prize in Physics ] was given to { Einstein } [ Albert Einstein ] in 1921.', 'score': tensor(-1.0578)}
{'text': ' {Nobel Prize } [ Nobel Prize in Medicine ] was given to { Einstein } [ Albert Einstein ] in 1921.', 'score': tensor(-1.3356)}
{'text': 'Nobel Prize was given to { Einstein } [ Einstein (surname) ] in 1921.', 'score': tensor(-1.5115)}
{'text': 'Nobel Prize was given to Einstein in 1921.', 'score': tensor(-2.1765)}

However, in both examples the model favors a beam where the mention at the beginning is not linked.
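To compare beams programmatically rather than by eye, the annotated text can be parsed back into (mention, entity) pairs. This is a minimal sketch, not part of GENRE: `LINK_PATTERN`, `extract_links`, and `strip_annotations` are hypothetical helpers, and the pattern assumes that neither mentions nor entity names contain curly or square brackets. It tolerates the irregular spacing seen in the ' {Einstein }' beams.

```python
import re

# Matches "{ mention } [ entity ]" and tolerates missing or extra spaces
# around the brackets, e.g. " {Einstein } [ Albert Einstein ]".
LINK_PATTERN = re.compile(r"\{\s*(.+?)\s*\}\s*\[\s*(.+?)\s*\]")


def extract_links(beam_text):
    """Return (mention, entity) pairs found in one beam's 'text' field."""
    return LINK_PATTERN.findall(beam_text)


def strip_annotations(beam_text):
    """Remove the markup to recover the plain text (whitespace normalized)."""
    plain = LINK_PATTERN.sub(lambda m: m.group(1), beam_text)
    return " ".join(plain.split())


beam = (
    " {Einstein } [ Albert Einstein ] received a "
    "{ Nobel Prize } [ Nobel Prize in Physics ] in 1921."
)
print(extract_links(beam))
print(strip_annotations(beam))
```

Checking whether the first pair's mention starts the stripped text then shows, for each beam, whether the text-initial mention was linked.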

Would you recommend including mentions without a preceding space, or will the model never link them at the beginning of a text anyway? It could be that a bias in the training data prevents linking mentions at the beginning: the beginning of a Wikipedia abstract is usually the name of the article's entity, which is never a hyperlink to another article.

How did you deal with this in the experiments for the paper?

Thanks for letting me know + best regards, Matthias

nicola-decao commented 2 years ago

Hi,

I used mention_trie.add(model.encode(" {}".format(e))[1:].tolist()) to create the trie. But then I also appended a white space to every input sentence so that the model will always allow the beginning of a sentence or paragraph to be a mention.
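A minimal sketch of that preprocessing step follows. It rests on an assumption: for the first word to be encoded with a preceding space, the whitespace has to come before the text, so the helper adds a leading space. `add_leading_space` is a hypothetical name, not a GENRE function.

```python
def add_leading_space(sentences):
    """Prefix each sentence with one space unless it already starts with whitespace."""
    return [s if s[:1].isspace() else " " + s for s in sentences]


sentences = add_leading_space(["Einstein received a Nobel Prize in 1921."])
print(sentences)

# The padded sentences would then be passed both to the constraint function
# and to model.sample, as in the script from the question:
#   prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(model, sentences, ...)
#   result = model.sample(sentences, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
```

With this approach the trie can keep only the space-prefixed mention variants, since a mention at the start of the text is then tokenized the same way as one in the middle.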