Babelscape / rebel

REBEL is a seq2seq model that simplifies Relation Extraction (EMNLP 2021).

SpaCy component mapping to docs #53

Closed tomasonjo closed 1 year ago

tomasonjo commented 1 year ago

Hi,

I have a potential improvement for the spaCy component. I've seen that this code was part of the last commit as well. In particular, I am referring to the following code:

head_index = text.find(triplet["head"].lower())
tail_index = text.find(triplet["tail"].lower())

The problem is that text.find() does not respect word boundaries, so in the example:

The Apple Inc 's first product was the Apple I , a computer designed and hand - built entirely by Steve Wozniak .

The extracted relationship is:

{'relation': 'manufacturer', 'head_span': Apple Inc, 'tail_span': Apple Inc}
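The mismatch can be reproduced without running the model. The following is a minimal sketch, assuming the decoded triplet has head "Apple I" and tail "Apple Inc" and that the text has already been lowercased, as in the snippet above:

```python
# Lowercased input sentence, as assumed for the component's `text` variable.
text = (
    "The Apple Inc 's first product was the Apple I , a computer designed "
    "and hand - built entirely by Steve Wozniak ."
).lower()

# Assumed triplet values for this example.
head = "Apple I".lower()
tail = "Apple Inc".lower()

# str.find() returns the first substring hit, even inside a longer word.
head_index = text.find(head)
tail_index = text.find(tail)

print(head_index, tail_index)  # 4 4
```

Both lookups land on the same offset because "apple i" is a prefix of "apple inc", so the head is mapped onto the tail's span.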

If, instead of text.find(), you used a regex that respects word boundaries:

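import re

# Match the head only at word boundaries; skip this triplet if it is not found.
# (`continue` assumes this runs inside the loop over triplets.)
head_match = re.search(r'\b' + re.escape(triplet["head"].lower()) + r'\b', text)
if head_match:
    head_index = head_match.start()
else:
    continue

# Same lookup for the tail.
tail_match = re.search(r'\b' + re.escape(triplet["tail"].lower()) + r'\b', text)
if tail_match:
    tail_index = tail_match.start()
else:
    continue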

Then you would get the correct output:

{'relation': 'manufacturer', 'head_span': Apple I, 'tail_span': Apple Inc}

As far as I understand, the entities can never consist of parts of words? If that is the case, then maybe a regex with word boundaries is a better approach?

LittlePea13 commented 1 year ago

Well, since the approach is seq2seq, there may be situations where a part of a word has a higher "decoding" probability in the beam search than the whole word. While this is not common, it may happen with very frequent subwords. For instance, with country names, a sentence such as

"Juan José Torres González (5 March 1920 – 2 June 1976) was a Bolivian socialist politician and military leader who served as the 50th president of Bolivia from 1970 to 1971, when he was ousted in a US-supported coup that resulted in the dictatorship of Hugo Banzer"

may produce:

{'relation': 'country', 'head_span': Juan José Torres González, 'tail_span': Bolivia}

Bolivia is probably decoded more frequently with the relation country, which may lead to such an "error". Using a regex as you suggest would then leave the span for Bolivia unfound whenever the text contains it only as part of a longer word, such as Bolivian. I guess this can be avoided in the else branch of your suggestion: if the regex finds no match, a fail-safe falls back to the current find() call. Do you think that would be useful?
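The fail-safe could look roughly like the sketch below. The helper name `find_span_start` is hypothetical and not part of the component; it assumes `text` is already lowercased and returns -1 when neither lookup succeeds, so the caller can still skip the triplet.

```python
import re


def find_span_start(text: str, mention: str) -> int:
    """Return the start offset of `mention` in the lowercased `text`.

    Hypothetical helper: prefers a whole-word match and falls back to a
    plain substring search, returning -1 if the mention is absent.
    """
    mention = mention.lower()
    # First try a match that respects word boundaries.
    match = re.search(r"\b" + re.escape(mention) + r"\b", text)
    if match:
        return match.start()
    # Fail-safe: the decoded mention may be only part of a word.
    return text.find(mention)


apple_text = "the apple inc 's first product was the apple i ."
print(find_span_start(apple_text, "Apple I"))               # 39
print(find_span_start("a bolivian politician", "Bolivia"))  # 2
print(find_span_start("a bolivian politician", "Peru"))     # -1
```

The whole-word match takes priority, so "Apple I" no longer collides with "Apple Inc", while a partial decoding such as "Bolivia" inside "bolivian" is still located by the fallback. The fallback also covers mentions that begin or end with punctuation, where `\b` may not match.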