Closed tomasonjo closed 1 year ago
Well, since the approach is seq2seq, there may be situations where part of a word has a higher "decoding" probability in the beam search than the whole word. While this is not common, it can happen with very frequent subwords, for instance with country names. A sentence such as:
"Juan José Torres González (5 March 1920 – 2 June 1976) was a Bolivian socialist politician and military leader who served as the 50th president of Bolivia from 1970 to 1971, when he was ousted in a US-supported coup that resulted in the dictatorship of Hugo Banzer"
may produce:
{'relation': 'country', 'head_span': Juan José Torres González, 'tail_span': Bolivia}
Bolivia is probably decoded more frequently with the relation country, which may lead to such an "error". So using a regex as you suggest would leave the span for Bolivia unfound whenever the text only contains a longer form such as Bolivian. I guess this can be avoided in the else branch of your suggestion: if the regex finds no match, a fail-safe falls back to the current find() call. Do you think that would be useful?
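The fallback described above could look roughly like the following. This is only a sketch: `find_entity_start` is a hypothetical helper name, not the function used in the component, and the sentence is a shortened version of the example that contains only "Bolivian".

```python
import re


def find_entity_start(text: str, entity: str) -> int:
    """Return the character offset of `entity` in `text`, or -1 if absent.

    Hypothetical helper for illustration; not the component's actual code.
    """
    # First try a match that respects word boundaries. re.escape keeps
    # regex metacharacters in the entity string from being interpreted.
    match = re.search(r"\b" + re.escape(entity) + r"\b", text)
    if match:
        return match.start()
    # Fail-safe: plain substring search, which also matches inside
    # longer words such as "Bolivian".
    return text.find(entity)


# Shortened, hypothetical variant of the sentence above.
text = "Juan José Torres González was a Bolivian socialist politician."
print(find_entity_start(text, "Bolivia"))
```

With this ordering, a standalone occurrence of the entity is preferred when one exists, and the substring match is only used when nothing else is found.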
Hi,
I have a potential improvement for the spaCy component. I've seen that this was part of the last commit as well. In particular, I am referring to the following code:
The problem is that text.find() does not respect word boundaries, so in the example:
The extracted relationship is:
{'relation': 'manufacturer', 'head_span': Apple Inc, 'tail_span': Apple Inc}
If, instead of text.find(), you used a regex that respects word boundaries:
then you would get the correct output:
{'relation': 'manufacturer', 'head_span': Apple I, 'tail_span': Apple Inc}
As far as I understand, entities can never consist of parts of words? If that is the case, then maybe a regex with word boundaries is a better approach?
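To make the difference concrete, here is a minimal sketch comparing the two lookups. The sentence is made up for illustration (it is not the example from this post), and the variable names are arbitrary.

```python
import re

# Hypothetical sentence that mentions both "Apple Inc" and "Apple I".
text = "Apple Inc designed the Apple I in 1976."
entity = "Apple I"

# str.find returns the first substring hit, even inside a longer word.
naive_start = text.find(entity)

# \b anchors the match to word boundaries, so "Apple I" no longer
# matches the beginning of "Apple Inc".
match = re.search(r"\b" + re.escape(entity) + r"\b", text)
bounded_start = match.start() if match else -1

print(naive_start)    # 0, the start of "Apple Inc"
print(bounded_start)  # 23, the standalone "Apple I"
```

One caveat with this approach: `\b` only marks transitions between word and non-word characters, so entities that begin or end with punctuation may need separate handling.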