sammthomson closed this issue 11 years ago
We never noticed this because we were always working with pre-tokenized input. But if you use raw text, the character offsets won't match up.
We should probably use token offsets instead, because that's what Semafor works with. It would also help to include the span text in the output.
We could also keep a map from token_index to character_index in the original input.
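A minimal sketch of such a map, in Python for illustration only. The function name `token_char_offsets` is hypothetical and not part of Semafor, and the sketch assumes each token appears verbatim in the raw text; a tokenizer that rewrites characters (for example, normalizing quotes or brackets) would need extra handling.

```python
def token_char_offsets(text, tokens):
    """Map each token index to its (start, end) character offsets in text.

    Hypothetical helper: assumes every token occurs verbatim in `text`,
    in order, possibly separated by arbitrary whitespace.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated words
        # resolve to the correct occurrence.
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "Sam  didn't notice."
tokens = ["Sam", "did", "n't", "notice", "."]
offsets = token_char_offsets(text, tokens)
```

Here `offsets[i]` gives the end-exclusive character range of token `i` in the raw input, so a token span can be converted back to character offsets when needed.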
Fixed: we now use token offsets instead of character offsets, and include the span text in the output.
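As a rough illustration of what a token-offset span with its text might look like, here is a small Python sketch. The function `describe_span` and the field names `start`, `end`, and `text` are assumptions made for this example; they are not taken from the actual output format.

```python
def describe_span(tokens, start, end):
    """Build a record for the token span tokens[start:end] (end exclusive).

    Hypothetical record layout: token offsets plus the span text, so a
    reader can check the span without re-tokenizing the input.
    """
    return {
        "start": start,
        "end": end,
        "text": " ".join(tokens[start:end]),
    }


span = describe_span(["The", "cat", "sat"], 1, 3)
```

Including the text alongside the offsets makes a misaligned span obvious at a glance, whichever tokenization was used.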