Character offsets in the output are useless, because they index into the sentence *after* the tokenization step.

Noahs-ARK / semafor

http://www.ark.cs.cmu.edu/SEMAFOR

GNU General Public License v3.0


Closed: sammthomson closed this 11 years ago

sammthomson commented 11 years ago

We never noticed this because we were always working with pre-tokenized input. But if you use raw text, the character offsets won't match up.

We should probably use token offsets instead, because that's what Semafor works with. It would also help to include the span text in the output.

Could keep a map from token_index to character_index in the original input.
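
A minimal sketch of such a map, in Python for illustration only (this is not Semafor's code). The function names `token_char_spans` and `span_to_chars` are hypothetical, and the sketch assumes every token appears verbatim and in order in the raw input. A tokenizer that rewrites characters (for example, one that normalizes quotes or brackets) would need a smarter alignment.

```python
from typing import List, Tuple


def token_char_spans(raw: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return a half-open (start, end) character span in `raw` for each token.

    Assumption: each token occurs verbatim in `raw`, in the same order.
    """
    spans = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = raw.find(token, cursor)
        if start == -1:
            raise ValueError(
                f"token {token!r} not found at or after offset {cursor}"
            )
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def span_to_chars(
    spans: List[Tuple[int, int]], first_token: int, last_token: int
) -> Tuple[int, int]:
    """Convert an inclusive token span into a half-open character span."""
    return spans[first_token][0], spans[last_token][1]


raw = "I didn't go."
tokens = ["I", "did", "n't", "go", "."]  # hypothetical tokenizer output
spans = token_char_spans(raw, tokens)

# Tokens 1..2 ("did", "n't") map back to the original substring.
start, end = span_to_chars(spans, 1, 2)
print(raw[start:end])  # didn't
```

With a table like this, the output can report token offsets while still letting a caller recover the matching characters in the untokenized input.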

sammthomson commented 11 years ago

Fixed: we now use token offsets instead of character offsets, and include the span text in the output.