asahi417 / tner

Language model fine-tuning on NER with an easy interface and cross-domain evaluation. "T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition, EACL 2021"
https://aclanthology.org/2021.eacl-demos.7/
MIT License
376 stars 41 forks

Offsets returned by model.predict are not usable if there is whitespace in the text. #23

Closed johann-petrak closed 2 years ago

johann-petrak commented 2 years ago

This is a huge problem, because I would like to create stand-off annotations for the detected entities in the original document:

For example, the sentence may look like this

txt = "          Microsoft and Apple" # (starting with ten spaces)

then what model.predict([txt]) returns is:

[{'entity': [
  {'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553}, 
  {'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}], 
'sentence': 'Microsoft and Apple'}]

As can be seen, the leading whitespace has also been removed in the returned "sentence" field.

This also happens if whitespace is in the middle of the sentence e.g.

txt = "Microsoft          and Apple" # (ten spaces after Microsoft)

returns

[{'entity': 
  [{'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553}, 
  {'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}], 
'sentence': 'Microsoft and Apple'}]

Again, the returned sentence text contains a single space where the original text contained ten.

This makes it hard to reliably map the offsets back to the true offsets in the original text. It is also not clear which other characters would cause the offsets to change. Is there a way to guarantee getting back the proper offsets, or at least to get information about which characters in the original text have been removed? Where exactly does this happen in the code?
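The shift can be reproduced with a plain whitespace collapse, which is presumably what the preprocessing does (an assumption; the actual normalization code in tner may handle more characters):

```python
import re

txt = "Microsoft          and Apple"  # ten spaces after "Microsoft"

# Hypothetical stand-in for tner's normalization: collapse whitespace
# runs to a single space and strip the ends.
normalized = re.sub(r"\s+", " ", txt).strip()

print(normalized)         # "Microsoft and Apple"
print(normalized[14:19])  # "Apple" -- the span reported by predict()
print(txt[14:19])         # "     " -- the same span in the original text
```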

johann-petrak commented 2 years ago

Another good option would be to accept already-tokenized input, i.e. instead of a batch of texts, pass a batch of token lists and get back, for each entity chunk, the list of included tokens instead of included characters.

johann-petrak commented 2 years ago

As a workaround I am using an alignment algorithm to align text from the original string to the text returned as "sentence" and adapt the offsets accordingly, but this assumes that only whitespace gets trimmed and collapsed. If this also happens with other unicode characters, the offsets will still be wrong.

Is it known exactly which characters in the string may get removed or reduced to a single one?
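Assuming the only transformations are whitespace collapsing and stripping (the same assumption the workaround above makes), the alignment can be sketched as a walk over both strings that maps each position in the returned sentence back to the original text. `build_offset_map` is a hypothetical helper name, not part of tner:

```python
def build_offset_map(original: str, normalized: str) -> list[int]:
    """Map each index of `normalized` to its index in `original`,
    assuming `normalized` was produced only by collapsing/stripping
    whitespace (any other normalization would break this walk)."""
    mapping = []
    j = 0  # cursor into the original string
    for ch in normalized:
        if ch == " ":
            # one normalized space may stand for a run of whitespace
            while j < len(original) and not original[j].isspace():
                j += 1
        else:
            while j < len(original) and original[j] != ch:
                j += 1
        mapping.append(j)
        j += 1
    return mapping

txt = "Microsoft          and Apple"
sent = "Microsoft and Apple"       # what predict() returns as "sentence"
m = build_offset_map(txt, sent)
start, end = 14, 19                # span reported for "Apple"
print(txt[m[start]:m[end - 1] + 1])  # "Apple"
```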

asahi417 commented 2 years ago

Hi, thanks so much for working around the issue; this is definitely not healthy behavior. In the code that normalizes the half-width spaces, I'll try to keep a record of the pre-processing so that we can restore the original input and adjust the predicted entity spans to the correct offsets.
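Keeping that pre-processing record could look like the following sketch: normalization that stores, for each kept character, its index in the original text, so predicted spans can be projected back. These are hypothetical function names, not tner's actual implementation:

```python
def normalize_with_map(text: str):
    """Collapse whitespace runs to single spaces and strip the ends,
    recording each kept character's index in the original text.
    A sketch of the bookkeeping only, not tner's real preprocessing."""
    chars, index_map = [], []
    for i, ch in enumerate(text):
        if ch.isspace():
            # keep at most one space, and none at the start
            if chars and chars[-1] != " ":
                chars.append(" ")
                index_map.append(i)
        else:
            chars.append(ch)
            index_map.append(i)
    if chars and chars[-1] == " ":  # drop a trailing space
        chars.pop()
        index_map.pop()
    return "".join(chars), index_map

def restore_span(index_map, start, end):
    """Project a [start, end) span on the normalized text back to the original."""
    return index_map[start], index_map[end - 1] + 1

txt = "          Microsoft and Apple"
norm, imap = normalize_with_map(txt)
s, e = restore_span(imap, 0, 9)  # span reported for "Microsoft"
print(txt[s:e])                  # "Microsoft"
```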

JaouadMousser commented 2 years ago

Hi all, in my case the offset information is entirely unreliable. For example:

text = """This is a test. Policyholder: AXA Winterthur. Period: 12.02.2021 to 11.02.2022. Inception date: 10.02.2022. Expiration date: 12.10.2022. Country: Germany. City: Köln."""

preds = [{'entity': [
  {'type': 'PARTNER', 'position': [31, 41], 'mention': 'AXA Winter', 'probability': 0.7145251780748367},
  {'type': 'PERIOD', 'position': [56, 62], 'mention': '12. 02', 'probability': 0.8630470236142477},
  {'type': 'PERIOD', 'position': [64, 78], 'mention': '2021 to 11. 02', 'probability': 0.9743085741996765},
  {'type': 'PERIOD', 'position': [80, 84], 'mention': '2022', 'probability': 0.9162776470184326},
  {'type': 'INCEPTION', 'position': [103, 115], 'mention': '10. 02. 2022', 'probability': 0.826356315612793},
  {'type': 'EXPIRATION', 'position': [135, 147], 'mention': '12. 10. 2022', 'probability': 0.8099642634391785},
  {'type': 'COUNTRY', 'position': [159, 166], 'mention': 'Germany', 'probability': 0.9960137605667114}],
'sentence': 'This is a test. Policyholder : AXA Winterthur. Period : 12. 02. 2021 to 11. 02. 2022. Inception date : 10. 02. 2022. Expiration date : 12. 10. 2022. Country : Germany. City : Köln.'}]

print(text[31:41]) ==> 'XA Wintert'

johann-petrak commented 2 years ago

You can find my workaround here: https://github.com/GateNLP/python-gatenlp-ml-tner/blob/cb367881516b7d130aa888bda126a7a494828cf6/gatenlp_ml_tner/annotators.py#L79

It is based on the assumption that only leading and repeated whitespace causes the misalignments, and it uses the nltk align_tokens method to help.

However, getting the correct offsets right away would obviously be much better. Note that all "fast" tokenizers in the huggingface library can give you the original character offsets for each token: the library offers Encoding.token_to_chars(tokenidx) and similar methods to help with this.
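The idea behind the nltk align_tokens method mentioned above can be sketched in plain Python: search for each token left to right in the original string and record its character span. This is a simplified re-implementation of the concept, not nltk's actual code:

```python
def align_tokens(tokens, text):
    """Return (start, end) character spans for each token, searching
    left to right in `text`. Simplified sketch of the idea behind
    nltk.tokenize.util.align_tokens; raises if a token is not found."""
    spans, cursor = [], 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end
    return spans

txt = "          Microsoft and Apple"
spans = align_tokens(["Microsoft", "and", "Apple"], txt)
print(spans)  # [(10, 19), (20, 23), (24, 29)]
```

Because the spans are computed against the untouched original string, any whitespace that the model's preprocessing removed no longer matters for the stand-off annotations.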