Another good option would be the ability to pass already tokenized text, i.e. instead of a batch of texts, pass a batch of token lists and get back, for each chunk, the list of included tokens instead of included characters.
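To illustrate, something like this hypothetical interface (neither the pretokenized flag nor the token_span field exist in tner; this is just a sketch of the idea):

```python
# Purely hypothetical -- neither the pretokenized flag nor the
# token_span field exist in tner; this only illustrates the proposal.
tokens = ["Policyholder", ":", "AXA", "Winterthur", "."]
preds = model.predict([tokens], pretokenized=True)
# Each entity would then reference token indices instead of characters:
# {'type': 'PARTNER', 'token_span': [2, 4], 'mention': 'AXA Winterthur'}
```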
As a workaround I am using an alignment algorithm to align the text of the original string with the text returned as "sentence" and adapt the offsets accordingly, but this assumes that only whitespace gets trimmed and reduced. If the same happens with other Unicode characters, the offsets will still be wrong.
Is it known exactly which characters in the string may get removed or reduced to a single one?
Hi, thanks so much for working around the issue, and this is definitely not healthy behavior. In the code that normalizes the half-spaces, I'll try to keep the pre-processing information, so that we can restore the original input and offsets and adjust the predicted entity spans correctly.
Hi all, in my case the offset information is entirely unreliable. Example:
text = """This is a test. Policyholder: AXA Winterthur. Period: 12.02.2021 to 11.02.2022. Inception date: 10.02.2022. Expiration date: 12.10.2022. Country: Germany. City: Köln."""
preds = [{'entity': [
    {'type': 'PARTNER', 'position': [31, 41], 'mention': 'AXA Winter', 'probability': 0.7145251780748367},
    {'type': 'PERIOD', 'position': [56, 62], 'mention': '12. 02', 'probability': 0.8630470236142477},
    {'type': 'PERIOD', 'position': [64, 78], 'mention': '2021 to 11. 02', 'probability': 0.9743085741996765},
    {'type': 'PERIOD', 'position': [80, 84], 'mention': '2022', 'probability': 0.9162776470184326},
    {'type': 'INCEPTION', 'position': [103, 115], 'mention': '10. 02. 2022', 'probability': 0.826356315612793},
    {'type': 'EXPIRATION', 'position': [135, 147], 'mention': '12. 10. 2022', 'probability': 0.8099642634391785},
    {'type': 'COUNTRY', 'position': [159, 166], 'mention': 'Germany', 'probability': 0.9960137605667114}],
  'sentence': 'This is a test. Policyholder : AXA Winterthur. Period : 12. 02. 2021 to 11. 02. 2022. Inception date : 10. 02. 2022. Expiration date : 12. 10. 2022. Country : Germany. City : Köln.'}]
print(text[31:41]) ==> 'XA Wintert'
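As a quick sanity check, the mentions can at least be re-located in the original text by treating whitespace in the returned mention as flexible. A rough sketch (find_in_original is just an illustrative helper, not part of tner, and a real version would have to search forward from the previous match to handle repeated mentions):

```python
import re

def find_in_original(mention, original, start=0):
    # Treat every whitespace run in the returned mention as optional
    # whitespace, since the model may have inserted or collapsed spaces.
    pattern = r"\s*".join(re.escape(part) for part in mention.split())
    m = re.search(pattern, original[start:])
    return (start + m.start(), start + m.end()) if m else None

text = "This is a test. Policyholder: AXA Winterthur. Period: 12.02.2021 to 11.02.2022."
span = find_in_original("12. 02", text)
print(span, repr(text[span[0]:span[1]]))  # finds "12.02" in the original
```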
You can find my workaround here: https://github.com/GateNLP/python-gatenlp-ml-tner/blob/cb367881516b7d130aa888bda126a7a494828cf6/gatenlp_ml_tner/annotators.py#L79
It is based on the assumption that only multiple and leading whitespace causes the misalignments, and it uses the nltk align_tokens method for help.
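The core idea of that workaround, simplified (a sketch, not the exact code from the repository, and it assumes the model only trims or collapses whitespace):

```python
from nltk.tokenize.util import align_tokens

original = "Policyholder:      AXA   Winterthur."   # raw text with extra whitespace
returned = "Policyholder : AXA Winterthur."          # "sentence" as returned by the model

# align_tokens locates each token in order via str.find, so it tolerates
# the extra (or differently placed) whitespace in the original string.
tokens = returned.split()
spans = align_tokens(tokens, original)
for tok, (start, end) in zip(tokens, spans):
    print(tok, (start, end), repr(original[start:end]))
```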
However, getting the correct offsets right away would obviously be much better. Note that all "fast" tokenizers in the Hugging Face library can give you the original offsets for each transformer token, as the library offers Encoding.token_to_chars(tokenidx) and similar methods to help with this.
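For illustration, a minimal sketch with a generic fast tokenizer (bert-base-cased here just as an example; tner would use the tokenizer of its underlying model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
text = "Policyholder:      AXA   Winterthur."
enc = tok(text)

for idx, token in enumerate(enc.tokens()):
    span = enc.token_to_chars(idx)      # None for special tokens like [CLS]
    if span is not None:
        # These offsets index into the ORIGINAL, un-normalized string.
        print(token, (span.start, span.end), repr(text[span.start:span.end]))
```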
This is a huge problem, because I would like to create stand-off annotations for the detected entities in the original document:
For example, the sentence may start with leading whitespace; in what model.predict([txt]) returns, that leading whitespace has been stripped from the returned "sentence" field as well.
This also happens if the whitespace is in the middle of the sentence: again, the returned sentence text contains a single space where the original text contained 10.
This makes it hard to reliably map the offsets back to the true offsets in the original text. It is also not clear which other characters could cause changes to the offsets. Is there a way to guarantee getting back the proper offsets, or at least to get information about which characters in the original text have been removed? Where exactly does this happen in the code?
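In case it helps, here is a minimal sketch of the kind of bookkeeping that would make this recoverable, assuming whitespace trimming and collapsing is the only transformation applied (which is exactly the assumption I cannot verify):

```python
def normalize_with_map(text):
    """Collapse whitespace runs to a single space and trim, recording for
    each character of the normalized string its index in the original.
    Assumes whitespace handling is the ONLY transformation applied."""
    norm_chars, index_map = [], []
    prev_space = True                     # True so leading whitespace is dropped
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:
                norm_chars.append(" ")
                index_map.append(i)
                prev_space = True
        else:
            norm_chars.append(ch)
            index_map.append(i)
            prev_space = False
    if norm_chars and norm_chars[-1] == " ":   # drop a trailing space
        norm_chars.pop()
        index_map.pop()
    return "".join(norm_chars), index_map

original = "   Policyholder:      AXA   Winterthur.  "
norm, idx = normalize_with_map(original)
start, end = 14, 17                        # "AXA" in the normalized string
print(repr(norm[start:end]), (idx[start], idx[end - 1] + 1))  # 'AXA' (22, 25)
```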