Open gunturbudi opened 3 years ago
I managed to solve the problem with this modification in utils.py. It basically checks what I mentioned in the issue above: whether the last character of the previous word is the same as the first character of the next word.
from typing import List, Optional

def get_offsets(
        text: str,
        tokens: List[str],
        start: Optional[int] = 0) -> List[int]:
    """Calculate the character offset of each token.

    Args:
        text (str): The string before tokenization.
        tokens (List[str]): The list of tokens. Each string corresponds
            to one token.
        start (Optional[int]): The start position.

    Returns:
        (List[int]): The list of offsets.
    """
    offsets = []
    i = 0
    for k, token in enumerate(tokens):
        # Guard with k > 0 so the first token is never compared
        # against tokens[-1] (the last token).
        same_char = k > 0 and token[0] == tokens[k - 1][-1]
        for j, char in enumerate(token):
            # If the previous token ended with this token's first character,
            # force the cursor past it so the same position isn't matched twice.
            while char != text[i] or same_char:
                i += 1
                same_char = False
            if j == 0:
                offsets.append(i + start)
    return offsets
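As a quick sanity check, here is the patched function (restated so the snippet is self-contained) run against the opening of the sentence from the original report, where "a" and "administrator" share a boundary character:

```python
from typing import List, Optional

def get_offsets(text: str, tokens: List[str], start: Optional[int] = 0) -> List[int]:
    """Patched version: skips past a duplicated boundary character."""
    offsets = []
    i = 0
    for k, token in enumerate(tokens):
        same_char = k > 0 and token[0] == tokens[k - 1][-1]
        for j, char in enumerate(token):
            while char != text[i] or same_char:
                i += 1
                same_char = False
            if j == 0:
                offsets.append(i + start)
    return offsets

text = "As a administrator"
print(get_offsets(text, text.split()))  # → [0, 3, 5]
```

"administrator" now gets offset 5 instead of repeating 3.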
I don't know if it's the best solution, but it works for me, and luckily my NER model improved :)
Regards
Hello, I want to report a bug. I have a Sentence like this:
As a administrator, I want to refund sponsorship money that was processed via stripe, so that people get their monies back.
When I try to convert it to CoNLL, the span is not converted correctly. I then debugged the library and found that the offsets are wrong. Here is the output of the offsets:
As you can see, in the second and third lines the offset is the same (3 and 3, while it should be 3 and 5). This behavior makes the span undetected during the conversion process.
It seems that the get_offsets function in utils.py checks the equality of the character sequences to decide the offsets. This becomes a problem if the last character of the previous word is the same as the first character of the next word. I'm still looking for a fix to this problem.
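For reference, the character-equality check can be reduced to roughly this sketch (a reconstruction for illustration, with a hypothetical name `get_offsets_naive`, not the library's exact code); it reproduces the wrong offsets whenever a word boundary repeats a character:

```python
from typing import List

def get_offsets_naive(text: str, tokens: List[str], start: int = 0) -> List[int]:
    """Reconstructed offset logic: advance only while characters differ."""
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            # i advances only while the characters differ, so if the previous
            # token ended with this token's first character, i never moves
            # and the same position is reported twice.
            while char != text[i]:
                i += 1
            if j == 0:
                offsets.append(i + start)
    return offsets

print(get_offsets_naive("As a administrator", ["As", "a", "administrator"]))
# → [0, 3, 3]  ("a" and "administrator" get the same offset)
```

Because "a" ends with the same character that "administrator" starts with, the cursor is still parked on the "a" at offset 3 when the next token begins, so both tokens report offset 3.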
Cheers