Closed — Rajmehta123 closed this issue 4 years ago
Your problem involves tokenization and detokenization. If your detokenizer fits with the tokenizer, this problem will not occur. For example, if you use `word_tokenize` from nltk, then the detokenizer should be `TreebankWordDetokenizer` from `nltk.tokenize.treebank`.
I have not used any tokenizer explicitly; I am using the default DeepPavlov pipeline, which I assume uses the Hugging Face tokenizer. I am using the pre-trained model provided by DeepPavlov:

`build_model(configs.ner.ner_ontonotes_bert, download=False)`

The input to this is sentences and the output is tokens with labels. Can I use custom tokens and feed them to the pre-trained model? If yes, can I use the nltk tokenizer, which tokenizes on spaces, or do I need a BERT-based tokenizer?
I tried to detokenize using `TreebankWordDetokenizer`, but it still did not form the original sentence. For example:

Original sentence -> `parties. \n \n IN WITNESS WHEREOF, the parties hereto`
Tokenized and detokenized sentence -> `parties . IN WITNESS WHEREOF, the parties hereto`

Another example:

Original sentence -> `Group’s employment, Group shall be`
Tokenized and detokenized sentence -> `Group ’ s employment, Group shall be`

Note that the newlines are stripped and the period is detached when round-tripping through `TreebankWordDetokenizer`.
> Can I use custom tokens and feed them to the pre-trained model? If yes, can I use the nltk tokenizer, which tokenizes on spaces, or do I need a BERT-based tokenizer?
Yes, you can use custom tokens and feed them to the pre-trained model without the support of any tokenizer. In DeepPavlov, we have `bert_ner_preprocessor`, which takes raw words as input and splits them into BERT subtokens.
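For intuition, here is a toy sketch of the kind of WordPiece splitting such a preprocessor performs on already-tokenized words. The vocabulary and function below are invented for illustration; this is not DeepPavlov's actual `bert_ner_preprocessor`:

```python
# Toy greedy longest-match-first WordPiece splitter. The vocabulary is
# deliberately tiny and invented; real BERT vocabularies have ~30k entries.
VOCAB = {"Secur", "##ities", "and", "Exchange", "Comm", "##ission", "[UNK]"}

def wordpiece(word):
    """Split one word into subtokens, longest vocabulary match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary entry matched at this position
        start = end
    return pieces

words = ["Securities", "and", "Exchange", "Commission"]
print([p for w in words for p in wordpiece(w)])
# → ['Secur', '##ities', 'and', 'Exchange', 'Comm', '##ission']
```

The model then predicts a tag per subtoken, and the preprocessor's bookkeeping (which subtokens came from which word) is what lets the pipeline report one tag per original word.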
> I tried to detokenize using `TreebankWordDetokenizer` but it still did not form the original sentence. Note that period and newlines are stripped.
I just took `word_tokenize` and the detokenizer from nltk as an example. I mean that the tokenizer and detokenizer need to fit with each other. Text like `parties. \n \n IN WITNESS WHEREOF, the parties hereto` should be preprocessed before feeding it to the model.
original text -> preprocessor -> NER model -> list of tags
list of tags, original text -> postprocessor -> tag sequence
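One way to implement such a postprocessor without detokenizing at all is to record each token's character offsets in the original text, then slice entity strings out of the text verbatim. A minimal sketch, assuming simple whitespace tokenization; the helper names are illustrative, not part of DeepPavlov:

```python
def tokens_with_offsets(text):
    """Whitespace-tokenize while recording (token, start, end) char offsets."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((tok, start, start + len(tok)))
        pos = start + len(tok)
    return tokens

def entity_spans(text, tags):
    """Slice BIO-tagged entities out of the original text, spacing intact."""
    spans, current = [], None  # current = [start, end, label]
    for (tok, start, end), tag in zip(tokens_with_offsets(text), tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[2]:
            current[1] = end  # extend the running entity to this token
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(text[s:e], label) for s, e, label in spans]

text = "the U.S. Securities and Exchange Commission"
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(entity_spans(text, tags))
# → [('U.S. Securities and Exchange Commission', 'ORG')]
```

Because the entity string is a slice of the original text, punctuation and spacing come out exactly as the user typed them, and no detokenizer is needed.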
Got it. Sounds good. Let me try that; if it works, I will close the issue. Also, is there any parameter to indicate that I will pass tokens rather than raw text to the pre-trained model? Or do I have to change the source code of `bert_ner_preprocessor` to remove the tokenization step?
Thank you for your help.
I figured out a solution to this problem.
```python
def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation: attach without a space
            else:
                res += ' ' + token  # regular word: separate with a space
    return res
```
```python
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # ... just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            # Flush any entity in the buffer, then store the non-entity token
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Sort and drop consecutive duplicates
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    return collapsed_result
```
Update

This will solve most of the cases, but there will always be outliers. For example, the tags for the sentence "U.S. Securities and Exchange Commission" are

`['U.S.', 'B-ORG'] ['Securities', 'I-ORG'] ['and', 'I-ORG'] ['Exchange', 'I-ORG'] ['Commission', 'I-ORG']`

and running `collapse` turns the sentence into "U.S.Securities and Exchange Commission".
So the complete solution is to track the identity of the word that created each token, i.e. to build a LUT (lookup table) for the original sentence. Thus:

```python
text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U", 0), (".", 0), ("S", 0), (".", 0), ("Securities", 1), ("and", 2), ("Exchange", 3), ("Commission", 4)]
```

Now, given a token index, you know exactly which word it came from, and you can simply concatenate tokens that belong to the same word, adding a space only when a token belongs to a different word. So the NER result would be something like:

```python
[["U", "B-ORG", 0], [".", "I-ORG", 0], ["S", "I-ORG", 0], [".", "I-ORG", 0],
 ["Securities", "I-ORG", 1], ["and", "I-ORG", 2], ["Exchange", "I-ORG", 3], ["Commission", "I-ORG", 4]]
```
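Putting the LUT idea together, here is a self-contained sketch. A toy regex tokenizer stands in for the model's real tokenizer, and the function names are illustrative only:

```python
import re
from itertools import groupby

def tokenize(word):
    """Toy stand-in tokenizer: splits off punctuation from alphanumeric runs."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", word)

def build_lut(text):
    """Map every token to the index of the whitespace word it came from."""
    return [(tok, ix) for ix, word in enumerate(text.split())
            for tok in tokenize(word)]

def join_entity(tagged):
    """Join (token, tag, word_ix) triples: no space within a word,
    one space between tokens from different words."""
    parts = []
    for _, grp in groupby(tagged, key=lambda t: t[2]):
        parts.append("".join(tok for tok, _, _ in grp))
    return " ".join(parts)

text = "U.S. Securities and Exchange Commission"
lut = build_lut(text)
# Attach tags as if the whole span were one ORG entity
ner = [(tok, "B-ORG" if i == 0 else "I-ORG", ix)
       for i, (tok, ix) in enumerate(lut)]
print(join_entity(ner))
# → U.S. Securities and Exchange Commission
```

Because the word index decides where spaces go, "U", ".", "S", "." collapse back into "U.S." instead of "U . S .", which is exactly the failure mode of the punctuation heuristic in `join_tokens`.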
Is there any way to combine the BIO tokens into compound words? I implemented a method to combine words, but it does not work well for words with punctuation. For example, S.E.C using the function above will be joined as S . E . C. Any workaround to form compound words?