alexrus closed this issue 4 years ago
Hi @alexrus, sorry for the late reply.
We assumed that the input string already comes in the form of a basic tokenized string (such as that provided by nltk).
E.g., instead of "I am Mr. John." we expect "I am Mr. John ." as input.
So, the use case of the above function is to avoid splitting abbreviations like "Mr.". Cases like "John." are handled in the basic tokenization step. (Almost all the datasets in our experiments were already tokenized in this manner.)
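To illustrate the difference between the two kinds of input, here is a minimal sketch. It is not the actual code from `tokenization.py`: the helper names (`keep_abbreviation`, `split_token`, `tokenize`) and the exact rule (a capitalized alphabetic token with one trailing period is left intact) are assumptions made only for this example.

```python
import string
from typing import List


def keep_abbreviation(token: str) -> bool:
    """Hypothetical check: capitalized alphabetic token ending in one period, e.g. 'Mr.'."""
    return (
        len(token) > 1
        and token[0].isupper()
        and token.endswith(".")
        and token[:-1].isalpha()
    )


def split_token(token: str) -> List[str]:
    """Split punctuation off a token unless it looks like an abbreviation."""
    if keep_abbreviation(token):
        return [token]
    pieces, current = [], ""
    for ch in token:
        if ch in string.punctuation:
            # Flush the characters collected so far, then emit the punctuation mark.
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def tokenize(text: str) -> List[str]:
    """Whitespace-split the text, then apply split_token to each piece."""
    return [piece for tok in text.split() for piece in split_token(tok)]


# Pre-tokenized input: the final period is already a separate token.
print(tokenize("I am Mr. John ."))
# Raw input: "John." matches the abbreviation rule and is left unsplit.
print(tokenize("My name is John."))
```

With pre-tokenized input, the sentence-final period never reaches the abbreviation check attached to a word, so only genuine abbreviations such as "Mr." are kept whole. With raw input, a check of this kind cannot tell "John." from "Mr.", which is why the basic tokenization step is expected to run first.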
What is the use case for the following check: https://github.com/awasthiabhijeet/PIE/blob/91e02ba2cd37a4b55fb52fa8759d20fb8989cfc2/tokenization.py#L100
If the sentence ends in a capitalized word followed by a dot, the word and the dot are treated as a single token, and another dot is added: "My name is John." -> "My", "name", "is", "John.", "."
The strange thing is that this is only applied to ".", but not to "!" or "?".
@awasthiabhijeet can you tell me what the purpose of this was?