awasthiabhijeet / PIE

Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)
MIT License

usage for last_dot_first_capital? #20

alexrus closed this issue 4 years ago

alexrus commented 4 years ago

What is the use case for the following check: https://github.com/awasthiabhijeet/PIE/blob/91e02ba2cd37a4b55fb52fa8759d20fb8989cfc2/tokenization.py#L100

If the sentence ends in a capitalized word followed by a dot, the word and the dot are treated as a single token, and another dot is added: "My name is John." -> "My", "name", "is", "John.", "."

The strange thing is that this applies only to ".", not to "!" or "?".

@awasthiabhijeet can you tell me what the purpose of this was?

awasthiabhijeet commented 4 years ago

Hi @alexrus, sorry for the late reply.

We assumed that the input string already comes as a basic tokenized string (such as that produced by nltk). For example, instead of "I am Mr. John." we expect "I am Mr. John ." as input.

So the use case of the above function is to avoid splitting abbreviations like "Mr.". Cases like "John." are handled in the basic tokenization step. (Almost all the datasets in our experiments were already tokenized in this manner.)
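To make the behaviour described above concrete, here is a minimal sketch of such a check. It is an illustration only, not the code in tokenization.py: the names `looks_like_abbreviation`, `split_punctuation`, and `tokenize` are hypothetical, and the rule (a capitalized token of more than one character that ends in a dot is kept whole) is an assumption based on this thread.

```python
import string
from typing import List


def looks_like_abbreviation(token: str) -> bool:
    """Hypothetical check: capitalized token, longer than one char, ending in '.'."""
    return len(token) > 1 and token[0].isupper() and token.endswith(".")


def split_punctuation(token: str) -> List[str]:
    """Split punctuation into separate pieces, except for abbreviation-like tokens."""
    if looks_like_abbreviation(token):
        return [token]  # keep "Mr." intact instead of producing "Mr", "."
    pieces: List[str] = []
    current = ""
    for ch in token:
        if ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)  # each punctuation character becomes its own piece
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def tokenize(pre_tokenized: str) -> List[str]:
    """Assumes the input is already whitespace-separated by a basic tokenizer."""
    out: List[str] = []
    for tok in pre_tokenized.split():
        out.extend(split_punctuation(tok))
    return out


# Pre-tokenized input: the final "." is already its own token.
print(tokenize("I am Mr. John ."))
# Raw input: "John." passes the same check as "Mr." and stays fused.
print(tokenize("I am Mr. John."))
```

Under these assumptions, the check cannot tell "Mr." from a sentence-final "John.", which is why the sentence-final dot has to be separated beforehand. Tokens ending in "!" or "?" never match the check, so they are split as usual.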