1-800-BAD-CODE / punctuators

Package for inference for punctuation, true-casing, and sentence boundary detection

Outputting OOV tokens as is #6

Open mgoldenbe opened 10 months ago

mgoldenbe commented 10 months ago

I realize that one is supposed to remove punctuation marks from the input text before using the model. But what should we do with tokens such as "24/7", "R&D", and "9-11" in the input text? There are potentially many such tokens, and it is hard to catch all of them in preprocessing. Is it possible to have OOV tokens appear in the output verbatim, as they appear in the input, instead of as <unk>?
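
To illustrate the behaviour being asked for, here is a minimal post-processing sketch. It does not use the package's API. It assumes that the output keeps one whitespace-separated token per input token, that the model only appends punctuation to the end of a token, and that the unknown-token marker is a fixed string passed in as a parameter. The function name `restore_oov` is hypothetical.

```python
def restore_oov(input_text: str, output_text: str, unk: str = "<unk>") -> str:
    """Put the original input token back wherever the output contains `unk`.

    Assumption: input and output have the same number of whitespace-separated
    tokens, so token i of the output corresponds to token i of the input.
    """
    src = input_text.split()
    out = output_text.split()
    if len(src) != len(out):
        # The alignment assumption does not hold; leave the output untouched.
        return output_text

    restored = []
    for src_tok, out_tok in zip(src, out):
        if unk in out_tok:
            # Keep any punctuation the model attached to the end of the token.
            trailing = out_tok[len(out_tok.rstrip(".,?!;:")):]
            restored.append(src_tok + trailing)
        else:
            restored.append(out_tok)
    return " ".join(restored)


if __name__ == "__main__":
    print(restore_oov("we are open 24/7 for r&d",
                      "We are open <unk> for <unk>."))
```

A restored token keeps its input casing, since any true-casing the model would have applied to it is lost along with the token itself.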