maia closed this 8 years ago
P.S. The above list contains all single letters, as each of these can be an abbreviated first name.
Thanks! Added these in: https://github.com/diasks2/pragmatic_tokenizer/commit/0a8ecfdae1e87a632c8f99d81b998011b282525f
Thanks!
I suggest using the following abbreviations array. It is based on the old one, minus some abbreviations that I don't think are commonly used, plus abbreviations that occur frequently in a subset of my tweets database.
If this array is too long, I can query my complete (but still non-representative) database of tweets and remove the abbreviations that occur least often.
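A rough sketch of how that frequency-based pruning could look, in case it helps. The helper names `abbreviation_counts` and `prune_abbreviations` and the `limit` parameter are made up for illustration and are not part of pragmatic_tokenizer. It assumes the tweets are available as an array of strings and that an abbreviation is a run of letters directly followed by a period, so multi-period forms such as "e.g." would need extra handling.

```ruby
require 'set'

# Hypothetical helper: count how often each candidate abbreviation occurs
# in the tweets, where an occurrence is the abbreviation followed by a period.
def abbreviation_counts(tweets, abbreviations)
  candidates = abbreviations.map(&:downcase).to_set
  counts = Hash.new(0)
  tweets.each do |text|
    # Match runs of letters that are immediately followed by a period.
    text.downcase.scan(/\b[[:alpha:]]+(?=\.)/) do |token|
      counts[token] += 1 if candidates.include?(token)
    end
  end
  counts
end

# Hypothetical helper: keep only the `limit` most frequent abbreviations.
# Abbreviations that never occur count as zero.
def prune_abbreviations(tweets, abbreviations, limit)
  counts = abbreviation_counts(tweets, abbreviations)
  abbreviations.map(&:downcase).uniq.max_by(limit) { |abbr| counts[abbr] }
end
```

For example, `prune_abbreviations(tweets, abbreviations, 200)` would return the 200 abbreviations seen most often in `tweets`; the cutoff of 200 is arbitrary.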