diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License

Updated German abbreviations #3

Closed maia closed 8 years ago

maia commented 8 years ago

I suggest using the following abbreviations array. It is based on the old one, minus some abbreviations that I don't think are commonly used, plus abbreviations that occur frequently in a subset of my tweets database.

  ABBREVIATIONS = ['a', 'a.d', 'a.k.a', 'a.s.a.p', 'abg', 'alt', 'apr', 'art', 'aug', 'b',
                   'b.a', 'b.s', 'best', 'bgm', 'bldg', 'btw', 'buchst', 'bzgl', 'bzw', 'c',
                   'ca', 'co', 'd', 'd.d', 'd.h', 'd.r', 'dergl', 'dez', 'dgl', 'dr',
                   'dt', 'dzt', 'e', 'e.l', 'e.u', 'e.v', 'ehem', 'eig', 'etc', 'etc.p.p',
                   'eu', 'europ', 'ev', 'evtl', 'f', 'f.d', 'feat', 'feb', 'ff',
                   'fr', 'frz', 'ft', 'g', 'gg', 'ggf', 'ggü', 'griech', 'h', 'h.c', 'h.p',
                   'hon', 'hosp', 'hr', 'i', 'i.a', 'i.d', 'i.d.r', 'i.f', 'i.p', 'i.z',
                   'ii', 'iii', 'inkl', 'int', 'iv', 'ix', 'j', 'jan', 'jul', 'jun', 'k',
                   'k.a', 'k.i.z', 'k.o', 'k.u.k', 'kath', 'l', 'l.a', 'lfd', 'lt', 'ltd',
                   'm', 'm.e', 'm.w', 'mag', 'max', 'me', 'med', 'mind', 'mio', 'mme', 'mr',
                   'mrd', 'mrs', 'ms', 'mwst', 'mär', 'n', 'nov', 'nr', 'o', 'o.k', 'o.ä',
                   'oct', 'okt', 'omg', 'oö', 'p', 'p.a', 'p.m', 'p.s', 'p.t', 'pol', 'pp',
                   'prof', 'präs', 'q', 'r', 'r.i.p', 'r.r', 'ranz', 'rd', 'rep', 'rt',
                   'russ', 's', 's.g', 'sen', 'sep', 'sog', 'st', 'std', 'str', 't', 'türk',
                   'u', 'u.a', 'u.a.m', 'u.a.v', 'u.k', 'u.s', 'u.s.w', 'u.u',
                   'u.v.a', 'u.v.m', 'u.ä', 'ungar', 'usf', 'usw', 'uvm', 'v', 'v.a', 'v.d',
                   'v.m', 'vgl', 'vi', 'vii', 'viii', 'vs', 'w', 'wg', 'wr', 'x', 'xi',
                   'xii', 'xiii', 'xiv', 'xix', 'xv', 'xvi', 'xvii', 'xviii', 'xx', 'y',
                   'z', 'z.b', 'z.t', 'z.z', 'z.zt', 'zb', 'zt', 'zw', 'zzt', 'ä', 'ö',
                   'öffentl', 'öst', 'österr', 'ü'].freeze

If this array is too long, I can query my complete (but still non-representative) database of tweets and remove the abbreviations that occur least frequently.
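The frequency-based pruning described above could look roughly like the following. This is only a sketch: the `CANDIDATES` list, the sample texts, the `min_count` threshold, and the helper names `abbreviation_counts` and `prune` are all made up for illustration, and the real tweets database is not shown here.

```ruby
# Hypothetical candidate list; a stand-in for the full array above.
CANDIDATES = %w[bzw usw z.b dzt].freeze

# Matches a run of letters (optionally with inner periods, as in "z.b")
# that is immediately followed by a period and does not start mid-word.
ABBREVIATION_PATTERN = /(?<![[:alnum:].])([[:alpha:]]+(?:\.[[:alpha:]]+)*)\./

# Counts how often each candidate occurs directly before a period.
def abbreviation_counts(texts, candidates)
  counts = Hash.new(0)
  texts.each do |text|
    text.downcase.scan(ABBREVIATION_PATTERN).flatten.each do |word|
      counts[word] += 1 if candidates.include?(word)
    end
  end
  counts
end

# Keeps only the candidates seen at least min_count times.
def prune(candidates, counts, min_count)
  candidates.select { |abbr| counts[abbr] >= min_count }
end

# Hypothetical sample texts standing in for tweets.
sample_texts = ['Das ist z.B. gut, bzw. schlecht usw.', 'Und bzw. nochmal.']
sample_counts = abbreviation_counts(sample_texts, CANDIDATES)
kept = prune(CANDIDATES, sample_counts, 2)
```

With a threshold like this, the cut-off would have to be picked by looking at the actual frequency distribution, since the database is not representative.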

maia commented 8 years ago

P.S. The above list contains all single letters, as each of these can be an abbreviated first name.
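To illustrate why the single letters matter, here is a minimal sketch of an abbreviation check. It is not how pragmatic_tokenizer does it internally; `ABBREVIATION_SAMPLE` and `abbreviation?` are hypothetical names, and the sample holds only a few entries from the list above.

```ruby
# Hypothetical subset of the proposed list, chosen only for illustration.
ABBREVIATION_SAMPLE = %w[dr j m z.b].freeze

# Returns true when a whitespace-delimited token ends in a period and the
# part before that period is a listed abbreviation, so the period should
# not be treated as the end of a sentence.
def abbreviation?(token, abbreviations = ABBREVIATION_SAMPLE)
  return false unless token.end_with?('.')

  abbreviations.include?(token.chomp('.').downcase)
end

tokens = 'Dr. J. Müller kommt.'.split
flags = tokens.map { |token| abbreviation?(token) }
# flags => [true, true, false, false]
```

Without `j` in the list, the period after the initial in "J. Müller" would look like a sentence boundary.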

diasks2 commented 8 years ago

Thanks! Added these in: https://github.com/diasks2/pragmatic_tokenizer/commit/0a8ecfdae1e87a632c8f99d81b998011b282525f

maia commented 8 years ago

Thanks!