Updated German abbreviations #3

maia commented 8 years ago

I suggest using the following abbreviations array. It is based on the old one, minus some abbreviations I don't think are commonly used, plus abbreviations that occur frequently in a subset of my tweet database.

  ABBREVIATIONS = ['a', 'a.d', 'a.k.a', 'a.s.a.p', 'abg', 'alt', 'apr', 'art', 'aug', 'b',
                   'b.a', 'b.s', 'best', 'bgm', 'bldg', 'btw', 'buchst', 'bzgl', 'bzw', 'c',
                   'ca', 'co', 'd', 'd.d', 'd.h', 'd.r', 'dergl', 'dez', 'dgl', 'dr',
                   'dt', 'dzt', 'e', 'e.l', 'e.u', 'e.v', 'ehem', 'eig', 'etc', 'etc.p.p',
                   'eu', 'europ', 'ev', 'evtl', 'f', 'f.d', 'feat', 'feb', 'ff',
                   'fr', 'frz', 'ft', 'g', 'gg', 'ggf', 'ggü', 'griech', 'h', 'h.c', 'h.p',
                   'hon', 'hosp', 'hr', 'i', 'i.a', 'i.d', 'i.d.r', 'i.f', 'i.p', 'i.z',
                   'ii', 'iii', 'inkl', 'int', 'iv', 'ix', 'j', 'jan', 'jul', 'jun', 'k',
                   'k.a', 'k.i.z', 'k.o', 'k.u.k', 'kath', 'l', 'l.a', 'lfd', 'lt', 'ltd',
                   'm', 'm.e', 'm.w', 'mag', 'max', 'me', 'med', 'mind', 'mio', 'mme', 'mr',
                   'mrd', 'mrs', 'ms', 'mwst', 'mär', 'n', 'nov', 'nr', 'o', 'o.k', 'o.ä',
                   'oct', 'okt', 'omg', 'oö', 'p', 'p.a', 'p.m', 'p.s', 'p.t', 'pol', 'pp',
                   'prof', 'präs', 'q', 'r', 'r.i.p', 'r.r', 'ranz', 'rd', 'rep', 'rt',
                   'russ', 's', 's.g', 'sen', 'sep', 'sog', 'st', 'std', 'str', 't', 'türk',
                   'u', 'u.a', 'u.a.m', 'u.a.v', 'u.k', 'u.s', 'u.s.w', 'u.u',
                   'u.v.a', 'u.v.m', 'u.ä', 'ungar', 'usf', 'usw', 'uvm', 'v', 'v.a', 'v.d',
                   'v.m', 'vgl', 'vi', 'vii', 'viii', 'vs', 'w', 'wg', 'wr', 'x', 'xi',
                   'xii', 'xiii', 'xiv', 'xix', 'xv', 'xvi', 'xvii', 'xviii', 'xx', 'y',
                   'z', 'z.b', 'z.t', 'z.z', 'z.zt', 'zb', 'zt', 'zw', 'zzt', 'ä', 'ö',
                   'öffentl', 'öst', 'österr', 'ü'].freeze

If this array is too long, I can query my complete (but still non-representative) database of tweets and remove the abbreviations that occur least frequently.
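
A rough sketch of how that pruning could work, assuming the tweets are available as an array of strings. The sample `tweets`, the `candidates` excerpt, the `abbreviation_counts` helper, and the `min_occurrences` threshold are all hypothetical and not part of pragmatic_tokenizer:

```ruby
# Hypothetical sample data; a real run would read tweets from the database.
tweets = [
  'Treffen uns ca. 18 Uhr, evtl. später.',
  'Das kostet ca. 5 Mio. Euro, z.B. für Personal.',
  'Siehe Abb. 3 bzw. Tabelle 2.'
]

# Small excerpt of the proposed list, used as pruning candidates.
candidates = %w[ca evtl mio z.b bzw dzt]

# Count how often each candidate occurs as a whitespace-delimited token
# that ends with a period (compared case-insensitively).
def abbreviation_counts(texts, candidates)
  counts = candidates.to_h { |abbr| [abbr, 0] }
  texts.each do |text|
    text.downcase.split(/\s+/).each do |token|
      next unless token.end_with?('.')

      abbr = token.chomp('.')
      counts[abbr] += 1 if counts.key?(abbr)
    end
  end
  counts
end

counts = abbreviation_counts(tweets, candidates)

# Assumed cut-off: drop candidates that never occur in the sample.
min_occurrences = 1
kept = candidates.select { |abbr| counts[abbr] >= min_occurrences }
```

With this sample, `dzt` never occurs and would be dropped. The right threshold depends on the corpus size, so it is left as a parameter here.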

maia commented 8 years ago

P.S. The above list contains all single letters, as each of these can be an abbreviated first name.
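
To illustrate why the single letters matter, here is a minimal sketch of a period-splitting step that consults the list. It is not how pragmatic_tokenizer is actually implemented; `ABBREVIATION_SET` (a small excerpt of the list) and `split_trailing_period` are hypothetical names used only for this example:

```ruby
require 'set'

# Small excerpt of the list above, stored in a Set for fast lookup.
ABBREVIATION_SET = Set.new(%w[dr prof j z.b usw])

# Hypothetical helper: keep the trailing period attached when the word before
# it is a known abbreviation, otherwise split it off as its own token.
def split_trailing_period(token, abbreviations)
  return [token] unless token.end_with?('.')

  stem = token.chomp('.')
  return [token] if stem.empty? || abbreviations.include?(stem.downcase)

  [stem, '.']
end

tokens = 'Dr. J. Müller kommt morgen.'.split.flat_map do |token|
  split_trailing_period(token, ABBREVIATION_SET)
end
# tokens => ["Dr.", "J.", "Müller", "kommt", "morgen", "."]
```

Because `j` is in the set, the initial `J.` stays intact instead of being split into `J` and a sentence-ending period. The trade-off is that a sentence that really ends in a single letter (for example "Plan B.") would keep its period attached as well.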

diasks2 commented 8 years ago

Thanks! Added these in: https://github.com/diasks2/pragmatic_tokenizer/commit/0a8ecfdae1e87a632c8f99d81b998011b282525f

maia commented 8 years ago

Thanks!
