Normalized frequency counts have been added. After some discussion, I chose to normalize to 1M, since we can expect most frequency norms to be based on more than 1M words.
We currently determine the normalized frequency counts by summing the one-smoothed counts in the corpus of choice and dividing that total by 1M; each word's smoothed count is then divided by this normalization term. This gives us the most accurate estimate of the normalization term, since it is exactly the quantity we use when sampling with our built-in samplers.
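A minimal sketch of that normalization, for illustration only; the function and variable names are hypothetical and not the actual API:

```python
def normalize_to_per_million(counts):
    """Convert raw counts to frequencies per million words.

    `counts` maps words to raw frequency counts. Add-one smoothing is
    applied before normalization, mirroring the smoothed counts used
    by the built-in samplers.
    """
    smoothed = {word: count + 1 for word, count in counts.items()}
    # The normalization term: the total smoothed count expressed in millions.
    norm = sum(smoothed.values()) / 1_000_000
    return {word: count / norm for word, count in smoothed.items()}


# Toy counts; by construction the normalized values sum to ~1M.
freqs = normalize_to_per_million({"the": 60_000, "of": 36_000, "zygote": 2})
assert abs(sum(freqs.values()) - 1_000_000) < 1e-3
```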
I ran some tests, and the transformation seems to have the desired effect: for the English and Dutch Celex corpora, the normalized counts sum to a number close to 1M, and the most frequent words in the two corpora end up with very similar frequencies.
Currently we treat frequencies as being on the same scale even though they might not be. That is, if we combine two databases of frequency norms, one counted over a corpus of 10M words and the other over a corpus of 1M words, we will overestimate the frequencies from the 10M-word database by a factor of 10.
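To make the scale problem concrete, here is a small illustration with made-up numbers (the factor-of-10 figure is the one mentioned above; rescaling by corpus size is shown purely as an illustration, not as the chosen fix):

```python
# A word occurring once per 1,000 words has very different raw counts
# depending on the size of the corpus it was counted on.
count_in_10m_corpus = 10_000  # counted over 10M words
count_in_1m_corpus = 1_000    # counted over 1M words

# Treating the raw counts as directly comparable inflates the 10M-corpus
# estimate by a factor of 10 ...
print(count_in_10m_corpus / count_in_1m_corpus)  # 10.0

# ... whereas rescaling each count by its (known) corpus size in millions
# puts both on a per-million-words scale, and the estimates agree again.
print(count_in_10m_corpus / 10)  # 1000.0 per million words
print(count_in_1m_corpus / 1)    # 1000.0 per million words
```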
Some solutions:
The first option is straightforward; the second is a bit more difficult, especially because we might not know this for every source.