Normalized frequency counts have been added. After some discussion, I chose to normalize to 1M, since we can expect most frequency norms to be based on more than 1M words.
We currently determine the normalized frequency counts by summing the one-smoothed counts in the corpus of choice and dividing that total by 1M; each word's smoothed count is then divided by this normalization term. This gives us the most accurate estimate of the normalization term, since it is exactly the quantity we use when sampling with our built-in samplers.
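A minimal sketch of that normalization, for illustration only; the function and variable names are hypothetical and not the actual API:

```python
def normalize_to_per_million(counts):
    """Convert raw counts to frequencies per million words.

    `counts` maps words to raw frequency counts. Add-one smoothing is
    applied before normalization, mirroring the smoothed counts used
    by the built-in samplers.
    """
    smoothed = {word: count + 1 for word, count in counts.items()}
    # The normalization term: the total smoothed count expressed in millions.
    norm = sum(smoothed.values()) / 1_000_000
    return {word: count / norm for word, count in smoothed.items()}


# Toy counts; by construction the normalized values sum to ~1M.
freqs = normalize_to_per_million({"the": 60_000, "of": 36_000, "zygote": 2})
assert abs(sum(freqs.values()) - 1_000_000) < 1e-3
```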
I ran some tests, and the transformation seems to have the desired effect: for the English and Dutch Celex corpora, the normalized counts sum to a number close to 1M, and the most frequent words in the two corpora end up with very similar frequencies.
Currently we treat frequencies as being on the same scale even though they might not be. That is, if we combine two databases of frequency norms, one counted over a corpus of 10M words and the other over a corpus of 1M words, we will overestimate the frequencies from the 10M-word database by a factor of 10.
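To make the scale problem concrete, here is a small illustration with made-up numbers (the factor-of-10 figure is the one mentioned above; rescaling by corpus size is shown purely as an illustration, not as the chosen fix):

```python
# A word occurring once per 1,000 words has very different raw counts
# depending on the size of the corpus it was counted on.
count_in_10m_corpus = 10_000  # counted over 10M words
count_in_1m_corpus = 1_000    # counted over 1M words

# Treating the raw counts as directly comparable inflates the 10M-corpus
# estimate by a factor of 10 ...
print(count_in_10m_corpus / count_in_1m_corpus)  # 10.0

# ... whereas rescaling each count by its (known) corpus size in millions
# puts both on a per-million-words scale, and the estimates agree again.
print(count_in_10m_corpus / 10)  # 1000.0 per million words
print(count_in_1m_corpus / 1)    # 1000.0 per million words
```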
Some solutions:
The first option is straightforward; the second is a bit more difficult, especially because we might not know this for every source.