deep-philology / DeepVocabulary

vocabulary server (mostly for Perseus but also standalone)
https://gu658.us1.eldarioncloud.com
MIT License

reread Baayen to see if there's something better we can do than frequency per 10k #75

Open jtauber opened 6 years ago

jtauber commented 6 years ago

While frequency per 10k is a much better measure than raw counts, it's still not completely independent of corpus size, as Harald Baayen has argued extensively.

I need to reread his book Word Frequency Distributions to better understand the alternatives he proposes.

gregorycrane commented 6 years ago

I downloaded Baayen and started to look at it. We clearly need to digest this -- the frequency per 10k model is something I came up with on the fly in c. 1990 (I fiddled a bit and settled on 10k as a good denominator without knowing anything about the general field). The question is not just the Platonic ideal but what we can communicate to an interested audience. I am not talking about people who resist quantitative thinking but about people who want to use quantitative thinking but who must quickly see what they are getting and why. There is an awful lot of "how" in this book but not a lot of obvious "why" or "so what?"

gregorycrane commented 6 years ago

My primary use case is "which words in this text stand out? which are unusually common? or uncommon?" The vocab size case is different -- though very important. It's time to do some market research, though, I think.

jtauber commented 6 years ago

I think the issue is that it would be misleading to say work A uses a word more often than work B based on frequency alone if the sizes of work A and work B differ substantially.

To give a particularly extreme example: if a word appears once in a 1,000-word work, its frequency per 10k will be 10, but if that word appears once in a 100,000-word work, its frequency per 10k will be 0.1. While it's technically correct to say the word is much more frequent in the first work than in the second, drawing any conclusions from that is highly dangerous.
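
To make the arithmetic concrete, here is a minimal sketch using the hypothetical counts from the example above (the function name is just for illustration):

```python
def freq_per_10k(count, total_tokens):
    """Occurrences of a word per 10,000 tokens of a work."""
    return count / total_tokens * 10_000

# one occurrence in a 1,000-token work vs. one in a 100,000-token work
print(freq_per_10k(1, 1_000))    # 10.0
print(freq_per_10k(1, 100_000))  # 0.1
```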

Comparing frequencies is less of a problem (a) the closer in token-size the two works are; (b) the larger the absolute occurrence count is.

Importantly, because the log ratio compares the frequency per 10k in a work versus the frequency per 10k in an entire corpus, the relative token-size of the two things being compared can be quite different.
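
As a rough sketch of that comparison (the exact log base and any smoothing the tool uses aren't specified in this thread, so the details here are assumptions, not the actual implementation):

```python
import math

def freq_per_10k(count, total_tokens):
    return count / total_tokens * 10_000

def log_ratio(count_in_work, work_tokens, count_in_corpus, corpus_tokens):
    """Positive: more frequent in the work than in the corpus overall; negative: less frequent."""
    work_freq = freq_per_10k(count_in_work, work_tokens)
    corpus_freq = freq_per_10k(count_in_corpus, corpus_tokens)
    return math.log2(work_freq / corpus_freq)

# e.g. a word occurring 30 times in a 20,000-token work but only 200 times
# in a 2,000,000-token corpus stands out as unusually common in that work
print(log_ratio(30, 20_000, 200, 2_000_000))  # log2(15 / 1) ≈ 3.9
```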

I've slightly mitigated the issue by (a) not showing log ratio on individual passages, only entire works; (b) even on entire works, not showing the log ratio for words only occurring once.
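
A self-contained sketch of that guard might look like the following (the hapax threshold and the passage/work distinction are from the comment above; the function name and signature are hypothetical):

```python
import math

def guarded_log_ratio(count_in_work, work_tokens, count_in_corpus, corpus_tokens,
                      is_whole_work=True):
    """Skip the log ratio for individual passages and for words occurring only once."""
    if not is_whole_work or count_in_work <= 1:
        return None  # suppress the figure rather than show a potentially misleading one
    work_freq = count_in_work / work_tokens * 10_000
    corpus_freq = count_in_corpus / corpus_tokens * 10_000
    return math.log2(work_freq / corpus_freq)
```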

I was going to take another look at Baayen's book in case there was a better metric we could use and/or a better way of visualising unusually common / unusually uncommon than the log ratio visualisation I currently have.

Of course, I don't want to turn the vocabulary tool or Scaife into a full-blown corpus analysis tool (not yet, anyway) but I'm just looking for low-hanging fruit.