Closed MastafaF closed 3 years ago
The two metrics give you a sense of how common a word is in the corpus, as well as a sense of a term's document frequency (df), which is roughly proportional to the variance of a term in a corpus. The 25k denominator was used in https://archive.nytimes.com/www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html, and seems to give reasonable values for most term usage rates, "reasonable" being roughly 2 to a few hundred. If you're really motivated, you could do the math on this and figure out the optimal denominator for interpretable statistics about the top K words in a Zipfian distribution.
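As a rough sanity check of that intuition, here is a sketch of the expected count per 25,000 tokens for a word of rank r under a Zipfian distribution. The exponent, vocabulary size, and function name are illustrative assumptions, not Scattertext's code:

```python
def zipf_rate_per_25k(rank, s=1.0, vocab_size=50_000):
    """Expected count per 25,000 tokens of the rank-`rank` word under a
    Zipf law with exponent `s` over a vocabulary of `vocab_size` words
    (both parameters are assumptions for illustration)."""
    norm = sum(r ** -s for r in range(1, vocab_size + 1))
    return 25_000 * rank ** -s / norm

# With s = 1 and a 50k vocabulary, ranks ~10 through ~1,000 land in
# roughly the 2-to-a-few-hundred range mentioned above.
for rank in (10, 100, 1_000):
    print(rank, round(zipf_rate_per_25k(rank), 1))
```

So with a 25k denominator, the bulk of the frequently-plotted words get usage rates that read naturally as small whole numbers.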
The set of terms used to count the number of terms in a corpus only covers the words currently represented in Scattertext. You can look up cat25, for example, in the code and see the derivation there.
Thanks for the link @JasonKessler. I will dig into this a bit more to see how they came up with 25K; it might simply be an empirical value that happened to fit their data well...
I did look at the source code and would like you to validate the approach. Could you explicitly derive an equation in this thread similar to the ones I wrote above? For example, given a Topic/Term of interest:

n1k = (number_mentions / doc_count_in_Target_category) * 1000
Something similar would be helpful and would surely spare future users from having to go through each line of the source code. I gave an example above with actual values displayed in Scattertext; once the equation is derived, it would be great to validate it against that example.
I have limited time to spend on this.
I'd encourage you to read my full comment above. Given that word distributions are broadly similar (i.e., power laws with exponents ~2-3), you'd expect high-frequency terms to have similar usage percentages.
If you think there's something off in the formula, please provide a program and a dataset I can run which generates a value for a particular topic that differs from the specific value you'd expect.
The topic formulation should be the one you provided.
Hi,
Could you please clarify the aim of the statistics displayed on the visual? Namely the following ones:
I am aware that one is focusing on the terms and the other on the documents. I assume they should be complementary to each other. However, I have a hard time understanding the first one, as I cannot reproduce it mathematically with different examples.
For example, let's consider the following:
Reference Category: document count 2,684; word count 8,361
Target Category: document count 1,377; word count 4,582
We use the bigram approach and for the Topic linked to the bigram "word1 word2", we have for example:
Target Category frequency: 1200 per 25,000 terms; 4 per 1,000 docs
Some of the 6 mentions:
To get n1k, which, in my mind, should refer to "the expected number of documents referring to the Topic in a set of 1,000 documents from the Target Category", I compute:

n1k = (number_mentions / doc_count_in_Target_category) * 1000
Now, when it comes to n25K, I cannot come up with a similar equation or explanation. I think what would make sense is something like "the expected number of terms referring to the Topic in a set of 25K terms from the Target Category". In that case, the equation should be:

n25k = (number_terms_related_to_Topic / number_terms_in_Target_Category) * 25000

But that does not seem to be the equation used. I am also not sure why 25K was chosen as the number of terms... Any insight here would be appreciated.
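To make the discrepancy concrete, here is a quick numerical check of the two hypothesized formulas against the example values above (6 mentions, 1,377 Target Category documents, 4,582 Target Category words); the variable names are mine, not Scattertext's:

```python
# Example values read off the Scattertext visual above.
mentions = 6
target_docs = 1_377
target_words = 4_582

# Hypothesized per-1,000-docs rate.
n1k = mentions / target_docs * 1_000
# Hypothesized per-25,000-terms rate.
n25k = mentions / target_words * 25_000

print(round(n1k))   # -> 4, matching the displayed "4 per 1,000 docs"
print(round(n25k))  # -> 33, far from the displayed "1200 per 25,000 terms"
```

The first formula reproduces the displayed value; the second does not, which is exactly the gap I am asking about.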