JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Clarify statistics in the visual #81

Closed · MastafaF closed this issue 3 years ago

MastafaF commented 3 years ago

Hi,

Could you please clarify the aim of the statistics displayed in the visualization, namely the "per 25,000 terms" and "per 1,000 docs" frequencies shown for each term?

I am aware one is focused on terms and the other on documents, and I guess they should be complementary to each other. However, I have a hard time understanding the first one, as I cannot reproduce it mathematically with different examples.

For example, let's consider the following:

Reference Category: document count 2,684; word count 8,361
Target Category: document count 1,377; word count 4,582

We use the bigram approach, and for the Topic linked to the bigram "word1 word2", we have, for example:

Target Category frequency: 1,200 per 25,000 terms; 4 per 1,000 docs. Some of the 6 mentions:

To get n1k, which, in my mind, should refer to "the expected number of documents referring to the Topic in a set of 1,000 documents, given the Target Category", I compute: (number_mentions / doc_count_in_Target_Category) * 1000
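In code, using the example values above (and assuming the 6 mentions are the count that feeds the statistic):

```python
# Check of the n1k interpretation with the example values above:
# 6 mentions of "word1 word2" across 1,377 Target Category documents.
number_mentions = 6
doc_count_in_Target_Category = 1377
n1k = number_mentions / doc_count_in_Target_Category * 1000
print(round(n1k, 2))  # -> 4.36, consistent with the displayed "4 per 1,000 docs"
```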

Now, when it comes to n25K, I cannot come up with a similar equation or explanation. What I think would make sense is something like "the expected number of terms referring to the Topic in a set of 25K terms, given the Target Category". In that case, the equation would be: (number_terms_related_to_Topic / number_terms_in_Target_Category) * 25000

But that does not seem to be the equation, as the quick check below shows. I am also not sure why 25K was chosen as the number of terms... Any insight here would be appreciated.
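Here is the candidate equation evaluated with the example values above, showing the mismatch:

```python
# Check of the candidate n25k equation with the example values above:
# 6 mentions of "word1 word2" out of 4,582 Target Category words.
number_terms_related_to_Topic = 6
number_terms_in_Target_Category = 4582
n25k = number_terms_related_to_Topic / number_terms_in_Target_Category * 25000
print(round(n25k, 1))  # -> 32.7, nowhere near the displayed 1,200 per 25,000 terms
```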

JasonKessler commented 3 years ago

The two metrics give you a sense of how common the word is in the corpus, as well as a sense of a term's document frequency, which is somewhat proportional to the variance of a term in a corpus. 25k was used in https://archive.nytimes.com/www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html, and it seems to give reasonable values for most term usage rates, "reasonable" being roughly 2 to a few hundred. If you're really motivated, you could do the math on this and figure out the optimal denominator for interpretable statistics about the top K words in a Zipfian distribution.
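As a rough illustration (a sketch only; the exponent and vocabulary size are illustrative assumptions, not values taken from Scattertext), here is what per-25k rates look like under the classic rank-frequency form of Zipf's law:

```python
import numpy as np

# Sketch: under Zipf's law (term frequency proportional to 1/rank),
# quote usage rates per 25,000 tokens for terms at various ranks.
# The exponent (1.0) and vocabulary size are illustrative assumptions.
vocab_size = 50_000
ranks = np.arange(1, vocab_size + 1)
probs = 1.0 / ranks
probs /= probs.sum()
per_25k = probs * 25_000
for r in (10, 100, 1000):
    print(f"rank {r}: ~{per_25k[r - 1]:.1f} per 25,000 tokens")
# Prints roughly 219, 22, and 2.2: the "~2 to a few hundred" range
# mentioned above for most of the terms a plot would display.
```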

The term count for a corpus only covers the words currently represented in Scattertext. You can look up `cat25`, for example, in the code and see the derivation there.
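Roughly, the per-25k computation can be reconstructed from a corpus object like this (a sketch on the sample corpus that ships with Scattertext; see the library source for the exact filtering and rounding):

```python
import numpy as np
import scattertext as st

# Sketch of the per-25k statistic on the convention-speeches sample
# corpus.  The denominator is the total count of the terms Scattertext
# keeps, not the raw corpus word count.
convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()

term_freq_df = corpus.get_term_freq_df()  # columns like 'democrat freq'
dem_freqs = term_freq_df['democrat freq']
cat25k = np.round(dem_freqs * 25000.0 / dem_freqs.sum()).astype(int)
print(cat25k.sort_values(ascending=False).head())
```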

MastafaF commented 3 years ago

Thanks for the link @JasonKessler. I will dig into this a bit more to see how they came up with 25K; it might simply be an empirical value that happened to fit their data well...

I did look at the source code, and I would like you to validate the approach. Could you explicitly derive an equation in this thread, similar to the ones I wrote above? For example, given a Topic/Term of interest: n1k = (number_mentions / doc_count_in_Target_Category) * 1000

Something similar would be helpful and would save future users from having to go through the source code line by line. I gave an example above with actual values displayed by Scattertext; once the equation is derived, it would be great to validate it against that example.

JasonKessler commented 3 years ago

I have limited time to spend on this.

I'd encourage you to read my full comment above. Given that word distributions are broadly similar (i.e., power laws with exponents of roughly 2-3), you'd expect high-frequency terms to have similar usage percentages.

If you think there's something off in the formula, please provide a program and a dataset I can run which, for a particular topic, generates a value different from the one you'd expect.

The topic formulation should be the one you provided.
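Something along these lines would be a fine starting point (a sketch with a made-up four-document dataset; every document and term in it is a placeholder):

```python
import pandas as pd
import scattertext as st

# Minimal, hand-checkable reproduction template: a tiny two-category
# corpus whose counts can be verified manually, so a displayed statistic
# can be compared against a hand computation.
df = pd.DataFrame([
    {'category': 'target', 'text': 'word1 word2 some filler text'},
    {'category': 'target', 'text': 'word1 word2 other filler tokens'},
    {'category': 'reference', 'text': 'completely unrelated reference text'},
    {'category': 'reference', 'text': 'more unrelated reference words'},
])
corpus = st.CorpusFromPandas(df, category_col='category', text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()

term_freq_df = corpus.get_term_freq_df()
target_freqs = term_freq_df['target freq']
# Manual per-25k value for the bigram 'word1 word2', using the retained
# term counts as the denominator, to compare against the explorer display.
print(target_freqs['word1 word2'] * 25000.0 / target_freqs.sum())
```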