earlyprint / earlyprint.github.io

Homepage for the EarlyPrint Project: Curating and Exploring Early Printed English
https://earlyprint.org/

Feature (word and attribute) based count statistics. #13

Open pibburns opened 5 years ago

pibburns commented 5 years ago

In WordHoard and Monk we offered a number of count-based statistics and displays. Monk is no longer available for perusal, but the WordHoard documentation discusses some of the approaches we implemented.

http://wordhoard.northwestern.edu/

See especially the methods referenced from the Introduction to Analysis Methods section of the WordHoard documentation.

http://wordhoard.northwestern.edu/userman/analysis-intro.html

The methods listed are still useful, but some newer methods would also be useful.

craigberry commented 5 years ago

The sources to WordHoard are available here:

https://github.com/craigberry/wordhoard


pibburns commented 5 years ago

> The methods listed are still useful, but some newer methods would also be useful.

In the years since we worked on WordHoard, a number of researchers have pointed out that one method we used for comparing counts between works or groups of works -- popularly called Dunning's log-likelihood -- suffers from various defects. A good summary up to 2015 appears in a paper by Jefrey Lijffijt et al. entitled "Significance Testing of Word Frequencies in Corpora" (https://users.ics.aalto.fi/lijffijt/articles/lijffijt2015a.pdf). They suggest using a bootstrap procedure, the venerable Wilcoxon rank-sum test (AKA Mann-Whitney U test), and Welch's t-test.
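As a rough illustration, here is a minimal sketch of the two off-the-shelf alternatives, applied to a word's per-text relative frequencies (the numbers are made up; this is not WordHoard or EarlyPrint code):

```python
# Compare one word's per-text relative frequencies between two subcorpora,
# testing at the text level rather than pooling counts across each corpus.
from scipy import stats

freqs_a = [0.0012, 0.0007, 0.0000, 0.0021, 0.0009]  # word frequency per text, subcorpus A
freqs_b = [0.0003, 0.0000, 0.0004, 0.0001, 0.0002]  # word frequency per text, subcorpus B

# Wilcoxon rank-sum (Mann-Whitney U): rank-based, so a single text with an
# extreme count cannot dominate the result.
u_stat, p_u = stats.mannwhitneyu(freqs_a, freqs_b, alternative="two-sided")

# Welch's t-test: compares means without assuming equal variances.
t_stat, p_t = stats.ttest_ind(freqs_a, freqs_b, equal_var=False)

print(f"Mann-Whitney U: p = {p_u:.4f}; Welch's t: p = {p_t:.4f}")
```

The bootstrap they describe would instead resample texts with replacement and recompute the frequency difference; the two tests above are the easiest to try off the shelf.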

martinmueller39 commented 5 years ago

I’ve certainly observed the “anti-conservative” bias of the log likelihood ratio, and it’s useful to take a “their geese are swans” attitude towards the results. That said, the profiling of an ‘analysis corpus’ against a ‘reference corpus’ typically yields plausible results and squares with your intuitions if you know the data.

One of the examples in the essay is slightly misleading. It’s about the name ‘Mathilda’ in a comparison of a relatively small corpus (hundreds of novels) by men and women. In that case, one novel had some 400 occurrences of the name, which distorted the analysis. It’s interesting to learn that an older statistical method may give better results.

Perhaps we should learn from Nate Silver and average different ‘polls’. Given hypothesis A we should use procedures x, y, and z. If they give wildly different results, there probably isn’t a story to be told in the first place. If they point in the same direction, the average may be the best guide.
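To make that concrete, the "consensus of polls" idea might look like this toy sketch (a hypothetical helper built on the tests mentioned earlier in the thread; the alpha threshold is arbitrary):

```python
# Toy sketch: trust a word as distinctive only when independent
# procedures agree on direction and significance.
from statistics import mean
from scipy import stats

def consensus_direction(freqs_a, freqs_b, alpha=0.05):
    """Report which subcorpus favors the word, or None if the 'polls' disagree."""
    _, p_u = stats.mannwhitneyu(freqs_a, freqs_b, alternative="two-sided")
    _, p_t = stats.ttest_ind(freqs_a, freqs_b, equal_var=False)
    if p_u < alpha and p_t < alpha:
        return "A" if mean(freqs_a) > mean(freqs_b) else "B"
    return None  # the tests disagree: probably no story to tell
```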


pibburns commented 5 years ago

Fisher's G (Dunning's log-likelihood) tests for independence of word identity and (sub)corpus identity. A basic assumption is that each use of a word is independent of the others. This assumption frequently doesn't hold.
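For concreteness, a minimal sketch of the statistic in its common two-cell corpus-linguistics form (the example counts are invented):

```python
# Dunning-style log-likelihood (G2) for one word across two (sub)corpora.
# The independence assumption is baked in: every occurrence is treated as
# a separate draw, which is exactly what fails for bursty words such as
# proper names.
import math

def log_likelihood_g2(count_a, size_a, count_b, size_b):
    """counts = occurrences of the word; sizes = total tokens per corpus."""
    expected_a = size_a * (count_a + count_b) / (size_a + size_b)
    expected_b = size_b * (count_a + count_b) / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # treat 0 * log(0) as 0
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

# e.g. a name occurring 400 times in one 1M-token corpus vs. 40 in another
print(log_likelihood_g2(400, 1_000_000, 40, 1_000_000))  # very large G2
```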

> Perhaps we should learn from Nate Silver and average different ‘polls’. Given hypothesis A we should use procedures x, y, and z. If they give wildly different results, there probably isn’t a story to be told in the first place. If they point in the same direction, the average may be the best guide.

It never hurts to try different approaches. An "outlier" result may point to something interesting.

Some years back Ted Underwood tried combining the G test with the Wilcoxon test. As I recall, the outcome was that just using the Wilcoxon test worked as well as combining the measures for his purposes.

All this is something to keep in mind while we tackle the more mundane aspects of the EarlyPrint project.


jrladd commented 5 years ago

We should also put TF-IDF in the mix here, as it gets at some of the same "relative uniqueness" issues as some of these other measures. It also has the advantage of being the method we already use as part of the Disco Engine.

We may eventually want to spin some of these out into their own user-facing products, but for now we can run a few experiments on different kinds of methods, even creating blog posts or Jupyter notebooks that show how to do some of these things and what the advantages of different measures are. As I mentioned to Martin already, I'd be happy to try some of these out and create a few prospective visualizations of various subcorpora.

pibburns commented 5 years ago

> We should also put TF-IDF in the mix here, as it gets at some of the same "relative uniqueness" issues as some of these other measures. It also has the advantage of being the method we already use as part of the Disco Engine.

TF-IDF is a different sort of animal from Fisher's G or Wilcoxon's test. TF-IDF is a weighting scheme. Fisher's G and Wilcoxon's test are statistics with distributional properties that can be used for hypothesis testing.
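To illustrate the distinction, a short sketch using scikit-learn's TfidfVectorizer (illustrative only; not necessarily how the Disco Engine computes its weights):

```python
# TF-IDF yields a weight per (word, document) pair -- a ranking device,
# not a test statistic with a sampling distribution.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "loue is not loue which alters when it alteration findes",
    "shall i compare thee to a summers day",
    "when in disgrace with fortune and mens eyes",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)  # documents x vocabulary matrix

# Highest-weighted words in the first document: frequent there, rare
# elsewhere. No p-value is produced or implied.
vocab = vectorizer.get_feature_names_out()
row = weights.toarray()[0]
print(sorted(zip(vocab, row), key=lambda pair: -pair[1])[:5])
```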