epilys / anatomy-of-melancholy-latex

The 17th-century book "The Anatomy of Melancholy" by Robert Burton typeset with XeLaTeX.
https://epilys.github.io/anatomy-of-melancholy-latex/

Identify uncommon English words for glossary #16

Open epilys opened 3 years ago

epilys commented 3 years ago

Identify words that'd be unfamiliar to a modern English speaker.

Resources

epilys commented 3 years ago

https://stackoverflow.com/questions/59448675/how-to-extract-unusual-unknown-words-in-nlp

epilys commented 3 years ago

Complex Word Identification

Predicting Lexical Complexity in English Texts

epilys commented 3 years ago
  1. Use pandoc to get plain text from the .tex files (but ignore \textlatin and \textgreek content somehow?)

  2. Compare the bag of words from the plain text against some reference corpus

  3. Output the uncommon words for the glossary

  4. Silence \textlatin and \textgreek output with a flag

  5. Silence page styles and headers

  6. Generate a DVI

  7. Use dvi2tty to turn the DVI into plain text
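
Steps 2 and 3 above could be sketched roughly as follows. This is only a minimal illustration, assuming the plain text has already been extracted (by pandoc or dvi2tty) and that a set of common words has been loaded from some frequency list. The names `tokenize`, `uncommon_words`, `common_words` and `min_length` are hypothetical and not part of this repository.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens.

    An apostrophe inside a word (e.g. "that'd") is kept as part of the token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def uncommon_words(text, common_words, min_length=3):
    """Return words from `text` that are absent from `common_words`.

    `common_words` is assumed to be a set of lowercase words taken from a
    reference corpus; how that set is built is left open here.
    Words shorter than `min_length` are skipped to cut down on noise.
    The result is ordered by descending frequency, then alphabetically.
    """
    counts = Counter(tokenize(text))
    rare = {
        word: n
        for word, n in counts.items()
        if word not in common_words and len(word) >= min_length
    }
    return sorted(rare, key=lambda word: (-rare[word], word))


if __name__ == "__main__":
    # Tiny made-up sample; a real run would read the extracted book text.
    sample = "The humours of the spleen breed melancholy and the humours vex the mind."
    common = {"the", "of", "and", "mind", "breed"}
    for word in uncommon_words(sample, common):
        print(word)
```

A plain set lookup keeps the comparison simple. In practice the candidates would still need manual review, since proper nouns, archaic spellings and leftover Latin or Greek would all show up as "uncommon".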

epilys commented 3 years ago

https://en.wikipedia.org/wiki/Gunning_fog_index

epilys commented 3 years ago

https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests

epilys commented 3 years ago

British National Corpus

Freq List