MarissaSkud / Wordsworth

A web app (wordsworth.us) to identify anachronistic words & phrases in historical fiction by comparing it to fiction written during that era. Hackbright Fellowship final project.
MIT License

New feature idea: BIGRAM or HYPHENATED? #10

Closed MarissaSkud closed 5 years ago

MarissaSkud commented 5 years ago

English writing from 1800-1923 contains far more hyphenated compounds than present-day written English does. For instance, furniture-polish appears in the 1900s word set but furniture polish is not in the 1900s bigram dict; dinner-parties is in the 1810s word set but dinner parties is not in the 1810s bigram dict. A useful feature would look at BOTH the word set and the bigram dict, so that when a bigram is missing from the dict the app can tell you that it does exist, BUT HYPHENATED, as a word.
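
A rough sketch of what that lookup might look like, not the app's actual code. It assumes the word set is a Python set of strings and the bigram dict is keyed by space-joined strings; the function name `check_bigram` and the toy data are made up for illustration.

```python
def check_bigram(first, second, word_set, bigram_dict):
    """Classify a two-word phrase against one decade's data.

    Assumes bigram_dict keys look like "furniture polish" and
    word_set holds single tokens, including hyphenated ones.
    """
    if f"{first} {second}" in bigram_dict:
        return "bigram"
    if f"{first}-{second}" in word_set:
        return "hyphenated"
    return "not found"


# Toy stand-ins for the real 1900s data (hypothetical contents).
word_set_1900s = {"furniture", "polish", "furniture-polish"}
bigram_dict_1900s = {"dinner table": 12}

print(check_bigram("furniture", "polish", word_set_1900s, bigram_dict_1900s))
# prints "hyphenated"
```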

MarissaSkud commented 5 years ago

Idea for feature:

Additionally, if furniture polish were not in the bigram dict AND furniture-polish were not in the word set, it might also make sense to check whether the individual components, furniture and polish, are each in the word set.
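
One possible shape for that fallback, as a sketch only. It assumes a set of strings for the word set and a dict keyed by space-joined strings for the bigram dict; `explain_bigram` and its return messages are hypothetical, not names from the app.

```python
def explain_bigram(first, second, word_set, bigram_dict):
    """Try the bigram, then the hyphenated word, then each component."""
    if f"{first} {second}" in bigram_dict:
        return "found as bigram"
    if f"{first}-{second}" in word_set:
        return "found hyphenated"
    # Neither form exists, so report on the individual words instead.
    missing = [word for word in (first, second) if word not in word_set]
    if not missing:
        return "both words found separately"
    return "not found: " + ", ".join(missing)


# Toy data: both words exist, but neither combined form does.
print(explain_bigram("furniture", "polish", {"furniture", "polish"}, {}))
# prints "both words found separately"
print(explain_bigram("furniture", "polish", {"furniture"}, {}))
# prints "not found: polish"
```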

MarissaSkud commented 5 years ago

The above is mostly implemented now, but it still needs to account for the fact that all words in the word set have been converted to lower case. (Right now, if you type in "Dinner parties," it will tell you that "Dinner" is not in the 1810s word set. That is technically true, but "dinner" is.) This can probably be solved with a simple use of .lower().
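
To illustrate the mismatch and the likely fix, here is a minimal sketch. It assumes the word set is a plain Python set of lower-cased strings; `in_word_set` and the toy set are made up for the example.

```python
def in_word_set(word, word_set):
    """Look the word up after lower-casing it, to match the stored form."""
    return word.lower() in word_set


# Toy stand-in for the 1810s word set (hypothetical contents).
word_set_1810s = {"dinner", "parties", "dinner-parties"}

print("Dinner" in word_set_1810s)             # prints False: raw lookup misses
print(in_word_set("Dinner", word_set_1810s))  # prints True
```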

MarissaSkud commented 5 years ago

Added .lower() to resolve the aforementioned problem. Now I'm thinking twice about the fact that the bigram dictionary is case-sensitive but the word set isn't, but that would be a new issue if I decide to tackle it. Marking this as resolved for now.
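
If that does become a new issue, one way to bring the two structures in line would be to fold the bigram dict's keys to lower case. This is only a sketch and rests on an assumption the thread doesn't confirm: that the dict maps space-joined bigram strings to integer counts. `lowercase_bigram_counts` is a hypothetical helper name.

```python
from collections import Counter


def lowercase_bigram_counts(bigram_dict):
    """Merge keys that differ only by case, summing their counts."""
    merged = Counter()
    for bigram, count in bigram_dict.items():
        merged[bigram.lower()] += count
    return dict(merged)


# Toy data (hypothetical counts).
raw = {"Dinner parties": 2, "dinner parties": 3, "furniture polish": 1}
print(lowercase_bigram_counts(raw))
# prints {'dinner parties': 5, 'furniture polish': 1}
```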