Closed MarissaSkud closed 5 years ago
Idea for feature:
Additionally, if furniture polish is not in the bigram dict AND furniture-polish is not in the word set, it might also make sense to check whether the individual components, furniture and polish, are in the word set.
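A minimal sketch of that component fallback, assuming the word set is a plain Python set of lower-cased strings. The function name `component_status` and the sample data are hypothetical, not taken from the project's code. It is meant to be called only after both the bigram lookup and the hyphenated lookup have failed.

```python
def component_status(first, second, word_set):
    """Report, for each half of a pair, whether it is in the word set.

    Returns a dict mapping each component word to True or False.
    """
    return {word: word in word_set for word in (first, second)}


# Hypothetical sample data, for illustration only.
sample_word_set = {"furniture", "polish", "dinner"}

furniture_result = component_status("furniture", "polish", sample_word_set)
dinner_result = component_status("dinner", "parties", sample_word_set)
```

With this sample data, both halves of furniture polish are found, while dinner parties reports that only "dinner" is in the set.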
The above is mostly implemented now, except that it still needs to account for the fact that all words in the word set have been converted to lower case. (Right now, if you type in "Dinner parties," it will tell you that "Dinner" is not in the 1810s word set... because it isn't, but "dinner" is.) This can probably be solved with a simple use of .lower().
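Roughly what the .lower() fix could look like, assuming the word set is a Python set that already holds only lower-cased entries. The helper name `in_word_set` and the sample set are made up for illustration.

```python
def in_word_set(word, word_set):
    """Case-insensitive membership test against a lower-cased word set."""
    return word.lower() in word_set


# Hypothetical sample standing in for the 1810s word set.
sample_1810s_words = {"dinner", "dinner-parties"}

capitalized_hit = in_word_set("Dinner", sample_1810s_words)
exact_match_only = "Dinner" in sample_1810s_words
```

Here `capitalized_hit` is True, while the plain membership test in `exact_match_only` is False, which is the behavior described above.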
Added .lower() to resolve the aforementioned problem... now I'm thinking twice about the fact that the bigram dictionary is case-sensitive but the word set isn't, but that would be a new issue if I decide to tackle it. Marking this as resolved for now.
English writing from 1800-1923 contains a lot more hyphenated compounds than is currently standard for written English. For instance, furniture-polish appears in the 1900s word set but furniture polish is not in the 1900s bigram dict; dinner-parties is in the 1810s word set but dinner parties is not in the 1810s bigram dict. Could use a feature that checks BOTH the word set and the bigram dict, so it can tell you when a bigram is missing as two words BUT exists as a HYPHENATED word.
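One way the combined lookup could work, as a sketch only. It assumes the bigram dict is keyed by (first, second) tuples and the word set is a set of strings; the real data structures may differ. The function name `find_pair` and the sample data are hypothetical.

```python
def find_pair(first, second, bigram_dict, word_set):
    """Say how a two-word pair is attested for a decade.

    Returns "bigram" if the pair is in the bigram dict, "hyphenated" if
    the hyphen-joined form is in the word set, and None otherwise.
    """
    # Assumption: bigram_dict keys are (first, second) tuples.
    if (first, second) in bigram_dict:
        return "bigram"
    # Fall back to the hyphenated compound in the single-word set.
    if f"{first}-{second}" in word_set:
        return "hyphenated"
    return None


# Hypothetical sample data, for illustration only.
sample_bigrams = {("dinner", "table"): 12}
sample_words = {"furniture-polish", "furniture", "polish"}

polish_result = find_pair("furniture", "polish", sample_bigrams, sample_words)
table_result = find_pair("dinner", "table", sample_bigrams, sample_words)
```

Checking the bigram dict first means a pair attested both ways is reported as a plain bigram; swapping the order would favor the hyphenated form instead.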