Open dvfeinblum opened 6 years ago
Currently, I'm trying to decide where and how to store sentences. One option is to just add them to postgres. Something like
sentence | url | word_count | vector |
---|---|---|---|
here's a self-aggrandizing sentence | blag.web/post-1 | 4 | (0.123,0.2431,0.234232,...) |
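
A rough sketch of how that table and its rows could be set up, in case it helps the discussion. Everything named here is an assumption rather than existing project code: the `sentences` table name, the `DOUBLE PRECISION[]` column type for the vector, the `build_row` helper, and the `%s` placeholders (psycopg2 style, assuming that driver is what talks to postgres).

```python
from typing import Sequence, Tuple, List

# Hypothetical DDL mirroring the columns in the table above.
# DOUBLE PRECISION[] is one way to store a vector in postgres; it is an
# assumption, not a decision made in this issue.
CREATE_SENTENCES_TABLE = """
CREATE TABLE IF NOT EXISTS sentences (
    sentence   TEXT NOT NULL,
    url        TEXT NOT NULL,
    word_count INTEGER NOT NULL,
    vector     DOUBLE PRECISION[] NOT NULL
);
"""

# Parameterized insert using psycopg2-style placeholders (assumed driver).
INSERT_SENTENCE = (
    "INSERT INTO sentences (sentence, url, word_count, vector) "
    "VALUES (%s, %s, %s, %s);"
)


def build_row(
    sentence: str, url: str, vector: Sequence[float]
) -> Tuple[str, str, int, List[float]]:
    """Build the parameter tuple for INSERT_SENTENCE.

    word_count is the number of whitespace-separated tokens in the sentence.
    """
    word_count = len(sentence.split())
    return (sentence, url, word_count, list(vector))
```

With a live connection, something like `cursor.execute(INSERT_SENTENCE, build_row(...))` would write one row; that part is left out because it needs a running database.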
Oooh; also, one nice thing about the word tokenizer I already wrote is that we can throw out words that don't mean much. Glancing at the `word_details` table, we can probably toss words with the following `part_of_speech` values:
Leaving this issue open because the last comment still needs to be implemented!
**Is your feature request related to a problem? Please describe.**
The meta-purpose of this project is to learn some NLP. Word2Vec is a really nice low-barrier-to-entry way of doing that, and vectors for sentences would be a nice place to start.

**Describe the solution you'd like**
Currently, the blog parser sanitizes posts by removing punctuation and then running the words in the post through NLTK. We should do something similar for sentences but, instead of splitting on spaces, we should split on periods (before the punctuation is stripped, so the periods are still there to split on).
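
A minimal sketch of the period-splitting idea, using only the standard library. The `split_sentences` name is hypothetical and this is not the existing blog parser; it just shows the order of operations (split on periods first, then strip the remaining punctuation from each piece).

```python
import re
import string
from typing import List

# One or more periods (so "..." counts as a single break) plus trailing space.
_SENTENCE_END = re.compile(r"\.+\s*")
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def split_sentences(post: str) -> List[str]:
    """Split a post into sentences on periods, then strip punctuation."""
    sentences = []
    for chunk in _SENTENCE_END.split(post):
        cleaned = chunk.translate(_PUNCT_TABLE)
        # Collapse runs of whitespace left behind by removed characters.
        cleaned = " ".join(cleaned.split())
        if cleaned:  # skip empty pieces, e.g. after a trailing period
            sentences.append(cleaned)
    return sentences
```

Splitting on bare periods will also break on abbreviations like "e.g." and on decimals, so NLTK's `sent_tokenize` may be worth trying if that turns out to matter. Note too that stripping all punctuation turns "here's" into "heres", which may or may not be what we want.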

**Describe alternatives you've considered**
N/A

**Additional context**
N/A