Create a Runmode that Splits Sentences Instead of Words

dvfeinblum / lexicount

A (soon-to-be) nlp tool for seeing how obnoxious a writer you are.

MIT License

Create a Runmode that Splits Sentences Instead of Words #12

Open dvfeinblum opened 6 years ago

dvfeinblum commented 6 years ago

Is your feature request related to a problem? Please describe.

The meta-purpose of this project is to learn some NLP. Word2Vec is a really nice low-barrier-to-entry way of doing that, and vectors for sentences would be a nice place to start.

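Describe the solution you'd like

Currently, the blog parser sanitizes posts by removing punctuation and then tokenizing the words in the post with NLTK. We should do something similar but, instead of splitting on spaces, split on periods. That means the split has to happen before the punctuation is stripped, since otherwise the periods are already gone.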
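A minimal sketch of what that runmode's splitter might look like, using only the standard library. The function name `split_sentences` is hypothetical, and keeping apostrophes and hyphens is an assumption so that words like "here's" and "self-aggrandizing" survive sanitizing. It splits on periods only, as described above, so "!" and "?" do not end a sentence here.

```python
import string

# Punctuation stripped inside each sentence. Apostrophes and hyphens are
# kept (an assumption, not the parser's confirmed behaviour).
_STRIP = string.punctuation.replace("'", "").replace("-", "")
_STRIP_TABLE = str.maketrans("", "", _STRIP)


def split_sentences(post_text):
    """Split a post on periods, then sanitize each piece.

    The split happens first because stripping punctuation first would
    remove the periods we want to split on.
    """
    sentences = []
    for raw in post_text.split("."):
        cleaned = raw.translate(_STRIP_TABLE)   # drop remaining punctuation
        cleaned = " ".join(cleaned.split())     # collapse whitespace
        if cleaned:                             # skip empty fragments
            sentences.append(cleaned)
    return sentences


sentences = split_sentences("Here's a self-aggrandizing sentence. And, another one!")
```

A naive period split will trip over abbreviations and decimals ("e.g.", "3.14"). NLTK's `sent_tokenize` handles many of those cases and may be a better fit, at the cost of needing its tokenizer data downloaded.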

Describe alternatives you've considered

N/A

Additional context

N/A

dvfeinblum commented 6 years ago

Currently, I'm trying to decide where and how to store sentences. One option is to just add them to Postgres, in a table something like:

sentence	url	word_count	vector
here's a self-aggrandizing sentence	blag.web/post-1	4	(0.123,0.2431,0.234232,...)
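A rough sketch of that table, assuming the four columns above. It uses the standard-library `sqlite3` module as a stand-in so it runs without a database server, and it stores the vector as JSON text. Both are assumptions for illustration. In Postgres, an array column such as `double precision[]` would be a more natural home for the vector. The table name `sentences` and the helper `store_sentence` are hypothetical.

```python
import json
import sqlite3

# Column names follow the table in the comment above; the types are guesses.
CREATE_SENTENCES = """
CREATE TABLE IF NOT EXISTS sentences (
    sentence   TEXT    NOT NULL,
    url        TEXT    NOT NULL,
    word_count INTEGER NOT NULL,
    vector     TEXT    NOT NULL  -- JSON-encoded floats in this sketch
)
"""


def store_sentence(conn, sentence, url, vector):
    """Insert one sentence row, deriving word_count from whitespace splits."""
    conn.execute(
        "INSERT INTO sentences (sentence, url, word_count, vector) "
        "VALUES (?, ?, ?, ?)",
        (sentence, url, len(sentence.split()), json.dumps(vector)),
    )


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SENTENCES)

# Only the three vector components shown in the issue; the real vector is longer.
store_sentence(
    conn,
    "here's a self-aggrandizing sentence",
    "blag.web/post-1",
    [0.123, 0.2431, 0.234232],
)

row = conn.execute(
    "SELECT sentence, url, word_count, vector FROM sentences"
).fetchone()
```

Deriving `word_count` at insert time keeps it consistent with the stored sentence, though it could equally be computed on read.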

dvfeinblum commented 6 years ago

Oooh; also, one nice thing about the word tokenizer I already wrote is that we can throw out words that don't mean much. Glancing at the word_details table, we can probably toss words with the following part_of_speech:

DT
TO
CC
PRP
IN
PRP$
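Those are Penn Treebank tags: determiners (DT), "to" (TO), coordinating conjunctions (CC), personal and possessive pronouns (PRP, PRP$), and prepositions or subordinating conjunctions (IN). A minimal sketch of the filter, assuming the input is a list of `(word, tag)` pairs like the ones `nltk.pos_tag` returns. The names `LOW_SIGNAL_TAGS` and `drop_low_signal` are hypothetical.

```python
# Tags listed above as candidates to throw out.
LOW_SIGNAL_TAGS = frozenset({"DT", "TO", "CC", "PRP", "IN", "PRP$"})


def drop_low_signal(tagged_words, skip_tags=LOW_SIGNAL_TAGS):
    """Keep only the words whose part-of-speech tag is not in skip_tags."""
    return [word for word, tag in tagged_words if tag not in skip_tags]


# Hand-tagged example so the sketch runs without NLTK data installed.
tagged = [
    ("she", "PRP"), ("walked", "VBD"), ("to", "TO"), ("the", "DT"),
    ("store", "NN"), ("and", "CC"), ("her", "PRP$"), ("car", "NN"),
]
kept = drop_low_signal(tagged)
```

Passing the tag set as a parameter makes it easy to experiment with which parts of speech to drop before building sentence vectors.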

dvfeinblum commented 6 years ago

Leaving this issue open because the last comment still needs to be implemented!

dvfeinblum commented 6 years ago

Oh well hey now:

https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/examples/tutorials/word2vec/word2vec_basic.py