BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

6/3/15 Meeting #38

Closed by xih 7 years ago

xih commented 9 years ago

Word2Vec: @geneyoo @peparedes

Instead of using the model trained on the Google News corpus, we are now going to train our own model on LiveJournal (LJ) data. Gene will give me (Dennis) the bag-of-words corpus of posts, and then I will make the corresponding masterDict. The new masterDict consists of a DMat (counts) and an IMat (indexes); we estimate that this new dictionary will still be ~900K words. Once we have the new dictionary, we can make a matrix of the indexes of the words in each sentence x the length of the new dictionary. After that, the process is the same as before.

Summary:

  1. Make 2 new masterDicts from /var/local/destress/text_sents and /var/local/destress/text_sents_ids.
  2. Make a new matrix with m = length of the masterDict and n = # of sentences.
  3. Train on the new matrix to get another trained matrix.
  4. Run the queries the same way as before.
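
Steps 1 and 2 can be sketched on toy data as follows. This is only an illustration in plain Python, not the project's code: the function names `build_master_dict` and `sentence_word_matrix`, the whitespace tokenization, and the toy sentences are all assumptions, and the dict-based sparse matrix stands in for the DMat/IMat structures mentioned above. The real input would come from the files under /var/local/destress/.

```python
from collections import Counter


def build_master_dict(sentences):
    """Return (words, counts, index), with words sorted by descending count.

    `counts` plays the role of the counts matrix and `index` maps each word
    to its row number (hypothetical stand-ins for DMat and IMat).
    """
    counter = Counter(tok for sent in sentences for tok in sent.split())
    # Sort by count (descending), then alphabetically, so the order is deterministic.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, _ in ordered]
    counts = [c for _, c in ordered]
    index = {w: i for i, w in enumerate(words)}
    return words, counts, index


def sentence_word_matrix(sentences, index):
    """Build a sparse m x n count matrix as {(word_row, sentence_col): count}."""
    entries = {}
    for col, sent in enumerate(sentences):
        for tok in sent.split():
            row = index.get(tok)
            if row is None:
                continue  # skip words that are not in the dictionary
            entries[(row, col)] = entries.get((row, col), 0) + 1
    return entries


# Toy sentences (made up for illustration; not LiveJournal data).
toy_sentences = ["i feel tired today", "i feel fine", "today was fine"]

words, counts, index = build_master_dict(toy_sentences)
matrix = sentence_word_matrix(toy_sentences, index)

# m = length of the dictionary, n = number of sentences.
shape = (len(words), len(toy_sentences))
print(shape, len(matrix))
```

With ~900K words the matrix has to stay sparse, which is why the sketch stores only the nonzero (word, sentence) entries instead of a dense m x n array.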