BIDS-collaborative / destress

Helping @peparedes with text analysis of LiveJournal data
ISC License

Process LiveJournal into sentences for BIDMach #40

Closed by coryschillaci 9 years ago

coryschillaci commented 9 years ago

The LiveJournal text needs to be processed into a format ready for BIDMach to train skip-thought representations (the same format is needed for word2vec). @geneyoo or @xih, are either of you willing to work on this over the next week? If so, please assign yourself! @coryschillaci has some code that does this for Twitter, and it shouldn't be too hard to modify.

coryschillaci commented 9 years ago

FYI, the format is a two-row array with one column per word: the first row holds a sentence ID number (it need not be unique, it only has to change between adjacent sentences) and the second row holds the dictionary token for the word.
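
To make the layout concrete, here is a minimal sketch in plain Python. It is not the code from the repo: the function name `sentences_to_two_row`, the use of a `dict` as the dictionary, and the choice to give unseen words a fresh index are all assumptions made for illustration.

```python
def sentences_to_two_row(sentences, vocab):
    """Flatten tokenized sentences into [sentence_id_row, token_id_row].

    `sentences` is a list of lists of words; `vocab` maps word -> integer
    dictionary token and is updated in place (hypothetical convention).
    """
    sent_ids, tok_ids = [], []
    for i, sentence in enumerate(sentences):
        for word in sentence:
            # Assumption: unseen words get the next free dictionary index.
            idx = vocab.setdefault(word, len(vocab))
            # The sentence ID only has to change between adjacent sentences;
            # the running index i is the simplest value that does so.
            sent_ids.append(i)
            tok_ids.append(idx)
    return [sent_ids, tok_ids]


vocab = {}
data = sentences_to_two_row([["i", "am", "tired"], ["so", "tired"]], vocab)
print(data)  # [[0, 0, 0, 1, 1], [0, 1, 2, 3, 2]]
```

Each column is one word, so the two rows always have the same length.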

coryschillaci commented 9 years ago

Since nobody volunteered and I had some spare time this morning, I went ahead and implemented it. See commit a0f6425441cf72f17981d30e3090bbc751085e68.

It might be good to consider a few issues at some point:

  1. Currently, line breaks are eaten by the flex tokenizer; it might be good to keep them as sentence breaks.
  2. Acronyms with periods currently break sentences.
  3. It might be good to tokenize contractions such as n't explicitly; I've seen this in several papers.
  4. Currently, the sentence-breaking punctuation is discarded, but we could also keep it in the featurization.
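
The four points above can be sketched roughly as follows. This is an illustrative Python stand-in, not the flex tokenizer from the commit: the function name `split_sentences`, the `keep_punct` flag, and the tiny `ACRONYMS` set are assumptions, and a real acronym list would need to be much longer.

```python
import re

# Assumption: a tiny hand-written acronym list, for illustration only.
ACRONYMS = {"u.s.", "e.g.", "i.e.", "a.m.", "p.m."}


def split_sentences(text, keep_punct=False):
    """Split raw text into lists of lowercase tokens, one list per sentence."""
    sentences = []
    # 1. Every line break also acts as a sentence break.
    for line in text.splitlines():
        current = []
        for raw in line.lower().split():
            # 2. Known acronyms stay whole and do not end the sentence.
            if raw in ACRONYMS:
                current.append(raw)
                continue
            # Peel any trailing sentence-ending punctuation off the token.
            m = re.match(r"^(.*?)([.!?]*)$", raw)
            word, punct = m.group(1), m.group(2)
            # 3. Split the n't contraction into its own token.
            if word.endswith("n't") and len(word) > 3:
                current.extend([word[:-3], "n't"])
            elif word:
                current.append(word)
            if punct:
                # 4. Optionally keep the punctuation as a token.
                if keep_punct:
                    current.append(punct)
                if current:
                    sentences.append(current)
                current = []
        if current:
            sentences.append(current)
    return sentences


text = "I don't live in the U.S. anymore\nIt is late. Goodnight!"
plain = split_sentences(text)
with_punct = split_sentences(text, keep_punct=True)
```

One known gap in this sketch: an acronym that really does end a sentence is never treated as a break, so that case would still need a separate rule.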

xih commented 9 years ago

Hey Cory, sorry for being super MIA. I've been caught up with the last week of summer session. We can talk more about what needs to be done.