coryschillaci closed 9 years ago
FYI, the format is a two-row array: each column holds a sentence ID number in the first row (it need not be unique, it only has to change between adjacent sentences) and the dictionary token for the word in the second row.
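For anyone picking this up, here is a rough sketch of that layout in plain Python. It is not the code from the repo: the function name `build_two_row_array`, the toy `vocab`, and the choice to map unknown words to a `<unk>` id of 0 are all assumptions made for illustration, and the real dictionary indexing may differ.

```python
from typing import Dict, List, Sequence


def build_two_row_array(
    sentences: Sequence[Sequence[str]],
    vocab: Dict[str, int],
    unk_id: int = 0,  # assumption: out-of-vocabulary words share one id
) -> List[List[int]]:
    """Return [sentence_ids, token_ids], one column per word."""
    sentence_ids: List[int] = []
    token_ids: List[int] = []
    for sent_id, words in enumerate(sentences):
        for word in words:
            # Row 1: an ID that changes whenever the sentence changes.
            sentence_ids.append(sent_id)
            # Row 2: the dictionary token for the word.
            token_ids.append(vocab.get(word, unk_id))
    return [sentence_ids, token_ids]


# Hypothetical toy data, already tokenized and lowercased.
sentences = [["i", "went", "home"], ["it", "rained"]]
vocab = {"<unk>": 0, "i": 1, "went": 2, "home": 3, "it": 4, "rained": 5}

array = build_two_row_array(sentences, vocab)
for row in array:
    print(row)
```

Using the running sentence index as the ID is just the simplest way to guarantee that the ID changes between adjacent sentences; any other scheme with that property would fit the description equally well.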
Since nobody volunteered and I had some spare time this morning, I went ahead and implemented it. See commit a0f6425441cf72f17981d30e3090bbc751085e68
It might be good to consider a few issues at some point:
'nt
explicitly, I've seen this in several papers
Hey Cory, sorry for being super MIA. I've been caught up with the last week of summer session. We can talk more about what needs to be done.
The LiveJournal text needs to be processed into a format ready for BIDMach to train skip-thought representations (the same format needed for word2vec). @geneyoo or @xih, is either of you willing to work on this over the next week? If so, please assign yourself! @coryschillaci has some code that does this for Twitter, and it shouldn't be too hard to modify.