BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

word2vec #31

Closed: xih closed this issue 9 years ago

xih commented 9 years ago

Hey @coryschillaci, @anasrferreira

Where is the combined dictionary located?

I'm looking in /var/local/destress/ but all I see are multiple xx_dict.imat or xx_dict.sbmat files. There is no single combined (golden) one.

Followup:

How would I go about converting this binary representation back to a text representation, e.g. like this:

{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0, 'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}

Cheers

anasrferreira commented 9 years ago

The master dictionary is in: /var/local/destress/tokenized/ under masterDict.dmat and masterDict.sbmat

Regarding your binary representation example: do the numbers mean word counts, or the indices you want to use? Usually the text gets converted from a specific 'word' to the index of that word in the dictionary. In BIDMat, the dictionary field cstr holds all the words ordered by index, e.g. 'dummy', 'is', 'scala', 'great'. The XML file with the text 'scala is great' gets converted into an imat 2,1,3.
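To make the word-to-index mapping concrete, here is a minimal Python sketch using the four-word example above. It is only an illustration: the `words` list is a hypothetical stand-in for the dictionary's cstr field, and the helper names (`build_token2id`, `encode`, `decode`) are made up for this example. Nothing here reads the actual .sbmat or .imat files.

```python
# Hypothetical stand-in for the dictionary's word list (the cstr field),
# ordered by index. A real dictionary would be loaded from the masterDict files.
words = ["dummy", "is", "scala", "great"]


def build_token2id(words):
    """Map each word to its position in the dictionary."""
    return {word: index for index, word in enumerate(words)}


def encode(text, token2id):
    """Convert whitespace-separated text into a list of dictionary indices."""
    return [token2id[token] for token in text.split()]


def decode(indices, words):
    """Convert a list of dictionary indices back into text."""
    return " ".join(words[i] for i in indices)


token2id = build_token2id(words)
encoded = encode("scala is great", token2id)
decoded = decode(encoded, words)

print(token2id)
print(encoded)
print(decoded)
```

The `token2id` mapping has the same shape as the text representation asked about in the original question, so once the word list is available in index order, producing that form is a single enumeration.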

Note: 'dummy' is added in the method loadDict in utils.scala (check the folder destress/process_data/). We do this because xmltweet converts the first word of the dictionary into the integer 1, but in Scala indexing starts at zero.
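The effect of the placeholder can be sketched in Python, which is also zero-indexed. This is not the loadDict code from utils.scala; `load_dict_with_dummy` and the sample data are assumptions made up for illustration.

```python
# Hypothetical dictionary as written to disk, without any placeholder entry.
raw_words = ["is", "scala", "great"]

# Hypothetical 1-based indices for the text "scala is great",
# i.e. the first dictionary word is encoded as the integer 1.
one_based = [2, 1, 3]


def load_dict_with_dummy(raw_words):
    """Prepend a placeholder so 1-based indices work with 0-based lookup."""
    return ["dummy"] + list(raw_words)


padded = load_dict_with_dummy(raw_words)

# Without the placeholder, every lookup needs an explicit -1 correction.
without_padding = [raw_words[i - 1] for i in one_based]

# With the placeholder, the stored indices can be used directly.
with_padding = [padded[i] for i in one_based]

print(without_padding)
print(with_padding)
```

Both lookups recover the same words. Padding the dictionary once just avoids having to remember the offset at every lookup site.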

davclark commented 9 years ago

We talked today about the exact structure of the matrices... none of us were sure how they are structured.

The masterDict files just have the combined mappings, but where are the converted posts that use those mappings (and what is their structure)? We should put this in the Wiki (or alternatively in a README or something).

coryschillaci commented 9 years ago

There is some info on the wiki at https://github.com/berkeley-dsc/destress/wiki/Processing-Pipeline

In reply to Dav Clark's comment of Monday, April 6, 2015.

davclark commented 9 years ago

Ah! Thank you - that's very clear. I'll leave this for @xih to close.