Closed xih closed 9 years ago
The master dictionary is in /var/local/destress/tokenized/, stored as masterDict.dmat and masterDict.sbmat.
About your binary representation example: do the numbers mean word counts, or the indices you want to use? Usually the text gets converted from a specific 'word' to the index of that word in the dictionary. In BIDMat, the dictionary field cstr has all the words ordered by index, e.g. 'dummy', 'is', 'scala', 'great'. The xml file with the text 'scala is great' gets converted into an imat 2,1,3.
Note: 'dummy' is added by the method loadDict in utils.scala (check the folder destress/process_data/). We do this because xmltweet maps the first word of the dictionary to the integer 1, whereas indexing in Scala starts at zero.
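To make the indexing concrete, here is a minimal sketch in plain Python (not the BIDMat API). The list `words` is a hypothetical stand-in for the dictionary's cstr field, and `encode` is an illustrative helper, not something from the repo. It assumes whitespace tokenization, which may differ from what xmltweet actually does.

```python
# Hypothetical stand-in for the dictionary's cstr field: words ordered by index.
# 'dummy' occupies slot 0 so that the 1-based indices from xmltweet line up
# with 0-based indexing.
words = ["dummy", "is", "scala", "great"]

# Build the word -> index lookup from the ordered word list.
word_to_index = {word: i for i, word in enumerate(words)}


def encode(text):
    """Convert whitespace-separated text into a list of dictionary indices."""
    return [word_to_index[token] for token in text.split()]


print(encode("scala is great"))  # [2, 1, 3]
```

Without the 'dummy' entry, 'is' would sit at index 0 and every index would be off by one relative to the integers in the converted files.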
We talked today about the exact structure of the matrices... none of us were sure how they were structured.
The masterDict files just have the combined mappings - but where are the converted posts that use those mappings (and what is their structure)? We should put this in the wiki (or alternatively in a README or something).
There is some info on the wiki at https://github.com/berkeley-dsc/destress/wiki/Processing-Pipeline
Ah! Thank you - that's very clear. I'll leave this for @xih to close.
Hey @coryschillaci, @anasrferreira
Where is the combined dictionary located?
I'm looking in /var/local/destress/ but all I see are multiple xx_dict.imat or xx_dict.sbmat files. There is no single combined / golden one.
Followup:
How would I go about converting this binary representation back to a text representation? E.g. using a mapping like this:
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0, 'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
Cheers
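For the followup question, one way to get from indices back to words is to invert a word-to-index mapping like the one quoted above. This is a minimal sketch in plain Python, not the project's actual method. The names `token2id`, `id2token`, and `to_text` are illustrative, and it assumes every word maps to a unique index.

```python
# Word -> index mapping, copied from the example in the question.
token2id = {'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8,
            'computer': 0, 'survey': 4, 'user': 7, 'human': 1, 'time': 6,
            'interface': 2, 'response': 3}

# Invert it: index -> word. This relies on the indices being unique.
id2token = {index: token for token, index in token2id.items()}


def to_text(indices):
    """Map a sequence of integer indices back to their words."""
    return [id2token[i] for i in indices]


print(to_text([1, 0, 2]))  # ['human', 'computer', 'interface']
```

With the BIDMat dictionaries the same idea applies, except that the word list is already ordered by index, so the index itself serves as the lookup position.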