BIDS-collaborative / destress

Helping @peparedes with text analysis of LiveJournal data
ISC License

Basic statistics on dataset #7

Closed by coryschillaci 9 years ago

coryschillaci commented 9 years ago

We should determine some basic properties of the data set. Some examples:

davclark commented 9 years ago

Update today from @coryschillaci:

About 30% of posts with text have a mood id!

Do you want to make a report somewhere shared, @coryschillaci?

coryschillaci commented 9 years ago

The plan is to add some statistics tracking to the featurizer code, then I'll post some results to the Wiki. If anyone has statistics they especially want, please let me know.

coryschillaci commented 9 years ago

Actually it turns out that nearly 50% of posts with string content have moodid tags!

There are a total of 1,261,814 users in the data set.
Of these, 968,217 have at least one <string> post.
There are a total of 81,488,018 <post> fields in the xml files.
There are a total of 64,326,865 <string> posts, 49.69% of which have a moodid tag.

peparedes commented 9 years ago

Nice... Very interesting

P

coryschillaci commented 9 years ago

The five most common moodids are

44 - amused - 4.04%
31 - tired - 3.76%
15 - happy - 3.08%
5 - bored - 2.62%
125 - cheerful - 2.39%

The five least common moodids are

128 - intimidated - 0.049%
80 - envious - 0.064%
77 - recumbent - 0.067%
81 - sympathetic - 0.069%
133 - jealous - 0.078%

peparedes commented 9 years ago

Any idea of the variation of emotion per user?

P

coryschillaci commented 9 years ago

@peparedes How would you suggest measuring that?

coryschillaci commented 9 years ago

Without trimming the dictionaries, the following statistics apply to the distribution of word counts for each string post:

mean: 362.51
median: 214
mode: 41

After trimming the dictionaries and retaining only the words which exist in the trimmed dictionary, the following statistics apply to the distribution of word counts for each string post:

mean: 361.54
median: 213
mode: 41

coryschillaci commented 9 years ago

188,847 posts with moodids have 5 words or fewer.

peparedes commented 9 years ago

Let's start by counting per user: a matrix of users against moodids, with each element holding the count for that user and moodid.

A more elaborate approach would be a timeline per user.

P

coryschillaci commented 9 years ago

OK that's a bit of work. I'll try to get around to it this weekend.

peparedes commented 9 years ago

Thanks Cory... if you give me some guidance I can help. lmk if you have a bit of time tomorrow.

P

coryschillaci commented 9 years ago

I think it will be most straightforward to make a new function in featurizers.scala which is a variation of the current featurizeMoodID. If you can figure out the code that @anasrferreira and I wrote, this version should actually be simpler. It's probably easiest for me to do it though, since I worked on the other code most recently.

coryschillaci commented 9 years ago

@peparedes I created the matrix you described, with the counts of moodids per user, in commit 38d5635cc137aa4f15751832710e289f7eeba8eb.

You can find the file at /var/local/destress/dataForICA/moodsByUser.smat.lz4. Each column corresponds to one user, and the row number is the moodid. Rows 0, 50 and 94 are always zero; they are kept so that the row index corresponds directly to the moodid.

For convenience, @anasrferreira created a dictionary which decodes the moodid numbers, /var/local/destress/moodDict.sbmat.
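For anyone who wants to see the layout without opening the .smat file, here is a minimal Python sketch of the same idea: one column per user, with the row index equal to the moodid. It is not the featurizers.scala code. The function names, the (user, moodid) input pairs, and the row count of 135 are all assumptions made for illustration (the thread only shows that moodids run to at least 133).

```python
from collections import Counter

def moods_by_user(pairs):
    """Count moodids per user.

    `pairs` is a hypothetical iterable of (user, moodid) tuples, one per post
    that carries a valid integer moodid. Returns (users, columns), where
    columns[j] is a Counter mapping moodid -> count for users[j].
    """
    col_index = {}   # user -> column number, in order of first appearance
    users = []
    columns = []
    for user, moodid in pairs:
        j = col_index.get(user)
        if j is None:
            j = len(users)
            col_index[user] = j
            users.append(user)
            columns.append(Counter())
        columns[j][moodid] += 1
    return users, columns

def to_dense(columns, n_rows):
    """Expand the sparse columns so that dense[moodid][j] is the count.

    Rows for unused moodids stay all zero, so the row index always equals
    the moodid.
    """
    dense = [[0] * len(columns) for _ in range(n_rows)]
    for j, col in enumerate(columns):
        for moodid, count in col.items():
            dense[moodid][j] = count
    return dense

# Toy data using moodids quoted earlier in this thread
# (44 amused, 31 tired, 15 happy, 5 bored).
posts = [("alice", 44), ("bob", 31), ("alice", 44), ("alice", 15), ("bob", 5)]
users, columns = moods_by_user(posts)
dense = to_dense(columns, n_rows=135)  # 135 is an assumed upper bound
print(users)
print(dense[44])
```

Keeping the counts sparse per user and only expanding on demand matters at the real scale, since most users use only a few of the available moodids.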

coryschillaci commented 9 years ago

Some additional statistics from making the moodsByUser set:

There are a total of 1,261,814 users in the data set.
Of these, 780,471 (61.85%) have at least one valid integer moodid.
The total number of <current_mood> and <current_moodid> tags is 42,991,943.
There are 6,987,767 custom moods specified by strings (16.25% of the total).
The total number of <current_moodid> tags is 35,216,612 (81.91% of the total). 
    Of these, 35,211,085 (99.984%) are valid integers.
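As a quick sanity check, the percentages above can be recomputed from the raw counts. The short Python sketch below uses only the numbers quoted in this comment; the variable names and the pct helper are made up for illustration. It also shows that the custom string moods and the <current_moodid> tags together do not add up to the stated total, leaving a remainder that the comment does not explain.

```python
def pct(part, whole, digits=2):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)

# Counts copied from the comment above.
total_users = 1_261_814
users_with_moodid = 780_471
mood_tags = 42_991_943      # <current_mood> plus <current_moodid> tags
custom_moods = 6_987_767    # moods specified by strings
moodid_tags = 35_216_612    # <current_moodid> tags
valid_moodids = 35_211_085  # moodid tags holding a valid integer

print(pct(users_with_moodid, total_users))    # 61.85
print(pct(custom_moods, mood_tags))           # 16.25
print(pct(moodid_tags, mood_tags))            # 81.91
print(pct(valid_moodids, moodid_tags, 3))     # 99.984

# Tags covered by neither of the two listed categories.
remainder = mood_tags - custom_moods - moodid_tags
print(remainder, pct(remainder, mood_tags))   # 787564 1.83
```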

coryschillaci commented 9 years ago

Regarding the number of moodids per user in this matrix,

mean: 43.7
median: 11
mode: 1
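A sketch of how summary numbers like these could be computed from a per-user count structure, in Python. It assumes that "number of moodids per user" means the total count in each user's column, and that users with no moodids are left out. The thread confirms neither choice, and the per_user_summary name and the toy data are hypothetical.

```python
from statistics import mean, median, multimode

def per_user_summary(columns):
    """Summarise the total moodid count per user.

    `columns` is a hypothetical list with one dict per user, mapping
    moodid -> count. Users with no moodids are dropped (an assumption).
    """
    totals = [sum(col.values()) for col in columns]
    totals = [t for t in totals if t > 0]
    return {
        "mean": mean(totals),
        "median": median(totals),
        "modes": multimode(totals),  # list, because ties are possible
    }

# Four toy users with 3, 1, 1 and 10 moodid-tagged posts respectively.
toy_columns = [{44: 2, 15: 1}, {31: 1}, {5: 1}, {44: 4, 125: 6}]
summary = per_user_summary(toy_columns)
print(summary)
```

A mean far above the median, with a mode of 1, is what a long-tailed distribution looks like: many users tag a mood once, while a few heavy users pull the mean up.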