BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Trying to plot some mood IDs on a 2-D coordinate system #39

Open DanielTakeshi opened 9 years ago

DanielTakeshi commented 9 years ago

Since other people have written preliminary results on the issues tracker, I will do the same for now.

Following some discussion with @peparedes, I've been trying to plot the 132 mood IDs on a 2-D coordinate system so that we can analyze where certain moods fall on an axis system. I did a quick test with the following data: a 132 x 780471 matrix where each column corresponds to a user (who may have multiple documents) and the values are the number of times a mood ID has been attached to one of that user's articles. So if element (i,j) is 3, that means user j wrote three posts that had mood i. Am I correct about this data, @coryschillaci? Most of the values are zero, but the matrix is not sparse in the missing-data sense: a 0 means the user never used that mood, not that the value is unknown. This distinction is important.
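To make the layout of that matrix concrete, here is a minimal NumPy sketch. The function name `mood_user_counts` and the toy `(user, mood)` pairs are made up for illustration; this is not the actual pipeline or the real LiveJournal data.

```python
import numpy as np

def mood_user_counts(posts, n_moods, n_users):
    """Build an (n_moods, n_users) count matrix from (user, mood) pairs.

    Entry (i, j) is the number of posts by user j tagged with mood i.
    """
    counts = np.zeros((n_moods, n_users), dtype=np.int64)
    users = np.array([u for u, _ in posts], dtype=np.int64)
    moods = np.array([m for _, m in posts], dtype=np.int64)
    # np.add.at accumulates correctly when the same (mood, user) pair repeats
    np.add.at(counts, (moods, users), 1)
    return counts

# Toy data: user 0 wrote three posts with mood 2, and so on.
posts = [(0, 2), (0, 2), (0, 2), (1, 0), (2, 1), (2, 2)]
X = mood_user_counts(posts, n_moods=3, n_users=3)
```

Here `X[2, 0]` is 3 because user 0 has three posts with mood 2, and every zero is a true count of zero rather than a missing value.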

In any case, BIDMach's SFA kept giving me NaNs when I ran it on this data, so I tried BIDMach's Non-Negative Matrix Factorization (NMF) with a dimension of two. What this means is that NMF finds two 780,471-dimensional vectors and approximates each of the 132 original variables (i.e., rows) as a linear combination of those two vectors. Each variable can therefore be expressed in a 2-D coordinate system, where the coordinates are its coefficients in the linear combination. The weakness (in my opinion) of NMF is that it constrains those coefficients to be non-negative, which we don't want here.
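The rank-2 idea can be sketched in plain NumPy with the standard multiplicative-update rule for NMF. This is an illustrative stand-in, not BIDMach's implementation; the function `nmf`, its parameters, and the Poisson toy data are all assumptions.

```python
import numpy as np

def nmf(X, rank=2, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative X (rows x cols) as W @ H with W, H >= 0.

    Uses multiplicative updates that reduce the Frobenius error ||X - W H||.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for the mood-by-user counts (132 moods, 500 users).
rng = np.random.default_rng(1)
X = rng.poisson(0.3, size=(132, 500)).astype(float)

W, H = nmf(X, rank=2)
coords = W  # row i holds the 2-D coordinates of mood i
```

Each row of `H` is one of the two basis vectors in user space, and each row of `W` gives a mood's two non-negative coefficients, so all points land in the first quadrant.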

The following image shows a graph that I created from the NMF output. Unfortunately, it's not very informative: too many points are clustered near the origin, and I'm not sure why "amused" is so far up. The axes are not labeled because matrix factorization and factor analysis techniques have an axis ambiguity; one can interchange the rows of the sources.

[Image: Test image]

I also did a 3-D plot, i.e., NMF with dimension 3, but again, a lot of points were clustered near the origin.

I then tried two things that probably have stronger theoretical backing than the above technique. Rather than use NMF to reduce the data dimensionality from 132 to 2, I first reduced the data from 132 dimensions to six and then applied ICA to find six independent sources. (This follows from what @peparedes suggested about there being six major categories of emotions.) Once I had the six sources, I looked for the two most powerful ones. There are several ways to do this; I ranked the sources by their contribution to the original data, measured as the sum of squares of the product of the mixing matrix and the source matrix when the source matrix keeps only the row corresponding to that component.

This is described at the bottom of page 165 of this 1998 paper, which deals with ICA in an fMRI application.
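The ranking criterion can be sketched as follows, assuming a mixing matrix `A` (one column per component) and a source matrix `S` (one row per component) are already available from ICA. The function name `rank_components` and the random toy matrices are hypothetical.

```python
import numpy as np

def rank_components(A, S):
    """Rank components by the sum of squares of their contribution A[:, k] S[k, :].

    The contribution of component k is the outer product of column k of A
    and row k of S; its sum of squares factors into the product of the two
    squared norms, so the full outer product never has to be formed.
    """
    energy = (A ** 2).sum(axis=0) * (S ** 2).sum(axis=1)
    order = np.argsort(energy)[::-1]  # largest contribution first
    return order, energy

# Toy example: six components with different scales.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
scales = np.array([1.0, 5.0, 2.0, 0.5, 3.0, 0.1])
S = rng.laplace(size=(6, 1000)) * scales[:, None]

order, energy = rank_components(A, S)
top_two = order[:2]  # indices of the two strongest components
```

Avoiding the explicit outer product matters at the scale of 780,471 columns, where forming each single-component reconstruction would be wasteful.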

Then the top two components/sources form a pair of axes. They are two vectors in 780,471-dimensional space. I then computed the coefficients for the 132 original moods, which involves two levels of linear combinations. Finally, I plotted the results.
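One way to read the "two levels" is as a product of two coefficient matrices: the reduction writes the data as `P @ Z`, ICA writes `Z` as `A @ S`, so each mood's coefficients on the sources are the rows of `P @ A`. The sketch below assumes the reduction is a truncated SVD and uses a random invertible matrix as a stand-in for an ICA mixing matrix. Both are assumptions, as are the indices in `top_two`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(132, 400)).astype(float)  # toy stand-in data

# Level 1: linear reduction to six dimensions (truncated SVD, an assumption).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U[:, :6] * s[:6]  # 132 x 6: coefficients of each mood on the reduced rows
Z = Vt[:6]            # 6 x 400: reduced data

# Level 2: ICA would factor Z = A @ S. A random invertible A stands in here.
A = rng.normal(size=(6, 6))
S = np.linalg.solve(A, Z)  # 6 x 400 "sources" consistent with Z = A @ S

# Composing both levels: X is approximately (P @ A) @ S.
C = P @ A              # 132 x 6: coefficients of each mood on each source
top_two = [0, 1]       # hypothetical indices of the two strongest sources
coords = C[:, top_two] # 132 x 2: points to plot
```

Unlike the NMF coordinates, the entries of `coords` can be negative, so the points are not confined to one quadrant.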

Using NMF, I obtained the following plot:

[Image: test]

I also tried using LDA, and got this plot:

[Image: test]

The advantage of LDA is that we have positive and negative components in the original mixing matrix (but ICA can introduce negative elements in the mixing matrix anyway).

I am not happy with any of the three plots, and I suspect that the data we have might need to be a little better. There are too many points clustered near the middle, and maybe some outlier points are influencing the results? I'm not sure how to interpret it. Does anyone have thoughts?

I will push the BIDMach script I used to generate the data to this repository. I will double check to make sure the pipeline is working (update: it should be OK).

Also, @peparedes can you send me a link to the papers we talked about, the ones that had to do with the six major categories of emotions?