BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Create a dataset for running ICA #30

Closed coryschillaci closed 9 years ago

coryschillaci commented 9 years ago

Currently discussing what should be in the dataset with @DanielTakeshi and @jcanny.

coryschillaci commented 9 years ago

Excerpts from an email to @DanielTakeshi

I've managed to create one dataset that I think is worth trying ICA on (although I don't have much intuition for what will happen). I built a matrix in which each column corresponds to one user in the LiveJournal dataset and each row corresponds to a moodid, which LJ lets users choose from a list when posting. The matrix has 135 rows, but rows 0, 50, and 94 are always zero because they don't actually correspond to a moodid for some reason. Each entry is the number of times the given user reported the given moodid.
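To make the layout concrete, here is a minimal Python sketch of the same kind of count matrix. It is not the code from the repo: the real featurizer is written in Scala and produces a sparse matrix, while this uses plain dense lists. The function names, the (user, moodid) input format, and the sample posts are all made up for illustration; only the 135-row shape comes from the description above.

```python
from collections import Counter, defaultdict

NUM_MOOD_ROWS = 135  # row count stated in the thread


def moods_by_user(posts, num_moods=NUM_MOOD_ROWS):
    """Build a moodid-by-user count matrix.

    posts is an iterable of (user, moodid) pairs (an assumed input format).
    Returns (users, matrix), where matrix[moodid][col] is the number of
    times users[col] reported that moodid.
    """
    counts = defaultdict(Counter)
    for user, moodid in posts:
        if not 0 <= moodid < num_moods:
            raise ValueError(f"moodid {moodid} is outside 0..{num_moods - 1}")
        counts[user][moodid] += 1

    users = sorted(counts)  # fixed column order, one column per user
    matrix = [[0] * len(users) for _ in range(num_moods)]
    for col, user in enumerate(users):
        for moodid, n in counts[user].items():
            matrix[moodid][col] = n
    return users, matrix


def empty_rows(matrix):
    """Return the indices of rows that are zero for every user."""
    return [r for r, row in enumerate(matrix) if not any(row)]


# Made-up posts, purely for illustration.
posts = [("alice", 3), ("alice", 3), ("bob", 7), ("alice", 7)]
users, matrix = moods_by_user(posts)
```

On the real data, a check like empty_rows is one way to confirm which rows never receive a count.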

The source code to generate this is in our github repo, at "process_data/featurizers.scala". The function "featurizeByUser" is what I ran.

I think you have access to mercury. The matrix I have described is stored there as a sparse matrix at /var/local/destress/featurized/moodsByUser.smat.lz4. If you want to know what each moodid corresponds to in English, there is a dictionary that performs this mapping at /var/local/destress/moodDict.sbmat. ... I'm not sure exactly what to expect from ICA, but my hope is that it will cluster the moodids in some way.
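One way to check whether the components group moods sensibly is to list the moodids with the largest absolute weight in each component and translate them to English. The sketch below assumes each ICA component is a list with one weight per moodid row, and that the mood dictionary has already been loaded into an ordinary Python dict. Neither format is confirmed by the thread, and the weights and mood names here are invented.

```python
def top_moods(component, mood_names, k=3):
    """Return the names of the k moodids with the largest absolute weight.

    component: one weight per moodid row (assumed layout of an ICA component).
    mood_names: dict from moodid to English name (hypothetical stand-in for
    the contents of moodDict.sbmat).
    """
    # The sign of an ICA component is arbitrary, so rank by magnitude.
    ranked = sorted(range(len(component)),
                    key=lambda i: abs(component[i]),
                    reverse=True)
    return [mood_names.get(i, f"moodid {i}") for i in ranked[:k]]


# Invented weights and names, purely for illustration.
component = [0.0, 0.9, -0.1, -0.8]
mood_names = {1: "happy", 2: "tired", 3: "cheerful"}
strongest = top_moods(component, mood_names, k=2)
```

If the top moods in a component read as related (for example, several upbeat moods together), that would be a sign ICA is clustering the moodids in the hoped-for way.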

coryschillaci commented 9 years ago

I moved the data described in the above message to /var/local/destress/dataForICA/moodsByUser.smat.lz4 so it isn't mixed in with the bigger bag-of-words set.

DanielTakeshi commented 9 years ago

Closing this now because it appears to have been resolved. Also, note that I've pushed a file, moods_data.tar.gz, to my directory on mercury, /home/daniel. It contains the data ICA generated, along with a README to make it crystal clear how I did things.