BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Generate ML dataset with "ground truth" emotions corresponding to posts #3

Closed: davclark closed this issue 9 years ago

davclark commented 9 years ago

@coryschillaci, it sounds like this might be a good first task for you to get us moving forward on our first actual ML tasks. But just say so if you're not interested.

davclark commented 9 years ago

Currently @coryschillaci and @anasrferreira are working on generating a useful subset of emotions from the full set of LJ tags.

davclark commented 9 years ago

@jcanny suggests that it would be good to concatenate xml files into 10MB+ files (for efficiency reasons). Ideal grain size for ML is probably ~1GB.

@lambdaloop offered to take this on.
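For concreteness, per-folder concatenation could look roughly like this minimal sketch (plain JDK file I/O; combineFolder and the paths are hypothetical names, not the actual clean_data scripts):

import java.nio.file.{Files, Path, StandardOpenOption}
import scala.collection.JavaConverters._

// Sketch: concatenate every .xml file under `dir` into one combined file,
// so downstream tools see a few large files instead of thousands of small ones.
def combineFolder(dir: Path, out: Path): Unit = {
  // Gather the folder's .xml files in a stable order.
  val xmls = Files.list(dir).iterator.asScala
    .filter(_.toString.endsWith(".xml"))
    .toSeq.sortBy(_.toString)
  val os = Files.newOutputStream(out,
    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  try xmls.foreach(p => os.write(Files.readAllBytes(p)))
  finally os.close()
}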

coryschillaci commented 9 years ago

@helgammal agreed to work on code to merge the dictionaries built from running xmltweet on subsets of the data.
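The merge itself is conceptually a union that sums counts. A minimal sketch over plain Scala maps (the real code would operate on BIDMat Dict objects; mergeCounts and mergeAll are hypothetical names):

// Sketch: each partial dictionary maps a token to its count; merging two
// dictionaries sums the counts of tokens that appear in both.
def mergeCounts(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
  b.foldLeft(a) { case (acc, (tok, n)) => acc.updated(tok, acc.getOrElse(tok, 0.0) + n) }

// Merge many partial dictionaries pairwise.
def mergeAll(dicts: Seq[Map[String, Double]]): Map[String, Double] =
  dicts.reduce(mergeCounts)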

lambdaloop commented 9 years ago

I've concatenated the xml files by folder. The combined files are in ~pierre/combined/events on mercury.

I've also created a file listing all of the combined files, located at ~pierre/combined/files.txt, since I heard it may be useful.

I've pushed the scripts used to combine and list files into clean_data in this repo.
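The listing half of that is straightforward; a minimal sketch of the idea (plain JDK I/O; writeFileList is a hypothetical name, not the actual clean_data script):

import java.io.PrintWriter
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Sketch: walk a tree of combined files and write one path per line,
// roughly the shape of ~pierre/combined/files.txt.
def writeFileList(root: Path, out: Path): Unit = {
  val pw = new PrintWriter(out.toFile)
  try Files.walk(root).iterator.asScala
    .filter(p => Files.isRegularFile(p))
    .foreach(p => pw.println(p.toString))
  finally pw.close()
}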

peparedes commented 9 years ago

Excellent Pierre!

Thanks,

Pablo

davclark commented 9 years ago

:+1: and we've still got a healthy 115G to spare... I've got a 3TB drive on order, so we shouldn't have as much of an issue with space in the near future.

Shall I move these into the shared git-annex area?

cdschillaci commented 9 years ago

@davclark It would probably be good to have them in the git-annex.

We still have some permissions issues: when running xmltweet, it includes the file path of the input files in some of the output file paths, which makes sending all of the output to a personal folder a little tricky. @anasrferreira is investigating tweaking the source of xmltweet from the BIDMach repo. Alternately, we can just run xmltweet from the folder containing the concatenated files.

davclark commented 9 years ago

I'm finding that cloning the git repository (not even copying any files with git annex yet) is unacceptably slow. I'm thinking I'll create a new git annex with just these new files.

That said, @anasrferreira appears to be working right now, so I'll wait to do this later. At this point, maybe I'll just do it at the meeting tomorrow.

anasrferreira commented 9 years ago

@davclark and @cdschillaci I've changed newparse.cpp to strip out the path for the input file if an output path is given. The new xmltweet2.exe is compiled in my home folder (/home/xmltweet2.exe).
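For anyone following the thread, the logic of that change is roughly the following (a Scala sketch of the idea only; the actual fix is C++ in newparse.cpp, and outputPath is a hypothetical name):

import java.nio.file.Paths

// Sketch of the fix: when an output directory is given, build the output
// path from the input file's basename only, so the input's directory
// components never leak into the output path.
def outputPath(inFile: String, outDir: Option[String]): String = outDir match {
  case Some(dir) => Paths.get(dir, Paths.get(inFile).getFileName.toString).toString
  case None      => inFile   // old behavior: write next to the input
}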

davclark commented 9 years ago

@anasrferreira please submit a pull request to @jcanny, per the above issue!

jcanny commented 9 years ago

Sorry I was slow with this. I pushed a similar change to the master branch. It compiles, but I haven't had a chance to test it. Can you check it out?

-John


jcanny commented 9 years ago

Folks, I'm volunteering at my kids' school today so I can't make the meeting. But I would recommend a very relevant talk by Dean Eckles in 330 Blum Hall from 3-4pm today. Dean is at Facebook and is doing both real and observational studies (the latter being the kind we want to do with existing data) on Facebook's datasets.

-John

anasrferreira commented 9 years ago

@jcanny xmltweet seems to work.

coryschillaci commented 9 years ago

Met today with @coryschillaci, @anasrferreira, and @lambdaloop.

@lambdaloop will look at making sure that xmltweet parses closing tags properly.

@anasrferreira will work on turning parsed imat/sbmat files into feature vectors (see the sketch after this list for the general idea).

Proposed Featurization

@coryschillaci will investigate whether it's better to merge dictionaries before or after the featurizing procedure, and implement something in the latter case.
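For the featurization step, the general idea is something like this minimal sketch (plain Scala over a merged token-to-index map; not the actual BIDMach code, and featurize is a hypothetical name):

import scala.collection.mutable

// Sketch: map a post's tokens through a merged dictionary (token -> column
// index) into a sparse bag-of-words count vector, represented here as
// (index, count) pairs; tokens missing from the dictionary are dropped.
def featurize(tokens: Seq[String], dict: Map[String, Int]): Seq[(Int, Double)] = {
  val counts = mutable.Map[Int, Double]().withDefaultValue(0.0)
  tokens.foreach { t => dict.get(t).foreach { i => counts(i) += 1.0 } }
  counts.toSeq.sortBy(_._1)
}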

coryschillaci commented 9 years ago

@helgammal Have you tried merging all of the dictionaries at once using this? I'm wondering if the full merged dictionary is too big to deal with in this way.

We can save some disk space and later processing time if we do the index update now with a trim. I noticed that the big dicts, for example, are about 80% one-time entries. The Dict class has a trim method:

def trim(thresh:Int):Dict = {
    val ii = find(counts >= thresh.toDouble)  // indices of entries with count >= thresh
    Dict(cstr(ii), DMat(counts(ii)))          // rebuild the Dict from only those entries
}

We should discuss what the trim threshold needs to be, but at a first pass something around 100 on the full dataset is probably safe enough.
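Assuming masterDict is the merged Dict (hypothetical name), usage would then just be:

// Keep only tokens that occur at least 100 times across the full dataset;
// note this renumbers the surviving entries, hence doing the index update now.
val trimmed = masterDict.trim(100)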

helgammal commented 9 years ago

@coryschillaci: I haven't actually tried merging all of the data, but I think your concern is quite valid; another concern is whether keeping this in memory is a good idea or whether we'd need to write it to disk periodically. Trimming makes a lot of sense, too.

coryschillaci commented 9 years ago

@anasrferreira has finished a Scala script to featurize one xml file at a time (dca678a478272e955b330ff60d332db489c774b7).

coryschillaci commented 9 years ago

I updated the featurizer code to work with the new master dictionary in commit 91f398ba9340954e9988db28469e43ec1ef06775.

Next step is to modify the code so that we can featurize all of the data files. @anasrferreira @coryschillaci

davclark commented 9 years ago

So is this closeable?

coryschillaci commented 9 years ago

Almost! I will close once I have a clean output.

coryschillaci commented 9 years ago

Another to-do on this issue: organize the bag-of-words features by their frequency.
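The reordering could look roughly like this (a sketch over plain Scala maps rather than the BIDMat types; byFrequency and both arguments are hypothetical names, and every token in oldIndex is assumed to appear in counts):

// Sketch: order tokens by descending count so that feature 0 is the most
// frequent word; returns the new token -> index mapping plus a remap from
// old feature indices to new ones.
def byFrequency(counts: Map[String, Double], oldIndex: Map[String, Int])
    : (Map[String, Int], Map[Int, Int]) = {
  val sorted = counts.toSeq.sortBy { case (_, n) => -n }.map(_._1)
  val newIndex = sorted.zipWithIndex.toMap
  val remap = oldIndex.map { case (tok, i) => i -> newIndex(tok) }
  (newIndex, remap)
}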

coryschillaci commented 9 years ago

A dataset with moodid tags is now available in /var/local/destress/featurized/. The posts are saved in 320 batches of 100,000 posts each, except the last batch, which has 70,384 posts (31,970,384 posts in total).