Closed by davclark 9 years ago
Currently @coryschillaci and @anasrferreira are working on generating a useful subset of emotions from the full set of LJ tags.
@jcanny suggests concatenating the xml files into files of 10 MB or more (for I/O efficiency). The ideal grain size for ML is probably ~1 GB.
@lambdaloop offered to take this on.
@helgammal Agreed to work on code to merge the dictionaries built from running xmltweet on subsets of the data.
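Merging the per-subset dictionaries amounts to summing word counts across them. A minimal Python sketch of the idea (the real code operates on BIDMach Dict objects, so these names are illustrative):

```python
from collections import Counter

def merge_dicts(dicts):
    """Merge word-count dictionaries by summing counts per token."""
    merged = Counter()
    for d in dicts:
        merged.update(d)  # adds counts for shared tokens, inserts new ones
    return dict(merged)

# Example: two dictionaries built from different subsets of the data
a = {"happy": 3, "sad": 1}
b = {"happy": 2, "tired": 5}
print(merge_dicts([a, b]))  # {'happy': 5, 'sad': 1, 'tired': 5}
```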
I've concatenated the xml files by folder. The combined files are in ~pierre/combined/events on mercury.
I've also created a file with a list of all of the files, located at ~pierre/combined/files.txt. I heard it may be useful.
I've pushed the scripts used to combine and list files into clean_data in this repo.
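The per-folder combining step is roughly the following (a hedged sketch, not the actual clean_data scripts; the `.xml` suffix filter and sorted ordering are assumptions):

```python
import os

def combine_folder(folder, out_path):
    """Concatenate every .xml file in a folder into one combined file.
    out_path should live outside the input folder so it isn't picked up."""
    names = sorted(n for n in os.listdir(folder) if n.endswith(".xml"))
    with open(out_path, "wb") as out:
        for name in names:
            with open(os.path.join(folder, name), "rb") as f:
                out.write(f.read())
    return names  # the collected names are also handy for a files.txt listing
```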
Excellent Pierre!
Thanks,
Pablo
:+1: and we've still got a healthy 115 GB to spare... I've got a 3 TB drive on order, so we shouldn't have as much of an issue with space in the near future.
Shall I move these into the shared git-annex area?
@davclark It would probably be good to have them in the git-annex.
We still have some permissions issues: xmltweet embeds the paths of the input files in some of the output file paths, which makes sending all of the output to a personal folder a little tricky. @anasrferreira is investigating tweaking the source of xmltweet from the BIDMach repo. Alternatively, we can just run xmltweet from the folder containing the concatenated files.
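The fix amounts to keeping only the basename of each input file when building the output path. Sketched here in Python terms (the actual change is in newparse.cpp; the `.out` suffix and function name are made up for illustration):

```python
import os

def output_path(input_file, out_dir):
    """Build the output filename from the input's basename only,
    so results land in out_dir regardless of where the input lives."""
    base = os.path.basename(input_file)  # strip the input's directory part
    return os.path.join(out_dir, base + ".out")

print(output_path("/data/events/part01.xml", "/home/me/out"))
# /home/me/out/part01.xml.out
```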
I'm finding that cloning the git repository (not even copying any files with git annex yet) is unacceptably slow. I'm thinking I'll create a new git annex with just these new files.
That said, @anasrferreira appears to be working right now, so I'll wait to do this later. At this point, maybe I'll just do it at the meeting tomorrow.
@davclark and @cdschillaci I've changed newparse.cpp to strip out the path for the input file if an output path is given. The new xmltweet2.exe is compiled in my home folder (/home/xmltweet2.exe).
@anasrferreira please submit a pull request to @jcanny, per the above issue!
Sorry I was slow with this. I pushed a similar change to the master branch. It compiles, but I haven't had a chance to test it. Can you check it out?
-John
Folks, I'm volunteering at my kids' school today so I can't make the meeting. But I would recommend a very relevant talk by Dean Eckles in 330 Blum Hall from 3-4pm today. Dean is at Facebook and is doing both experimental and observational studies (the kind we want to do with existing data) on Facebook's datasets.
-John
@jcanny xmltweet seems to work.
Met today with @coryschillaci @anasrferreira @lambdaloop
@lambdaloop Will look at making sure that xmltweet parses closing tags properly
@anasrferreira Will work on turning parsed imat/sbmat files into feature vectors
Proposed Featurization
@coryschillaci Will investigate whether it's better to merge dictionaries before or after featurizing procedure, and implement something in the latter case.
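As a rough illustration of the featurizing step (names here are hypothetical; the real script works on the imat/sbmat output of xmltweet): each token is mapped through the dictionary to an index, and counts are accumulated into a sparse bag-of-words vector.

```python
def featurize(tokens, dictionary):
    """Bag of words: token -> dictionary index -> count.
    Tokens missing from the dictionary are dropped."""
    vec = {}
    for t in tokens:
        idx = dictionary.get(t)
        if idx is not None:
            vec[idx] = vec.get(idx, 0) + 1
    return vec  # sparse {feature_index: count}

vocab = {"happy": 0, "sad": 1, "tired": 2}
print(featurize(["happy", "happy", "tired", "unknown"], vocab))  # {0: 2, 2: 1}
```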
@helgammal Have you tried merging all of the dictionaries at once using this? I'm wondering if the full merged dictionary is too big to deal with in this way.
We can save some disk space and later processing time if we do the index update now with a trim. I noticed that the big dicts, for example, are about 80% one-time entries. The Dict class has a trim method:

```scala
def trim(thresh:Int):Dict = {
  val ii = find(counts >= thresh.toDouble)
  Dict(cstr(ii), DMat(counts(ii)))
}
```
We should discuss what the trim threshold needs to be, but at a first pass something around 100 on the full dataset is probably safe enough.
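One wrinkle worth noting: trimming compacts the index space, so any feature vectors built against the untrimmed dictionary need their indices remapped. A hedged Python sketch of that bookkeeping (the Scala trim above returns only the new Dict; the old-to-new map here is an addition for illustration):

```python
def trim_with_map(counts, thresh):
    """Drop entries below thresh; return trimmed counts plus a map
    from old index to new compacted index."""
    keep = [i for i, c in enumerate(counts) if c >= thresh]
    old_to_new = {old: new for new, old in enumerate(keep)}
    return [counts[i] for i in keep], old_to_new

counts = [500, 1, 120, 3, 99]
trimmed, remap = trim_with_map(counts, 100)
print(trimmed)  # [500, 120]
print(remap)    # {0: 0, 2: 1}
```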
@coryschillaci : I haven't actually tried merging all of the data but I think your concern is quite valid; another concern is whether keeping this in memory is a good idea or whether we'd need to write it to disk periodically. Trimming makes a lot of sense, too.
@anasrferreira has finished a Scala script to featurize one xml file at a time (dca678a478272e955b330ff60d332db489c774b7).
I updated the featurizer code to work with the new master dictionary in commit 91f398ba9340954e9988db28469e43ec1ef06775.
Next step is to modify code so that we can featurize all the data files. @anasrferreira @coryschillaci
So is this closeable?
Almost! I will close once I have a clean output.
Another to-do on this issue: organize the bag-of-words features by their frequency.
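Ordering features by frequency is just a permutation of the dictionary indices; a minimal sketch, assuming a token -> count mapping:

```python
def order_by_frequency(word_counts):
    """Assign index 0 to the most frequent token, 1 to the next, and so on."""
    ranked = sorted(word_counts, key=word_counts.get, reverse=True)
    return {tok: i for i, tok in enumerate(ranked)}

print(order_by_frequency({"the": 900, "mood": 40, "happy": 120}))
# {'the': 0, 'happy': 1, 'mood': 2}
```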
A dataset with moodid tags is now available in /var/local/destress/featurized/. The posts are saved in 320 batches of 100,000 (except the last batch, which is 70,384 posts).
@coryschillaci, it sounds like this might be a good first task for you to get us moving forward on our first actual ML tasks. But just say so if you're not interested.