BIDS-collaborative / destress

Helping @peparedes with text analysis of LiveJournal data
ISC License

Run xmltweet on the data set #15

Closed by cdschillaci 9 years ago

cdschillaci commented 9 years ago

@davclark Do we have disk space to run the tokenizer at this point? I uploaded a script into process_data/tokenize_files.sh which can be run with minimal changes (just select the input and output folders).

davclark commented 9 years ago

What's the ratio of input size to output size?

(Email reply to cdschillaci's comment, Mar 3, 2015, 5:51 PM.)

cdschillaci commented 9 years ago

It's pretty close to 1:1 actually, so definitely not enough space. Here's an example.

Raw:

-rw-rw-r-- 1 schillaci schillaci 1.8G Mar 3 19:37 ba.xml

Tokenized:

-rw-rw-r-- 1 schillaci schillaci 1.6G Mar 3 19:38 ba.xml.imat
-rw-rw-r-- 1 schillaci schillaci 27M Mar 3 19:38 ba_dict.imat
-rw-rw-r-- 1 schillaci schillaci 134M Mar 3 19:38 ba_dict.sbmat
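To put a number on "pretty close to 1:1", here is a minimal Python sketch that adds up the three tokenized files and divides by the raw file size. The sizes are copied from the listing above; treating the `G` and `M` suffixes as binary units (GiB, MiB) is an assumption, and `parse_size` is a hypothetical helper written for this example only.

```python
# Multipliers for human-readable size suffixes (assumed binary, as `ls -lh` reports them).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size string such as '1.8G' into a number of bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


# Raw input file from the listing: ba.xml
raw = parse_size("1.8G")

# Tokenized outputs from the listing: ba.xml.imat, ba_dict.imat, ba_dict.sbmat
tokenized = sum(parse_size(s) for s in ("1.6G", "27M", "134M"))

ratio = tokenized / raw
print(f"output/input ratio: {ratio:.2f}")  # prints 0.98
```

Since `ls` rounds the sizes to two significant digits, the result is only a rough estimate, but it supports planning for about as much free space as the raw data occupies.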

davclark commented 9 years ago

Yeah - we could do a subset, but we are definitely lean on space. There is a 3TB drive coming via FedEx...

D

(Email reply to cdschillaci's comment, Tue, Mar 3, 2015, 9:01 PM.)

anasrferreira commented 9 years ago

@davclark The new xmltweet.exe is now compiled and moved to /var/local/destress/scripts/.

@cdschillaci tokenize_files.sh has been updated with the new xmltweet.exe path.

cdschillaci commented 9 years ago

@anasrferreira Is this your version or the new official BIDMach version?

anasrferreira commented 9 years ago

This is from the most recent BIDMach pull onto my mercury account. BIDMach doesn't come with xmltweet.exe. It needs to be compiled.

peparedes commented 9 years ago

Actually I think it does... I think it is in BIDMach/bin

P

(Email reply to anasrferreira's comment, Wed, Mar 4, 2015, 5:09 PM.)

davclark commented 9 years ago

I announced this on Slack - but just so we're clear, there's 2TB now on /var/local
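Before kicking off a long run, a quick check along these lines could confirm the output will fit. This is a rough Python sketch, not part of tokenize_files.sh: `has_room`, the default ratio of 1.0 (taken from the roughly 1:1 estimate earlier in the thread), and the 10% headroom are all assumptions for illustration.

```python
import shutil


def has_room(input_bytes, target_dir, ratio=1.0, headroom=0.10):
    """Return True if the filesystem holding target_dir can fit the estimated output.

    ratio is the assumed output/input size ratio; headroom is an extra safety margin.
    """
    needed = input_bytes * ratio * (1 + headroom)
    free = shutil.disk_usage(target_dir).free
    return needed <= free


if __name__ == "__main__":
    # Example: the single 1.8G raw file mentioned in the thread, checked against
    # the current directory. On the server, target_dir would be the output folder.
    one_file = 1.8 * 1024**3
    print("enough space:", has_room(one_file, "."))
```

The total size of the raw data set is not stated in this thread, so `input_bytes` would have to be measured first, for example with `du -sb` on the input folder.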

coryschillaci commented 9 years ago

Running xmltweet on everything now.

davclark commented 9 years ago

Whee!

(Email reply to coryschillaci's comment, Mon, Mar 9, 2015, 9:13 PM.)

coryschillaci commented 9 years ago

Ran in 115 minutes. The dictionaries etc. have been output to /var/local/destress/tokenized

peparedes commented 9 years ago

Awesome!

P

(Email reply to coryschillaci's comment, Mar 9, 2015, 11:44 PM.)