Closed by cdschillaci 9 years ago
What's the ratio of input size to output size?
Reply to this email directly or view it on GitHub https://github.com/berkeley-dsc/destress/issues/15.
It's pretty close to 1:1 actually, so I guess definitely not enough space. Here's an example.
Raw:
-rw-rw-r-- 1 schillaci schillaci 1.8G Mar 3 19:37 ba.xml
Tokenized:
-rw-rw-r-- 1 schillaci schillaci 1.6G Mar 3 19:38 ba.xml.imat
-rw-rw-r-- 1 schillaci schillaci 27M Mar 3 19:38 ba_dict.imat
-rw-rw-r-- 1 schillaci schillaci 134M Mar 3 19:38 ba_dict.sbmat
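For reference, the "close to 1:1" estimate can be checked from the listing itself. This is only a rough sketch: it assumes the `G` and `M` suffixes are binary units as printed by `ls -lh`, and `parse_size` is a helper made up for this example, not part of any script in the repo.

```python
# Sizes copied from the listing above; G and M are assumed to be binary units.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(text):
    """Convert an `ls -lh` style size such as '1.8G' into bytes."""
    return float(text[:-1]) * UNITS[text[-1]]

raw = parse_size("1.8G")  # ba.xml
# ba.xml.imat + ba_dict.imat + ba_dict.sbmat
tokenized = sum(parse_size(s) for s in ("1.6G", "27M", "134M"))

ratio = tokenized / raw
print(f"output/input = {ratio:.2f}")
```

With the rounded sizes shown by `ls`, the ratio comes out just under 1, so the tokenized output needs about as much space as the raw input.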
Yeah - we could do a subset, but we are definitely lean on space. There is a 3TB drive coming via FedEx...
D
@davclark The new xmltweet.exe is now compiled and moved to /var/local/destress/scripts/
@cdschillaci tokenize_files.sh has been updated with the new xmltweet.exe path.
@anasrferreira Is this your version or the new official BIDMach version?
This is from the most recent BIDMach pull onto my mercury account. BIDMach doesn't come with xmltweet.exe. It needs to be compiled.
Actually I think it does... I think it is in BIDMach/bin
P
I announced this on Slack - but just so we're clear, there's 2TB now on /var/local
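If anyone wants a quick sanity check before kicking off a full run, something along these lines would do. It is a sketch, not a script from the repo: `dir_size` and `fits` are hypothetical helpers, the 1:1 ratio comes from the example earlier in the thread, and the 10% headroom is an arbitrary assumption.

```python
import os
import shutil

def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

def fits(input_dir, output_dir, ratio=1.0, headroom=0.1):
    """Return True if the estimated output fits on output_dir's filesystem.

    ratio is the assumed output/input size ratio; headroom is the
    fraction of free space to leave unused (both are assumptions).
    """
    needed = dir_size(input_dir) * ratio
    free = shutil.disk_usage(output_dir).free
    return needed <= free * (1 - headroom)
```

Usage would look like `fits(raw_dir, "/var/local/destress/tokenized")`, where `raw_dir` stands for wherever the raw XML lives.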
Running xmltweet on everything now.
Whee!
Ran in 115 minutes. The dictionaries etc. have been output to /var/local/destress/tokenized
Awesome!
P
@davclark Do we have disk space to run the tokenizer at this point? I uploaded a script into process_data/tokenize_files.sh which can be run with minimal changes (just select the input and output folders).