BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Merge dictionaries after xmltweet #21

Closed coryschillaci closed 9 years ago

coryschillaci commented 9 years ago

Splitting this out from #3.

For starters, let's just merge the dictionaries and trim them whenever the number of entries becomes too large (@jcanny suggested 1 million).
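
A minimal sketch of the merge-and-trim idea, using plain Scala maps rather than the BIDMat dictionary types the real code operates on (the function name and the keep-most-frequent trimming rule are illustrative assumptions; the 1,000,000 cap is @jcanny's suggestion above):

// Hypothetical sketch, not the project's implementation.
def mergeAndTrim(dicts: Seq[Map[String, Long]],
                 maxEntries: Int = 1000000): Map[String, Long] = {
  val merged = scala.collection.mutable.HashMap.empty[String, Long]
  // Sum the counts for each word across all per-file dictionaries.
  for (d <- dicts; (word, count) <- d)
    merged(word) = merged.getOrElse(word, 0L) + count
  if (merged.size <= maxEntries) merged.toMap
  else merged.toSeq.sortBy(-_._2).take(maxEntries).toMap // keep the most frequent
}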

coryschillaci commented 9 years ago

Step 2 will be to deal with the noise from the <base64> stuff. @lambdaloop may solve this with #20 so that we don't need to worry about it downstream of xmltweet.
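
If #20 doesn't remove it at the source, one crude downstream option is to filter base64-looking tokens; a hypothetical sketch (the regex and length cutoff are guesses, not project code):

// Drop long tokens made up entirely of base64 characters, with optional '=' padding.
val base64Like = "^[A-Za-z0-9+/]{20,}={0,2}$".r
def looksLikeBase64(token: String): Boolean =
  base64Like.pattern.matcher(token).matches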

coryschillaci commented 9 years ago

Made a first stab at merging and trimming dictionaries in the commits leading up to 16c41816cdd98d993ecc2cbdc1335a041d49c1b4.

To run on the full data set, use:

/path/to/bidmach utils.scala   # launch the BIDMach shell with utils.scala loaded
import utils._                 // then, at the scala> prompt
combine_dicts("/var/local/destress/tokenized/fileList.txt","/var/local/destress/tokenized/")
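
For reference, this is roughly the shape of the loop inside combine_dicts, reconstructed from the stack trace below; it is a hypothetical sketch, and only loadIMat (from BIDMat.MatFunctions, visible in the trace) is known API here:

import BIDMat.MatFunctions._

def combineDictsSketch(fileList: String, dir: String): Unit = {
  val names = scala.io.Source.fromFile(fileList).getLines.toList
  for (name <- names) {
    println("Processing tokenized files from " + name)
    val counts = loadIMat(dir + name + ".imat") // this is the load that throws below
    // ... merge this file's dictionary into the running master, trimming when too large
  }
}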

For some reason it breaks when processing al.xml:

Processing tokenized files from aj.xml
Processing tokenized files from ak.xml
Processing tokenized files from al.xml
java.lang.ArrayIndexOutOfBoundsException
  at BIDMat.HMat$.readSomeInts(HMat.scala:128)
  at BIDMat.HMat$.loadIMat(HMat.scala:338)
  at BIDMat.HMat$.loadIMat(HMat.scala:344)
  at BIDMat.MatFunctions$.loadIMat(MatFunctions.scala:1635)
  at utils$$anonfun$combine_dicts$1.apply(<console>:116)
  at utils$$anonfun$combine_dicts$1.apply(<console>:112)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at utils$.combine_dicts(<console>:112)
  ... 33 elided
coryschillaci commented 9 years ago

The error is independent of how many files are processed beforehand (it occurs even when the file is the first one processed). I've found it also fails on an.xml and ch.xml.

coryschillaci commented 9 years ago

Maybe we can't load sbmats this large?

-rw-rw-r-- 1 schillaci destress 198M Mar  9 21:12 an_dict.sbmat
-rw-rw-r-- 1 schillaci destress 176M Mar  9 21:10 al_dict.sbmat
-rw-rw-r-- 1 schillaci destress  96M Mar  9 21:13 ar_dict.sbmat
-rw-rw-r-- 1 schillaci destress  64M Mar  9 21:13 as_dict.sbmat

Except that it works on ca.xml, which is almost as big as ch.xml:

-rw-rw-r-- 1 schillaci destress 188M Mar  9 21:24 ch_dict.sbmat
-rw-rw-r-- 1 schillaci destress 187M Mar  9 21:22 ca_dict.sbmat
-rw-rw-r-- 1 schillaci destress 129M Mar  9 21:26 co_dict.sbmat
-rw-rw-r-- 1 schillaci destress 103M Mar  9 21:27 cr_dict.sbmat
coryschillaci commented 9 years ago

Looks like there is some problem with the .xml.imat files:

scala> var temp = loadIMat("/var/local/destress/tokenized/ch.xml.imat")
java.lang.ArrayIndexOutOfBoundsException
  at java.lang.System.arraycopy(Native Method)
  at BIDMat.HMat$.readSomeInts(HMat.scala:128)
  at BIDMat.HMat$.loadIMat(HMat.scala:338)
  at BIDMat.HMat$.loadIMat(HMat.scala:344)
  at BIDMat.MatFunctions$.loadIMat(MatFunctions.scala:1635)
  ... 33 elided

The ones I can't load are all the smallest .imat files:

-rw-rw-r-- 1 schillaci destress    16 Mar  9 21:10 al.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 21:12 an.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 21:24 ch.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 21:29 da.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:02 la.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:05 li.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:09 ma.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:12 mi.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:34 sa.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:37 sh.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:44 st.xml.imat
-rw-rw-r-- 1 schillaci destress    16 Mar  9 22:49 th.xml.imat
coryschillaci commented 9 years ago

The dictionaries have been merged, keeping the total number of entries below 10^6. This required a trim threshold of 168. The counts and strings are now in /var/local/destress/tokenized as masterDict.dmat and masterDict.sbmat, respectively.
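
For context, a trim threshold like 168 is just the smallest count cutoff that brings the vocabulary under the cap; a brute-force way to find it (hypothetical helper, not from utils.scala):

// Smallest t such that fewer than maxEntries words have count >= t.
def findTrimThreshold(counts: Seq[Long], maxEntries: Int = 1000000): Long =
  Iterator.from(1).map(_.toLong)
    .find(t => counts.count(_ >= t) < maxEntries)
    .get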

jcanny commented 9 years ago

Those must be empty matrices (there are 4 words of metadata, i.e. 16 bytes). I'll change the reader so it can handle those.

-John
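
A quick way to check this from the REPL (hypothetical snippet; the four-word header layout comes from the comment above, and little-endian byte order is an assumption):

import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Paths}

// Dump the four 32-bit metadata words of an .imat file; for an empty
// matrix the dimension fields should be zero.
def dumpImatHeader(path: String): Unit = {
  val bytes = Files.readAllBytes(Paths.get(path))
  val buf = ByteBuffer.wrap(bytes, 0, 16).order(ByteOrder.LITTLE_ENDIAN)
  val words = Array.fill(4)(buf.getInt)
  println(path + " header: " + words.mkString(", ") + " (" + bytes.length + " bytes total)")
}

dumpImatHeader("/var/local/destress/tokenized/ch.xml.imat")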

jcanny commented 9 years ago

Actually, it doesn't look like a zero-file-size problem. Should those files be empty, or should they be very large? The tokenizer uses normal ints (32 bits) for indexing, so it has a maximum output size of 2 GB, and Java has the same limitation. Should those files have been larger than that? If so, they would need to be split into smaller files.

-John

coryschillaci commented 9 years ago

I think it's the second case: the original files were too big. @lambdaloop, do you have time to give us smaller concatenated files (in two pieces) for the cases listed above? If not, @anasrferreira or I can do it when we have time.

coryschillaci commented 9 years ago

Rerunning the concatenation with a maximum chunk size of 100 MB. The concatenation script was changed in d614050c1105728e57d66c610206e2239993da9f.
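
The gist of the chunking, as a simplified sketch (the real script is in the commit above and presumably splits at record boundaries rather than raw bytes, since a byte-level split would break the XML; the function and file-naming scheme here are made up):

import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}

// Copy src into numbered chunk files of at most maxBytes each.
def splitIntoChunks(src: String, dstPrefix: String,
                    maxBytes: Long = 100L * 1024 * 1024): Unit = {
  val in = new BufferedInputStream(new FileInputStream(src))
  val buf = new Array[Byte](1 << 16)
  var part = 0
  var written = 0L
  var out = new BufferedOutputStream(new FileOutputStream(f"${dstPrefix}_$part%03d"))
  var n = in.read(buf)
  while (n > 0) {
    if (written + n > maxBytes) { // start a new chunk once the current one is full
      out.close(); part += 1; written = 0L
      out = new BufferedOutputStream(new FileOutputStream(f"${dstPrefix}_$part%03d"))
    }
    out.write(buf, 0, n)
    written += n
    n = in.read(buf)
  }
  out.close()
  in.close()
}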