BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data
ISC License

Memory problems in dataparser.scala #25

Closed: coryschillaci closed this issue 9 years ago

coryschillaci commented 9 years ago

In the last week before break @anasrferreira and @coryschillaci reprocessed the raw data into smaller chunks (max ~100Mb) and tokenized it without any of the <base64> stuff included (thanks to @lambdaloop).

The raw concatenated xml files are currently in /var/local/destress/combined2/events/. These are concatenated using the script /path/to/destress/clean_data/combine_data.sh. For some reason, a few of the biggest ones are over 100Mb, but I'm not sure the bug is worth chasing down. Here are the largest:

-rw-rw-r-- 1 schillaci destress  164M Mar 18 04:03 ma12.xml
-rw-rw-r-- 1 schillaci destress  198M Mar 18 07:29 vs.xml
-rw-rw-r-- 1 schillaci destress  201M Mar 18 02:16 go9.xml
-rw-rw-r-- 1 schillaci destress  203M Mar 18 07:24 va7.xml
-rw-rw-r-- 1 schillaci destress  216M Mar 18 06:15 sk6.xml
-rw-rw-r-- 1 schillaci destress  240M Mar 18 05:00 os.xml
-rw-rw-r-- 1 schillaci destress  286M Mar 18 05:47 ru6.xml
-rw-rw-r-- 1 schillaci destress  399M Mar 18 05:43 ro7.xml
-rw-rw-r-- 1 schillaci destress  401M Mar 18 05:38 re15.xml
-rw-rw-r-- 1 schillaci destress  574M Mar 18 01:01 cu2.xml

After tokenizing with /path/to/destress/process_data/tokenize.sh, the various output files are all stored in /var/local/destress/tokenized2/. The biggest IMat files after tokenizing are:

-rw-rw-r-- 1 schillaci destress   85M Mar 18 22:36 xo.xml.imat
-rw-rw-r-- 1 schillaci destress   87M Mar 18 21:46 ju.xml.imat
-rw-rw-r-- 1 schillaci destress   93M Mar 18 21:57 ma.xml.imat
-rw-rw-r-- 1 schillaci destress  101M Mar 18 22:22 sp7.xml.imat
-rw-rw-r-- 1 schillaci destress  161M Mar 18 22:33 vs.xml.imat
-rw-rw-r-- 1 schillaci destress  427M Mar 18 21:24 cu2.xml.imat

However, when @anasrferreira and I have tried running the dataparser.scala code to generate features from the tokenized data, Mercury runs out of memory and then of course slows way down. Does anybody have an idea of why this might be happening? I don't see why it should be using almost 16Gb of memory to process 100Mb or smaller files into 25Mb output chunks.

jcanny commented 9 years ago

Hi, I noticed in the code for dataparser that you're concatenating data to matrices on lines 102 and 113. If you think about it, what that's doing is taking whatever matrix you have for "labels" or "sBagOfWords", creating a new matrix one element longer, and then discarding the old matrix. There's no efficient way to do that (unlike, e.g., a C++ vector, which appends an element to the same object rather than creating a new one). That will take quadratic time and consume an enormous amount of memory. The simple way to do this with matrices is to make a matrix large enough to hold everything and maintain a pointer (say nextLabel) to the next "empty" location. When you're done filling it, just create a new matrix (once) as labels(0->nextLabel).
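
A minimal sketch of that preallocate-and-slice pattern in BIDMat-style Scala (maxLabels, addLabel, and the toy fill are made up for illustration, not taken from dataparser.scala):

import BIDMat.IMat
import BIDMat.MatFunctions._

val maxLabels = 1000000                // assumed upper bound on the number of labels
val labels = izeros(maxLabels, 1)      // allocate the full buffer once
var nextLabel = 0                      // index of the next "empty" slot

def addLabel(v: Int): Unit = {
  labels(nextLabel, 0) = v             // O(1) write, no copying or reallocation
  nextLabel += 1
}

addLabel(3); addLabel(1); addLabel(4)  // stand-in for the parsing loop

val used = labels(0 -> nextLabel, 0)   // one final slice of just the filled portion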

On line 97 you do a string comparison with the Dict lookup of an integer each time, instead of doing a lookup of "" once and then comparing two integers (constant time).
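
In sketch form (a plain Scala Map stands in for the BIDMat Dict here; the names are illustrative):

val dict = Map("" -> 0, "happy" -> 1, "sad" -> 2)  // word -> id
val inv  = dict.map(_.swap)                        // id -> word, the direction line 97 uses

val emptyId = dict("")                             // one string lookup, outside the loop
val wordIds = Seq(1, 0, 2, 0)

// slow, per element:  inv(id) == ""   (lookup plus string compare every time)
// fast, per element:  id == emptyId   (constant-time integer compare)
val nonEmpty = wordIds.count(_ != emptyId)         // == 2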

I would avoid ever concatenating sparse matrices in a loop. Instead, represent the sparse elements with three matrices: "rows" and "cols" (nx1 IMat) and "vals" (nx1 FMat) holding the values. Add to those matrices using the method above. When you're done, call sparse(rows(0->size), cols(0->size), vals(0->size), maxrows, maxcols).
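
A sketch of that triplet-buffer version (the sizes and addEntry are illustrative, and this assumes BIDMat's sparse(rows, cols, vals, nrows, ncols) builder):

import BIDMat.SMat
import BIDMat.MatFunctions._

val maxnnz  = 1000000      // assumed upper bound on nonzeros
val maxrows = 100000       // assumed dictionary size
val maxcols = 50000        // assumed number of posts
val rows = izeros(maxnnz, 1)
val cols = izeros(maxnnz, 1)
val vals = zeros(maxnnz, 1)
var size = 0

def addEntry(r: Int, c: Int, v: Float): Unit = {
  rows(size, 0) = r; cols(size, 0) = c; vals(size, 0) = v
  size += 1
}

addEntry(5, 0, 1f); addEntry(9, 1, 2f)   // stand-in for the parsing loop

// one sparse() call at the end replaces all the incremental concatenation
val sBagOfWords: SMat = sparse(rows(0 -> size, 0), cols(0 -> size, 0), vals(0 -> size, 0), maxrows, maxcols)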

With a bit of practice it should be possible to write that routine without looping over word locations (which allows it to run fast on a GPU). But I know that's a radically different way of thinking.
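
For a flavor of what that could look like (a guess, assuming the tokenizer can hand you all word ids and post ids as columns, and that sparse() sums duplicate triplets as in MATLAB):

import BIDMat.SMat
import BIDMat.MatFunctions._

// toy data: wordIds(i) is the dictionary id of token i, docIds(i) its post
val wordIds = icol(3, 1, 3, 7)
val docIds  = icol(0, 0, 1, 1)

// build the whole bag-of-words in one call, with no per-token loop
val bow: SMat = sparse(wordIds, docIds, ones(wordIds.nrows, 1), 10, 2)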

coryschillaci commented 9 years ago

I made all of the suggested changes, but was still running out of memory when running as a script. These changes were committed as part of c5dfb8ced5068b93351d2d18faa24618ab067a43.

After talking with @jcanny, we decided it would be best to try compiling the code to see if that would fix the problem. To that end, I made changes so that everything can be compiled and then run as functions from scripts. It took me some time to figure out a good way to compile on Mercury; with some guidance, I settled on the following procedure:

  1. Set up your own BIDMach on Mercury. This can be done by cloning the GitHub repository, using wget to acquire a compiled distribution, and then copying the lib/ folder contents from the compiled version into your GitHub clone. You can then delete the version not linked to GitHub.
  2. Create symlinks in BIDMach/src/main/scala/destress/ for the files in your destress repository that you want to compile, using ln -s /path/to/file /path/to/symlink. The location of the symlinks doesn't matter too much.
  3. Run sbt package from the main BIDMach folder. If it builds without errors, copy BIDMach/target/scala-<version>/bidmach-<version>.jar to BIDMach/BIDMach.jar.
  4. Now when you run BIDMach/bidmach the compiled stuff will be available after import.

As a concrete example, here is what I do to run the featurizer function:

schillaci@mercury:~/BIDMach$ sbt package
[info] Set current project to BIDMach (in build file:/home/schillaci/BIDMach/)
[warn] Credentials file /home/schillaci/.ivy2/.credentials does not exist
[success] Total time: 1 s, completed Apr 1, 2015 10:35:18 AM
schillaci@mercury:~/BIDMach$ sbt clean
[info] Set current project to BIDMach (in build file:/home/schillaci/BIDMach/)
[success] Total time: 0 s, completed Apr 1, 2015 10:35:27 AM
schillaci@mercury:~/BIDMach$ sbt package
[info] Set current project to BIDMach (in build file:/home/schillaci/BIDMach/)
[warn] Credentials file /home/schillaci/.ivy2/.credentials does not exist
[info] Updating {file:/home/schillaci/BIDMach/}bidmach...
[warn] Binary version (2.10) for dependency org.scala-lang#jline;2.10.3
[warn]  in edu.berkeley.bid#bidmach_2.11;1.0.1 differs from Scala binary version in project (2.11).
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Compiling 45 Scala sources and 8 Java sources to /home/schillaci/BIDMach/target/scala-2.11/classes...
[info] Packaging /home/schillaci/BIDMach/target/scala-2.11/bidmach_2.11-1.0.1.jar ...
[info] Done packaging.
[success] Total time: 27 s, completed Apr 1, 2015 10:36:01 AM
schillaci@mercury:~/BIDMach$ cp target/scala-2.11/bidmach_2.11-1.0.1.jar ./BIDMach.jar
schillaci@mercury:~/BIDMach$ ./bidmach
Loading /home/schillaci/BIDMach/lib/bidmach_init.scala...
import BIDMat.{CMat, CSMat, DMat, Dict, FMat, FND, GMat, GDMat, GIMat, GLMat, GSMat, GSDMat, HMat, IDict, Image, IMat, LMat, Mat, SMat, SBMat, SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{DNN, FM, GLM, KMeans, KMeansw, LDA, LDAgibbs, Model, NMF, SFA, RandomForest}
import BIDMach.datasources.{DataSource, MatDS, FilesDS, SFilesDS}
import BIDMach.mixins.{CosineSim, Perplexity, Top, L1Regularizer, L2Regularizer}
import BIDMach.updaters.{ADAGrad, Batch, BatchNorm, IncMult, IncNorm, Telescoping}
import BIDMach.causal.IPTW
Couldnt load JCuda

Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import featurizers._
import featurizers._

scala> featurizeMoodID("/var/local/destress/tokenized2/","/var/local/destress/tokenized2/","/var/local/destress/featurized/","/var/local/destress/tokenized2/fileList.txt")
Currently featurizing 00.xml
Featurized in 1.308s
Currently featurizing 01.xml
Featurized in 0.201s
Currently featurizing 02.xml
Featurized in 0.076s

I managed to run featurizeMoodID on all of the combined files. There are a few weird errors that I noticed:

coryschillaci commented 9 years ago

To get JCuda to work properly on Mercury, edit two lines in the ./bidmach script as follows:

export DYLD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda/lib:${DYLD_LIBRARY_PATH}" ---> export DYLD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda-6.5/lib:${DYLD_LIBRARY_PATH}"

export LD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda/lib:${LD_LIBRARY_PATH}" ---> export LD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda-6.5/lib64:${LD_LIBRARY_PATH}"