Closed coryschillaci closed 9 years ago
Hi, I noticed in the code for dataparser that you're concatenating data to matrices on lines 102 and 113. If you think about it, what that's doing is taking whatever matrix you have for "labels" or "sBagOfWords", creating a new matrix one element longer, and then discarding the old matrix. There's no efficient way to do that (a C++ vector, by contrast, appends to the same object rather than returning a new one). That will take quadratic time and consume an enormous amount of memory. The simple way to do this with matrices is to make a matrix large enough to hold everything and maintain a pointer (say nextLabel) to the next "empty" location. When you're done filling it, just create a new matrix (once) as `labels(0->nextLabel)`.
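The two patterns can be sketched with plain Scala arrays standing in for BIDMat matrices (the names `growByCopy`, `preallocate`, and `nextLabel` are illustrative, not from the actual code):

```scala
// Quadratic pattern: each "append" allocates a new array and copies
// everything accumulated so far, discarding the old array.
def growByCopy(items: Seq[Float]): Array[Float] = {
  var labels = Array[Float]()
  for (x <- items) labels = labels :+ x // new array every iteration
  labels
}

// Linear pattern: preallocate once, track the next empty slot,
// then trim with a single final copy, like labels(0->nextLabel).
def preallocate(items: Seq[Float], maxSize: Int): Array[Float] = {
  val labels = new Array[Float](maxSize)
  var nextLabel = 0
  for (x <- items) { labels(nextLabel) = x; nextLabel += 1 }
  labels.slice(0, nextLabel)
}
```

In BIDMat the same idea would preallocate one FMat of the maximum size and take the `labels(0->nextLabel)` slice once at the end.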
On line 97 you do a string comparison with the Dict lookup of an integer each time, instead of doing the lookup once outside the loop and comparing integers.
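A sketch of hoisting that lookup out of the loop (the dictionary contents and the key `"current"` are made-up stand-ins, since the actual string is elided above):

```scala
// Hypothetical stand-in for the real Dict: integer codes -> strings.
val dict = Map(0 -> "happy", 1 -> "current", 2 -> "sad")

// Before: a dictionary lookup plus a string comparison on every iteration.
def countSlow(codes: Seq[Int], key: String): Int =
  codes.count(c => dict(c) == key)

// After: resolve the key to its integer id once, then compare integers.
def countFast(codes: Seq[Int], key: String): Int = {
  val keyId = dict.collectFirst { case (id, v) if v == key => id }.get
  codes.count(_ == keyId)
}
```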
I would avoid ever concatenating sparse matrices in a loop. Instead, represent the sparse elements with three matrices: "rows" and "cols" (nx1 IMat) and "vals" (nx1 FMat) holding the values. Add to those matrices using the method above. When you're done, call sparse(rows(0->size), cols(0->size), vals(0->size), maxrows, maxcols).
With a bit of practice it should be possible to write that routine without looping over word locations (which allows it to run fast on a GPU). But I know that's a radically different way of thinking.
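A minimal sketch of the triplet-buffer idea in plain Scala (the `TripletBuffer` class is hypothetical; `toDense` stands in for the final `sparse(...)` call just so the result can be checked):

```scala
// Preallocated triplet buffers (rows, cols, vals) sharing one fill pointer.
class TripletBuffer(capacity: Int) {
  val rows = new Array[Int](capacity)
  val cols = new Array[Int](capacity)
  val vals = new Array[Float](capacity)
  var size = 0

  // O(1) append: no matrix concatenation, just fill the next slot.
  def add(r: Int, c: Int, v: Float): Unit = {
    rows(size) = r; cols(size) = c; vals(size) = v; size += 1
  }

  // Stand-in for sparse(rows(0->size), cols(0->size), vals(0->size), nr, nc):
  // materialize once at the end, here as a dense array for checking.
  def toDense(nr: Int, nc: Int): Array[Array[Float]] = {
    val m = Array.fill(nr)(new Array[Float](nc))
    for (i <- 0 until size) m(rows(i))(cols(i)) += vals(i)
    m
  }
}
```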
I made all of the suggested changes, but was still running out of memory when running as a script. These changes were committed as part of c5dfb8ced5068b93351d2d18faa24618ab067a43
After talking with @jcanny, we decided it would be best to try compiling the code to see if that would fix the problem. As such, I made changes so that everything can be compiled and then run as functions from scripts. It took me some time to figure out a good way to compile on Mercury; with some guidance, I settled on the following procedure:

1. Use `wget` to acquire a compiled distribution, then copy the `lib/` folder contents from the compiled version into your github version. You can now delete the version not linked to github.
2. In `BIDMach/src/main/scala/destress/`, create symlinks for the files in your destress repository that you want to compile, using `ln -s /path/to/file /path/to/symlink`. The location of the symlinks doesn't matter too much.
3. Run `sbt package` from the main BIDMach folder. If it builds without errors, copy `BIDMach/target/scala-<version>/bidmach-<version>.jar` to `BIDMach/BIDMach.jar`.
4. Launch `BIDMach/bidmach`; the compiled stuff will be available after import.

As a concrete example, here is what I do to run the featurizer function:
```
schillaci@mercury:~/BIDMach$ sbt package
[info] Set current project to BIDMach (in build file:/home/schillaci/BIDMach/)
[warn] Credentials file /home/schillaci/.ivy2/.credentials does not exist
[success] Total time: 1 s, completed Apr 1, 2015 10:35:18 AM
schillaci@mercury:~/BIDMach$ sbt clean
[info] Set current project to BIDMach (in build file:/home/schillaci/BIDMach/)
[success] Total time: 0 s, completed Apr 1, 2015 10:35:27 AM
schillaci@mercury:~/BIDMach$ sbt package
[info] Set current project to BIDMach (in build file:/home/schillaci/BIDMach/)
[warn] Credentials file /home/schillaci/.ivy2/.credentials does not exist
[info] Updating {file:/home/schillaci/BIDMach/}bidmach...
[warn] Binary version (2.10) for dependency org.scala-lang#jline;2.10.3
[warn] in edu.berkeley.bid#bidmach_2.11;1.0.1 differs from Scala binary version in project (2.11).
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Compiling 45 Scala sources and 8 Java sources to /home/schillaci/BIDMach/target/scala-2.11/classes...
[info] Packaging /home/schillaci/BIDMach/target/scala-2.11/bidmach_2.11-1.0.1.jar ...
[info] Done packaging.
[success] Total time: 27 s, completed Apr 1, 2015 10:36:01 AM
schillaci@mercury:~/BIDMach$ cp target/scala-2.11/bidmach_2.11-1.0.1.jar ./BIDMach.jar
schillaci@mercury:~/BIDMach$ ./bidmach
Loading /home/schillaci/BIDMach/lib/bidmach_init.scala...
import BIDMat.{CMat, CSMat, DMat, Dict, FMat, FND, GMat, GDMat, GIMat, GLMat, GSMat, GSDMat, HMat, IDict, Image, IMat, LMat, Mat, SMat, SBMat, SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{DNN, FM, GLM, KMeans, KMeansw, LDA, LDAgibbs, Model, NMF, SFA, RandomForest}
import BIDMach.datasources.{DataSource, MatDS, FilesDS, SFilesDS}
import BIDMach.mixins.{CosineSim, Perplexity, Top, L1Regularizer, L2Regularizer}
import BIDMach.updaters.{ADAGrad, Batch, BatchNorm, IncMult, IncNorm, Telescoping}
import BIDMach.causal.IPTW
Couldnt load JCuda
Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import featurizers._
import featurizers._

scala> featurizeMoodID("/var/local/destress/tokenized2/","/var/local/destress/tokenized2/","/var/local/destress/featurized/","/var/local/destress/tokenized2/fileList.txt")
Currently featurizing 00.xml
Featurized in 1.308s
Currently featurizing 01.xml
Featurized in 0.201s
Currently featurizing 02.xml
Featurized in 0.076s
```
I managed to run the featurizeMoodID on all of the combined files. There are a few weird errors that I noticed:

- At least one combined file was missing the closing `</posts>` tag. This needs to be looked into. For now, I just manually added the missing closing tag in the appropriate concatenated file.
- `/var/local/destress/lj-annex/data/events/he/` contains some weird stuff. In 325fe3e161b41514851b14b5fdd34742352cbf99 I changed the combine_data.sh script to only concatenate `*.xml` files, which makes sure these don't get concatenated in with the good stuff.
- Files missing `</posts>` at the end: `heatherbell.xml~` and `helenabucket.xml~`.
- Stray files: `simplelogger-0-0.log` and `simplelogger-0-0.log.clk`.
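For future runs, a quick check for files that lack the closing tag could look something like this (hypothetical helper in plain Scala, not part of the repository):

```scala
import scala.io.Source

// True if the file's last non-blank line is the closing </posts> tag.
def endsWithPostsTag(path: String): Boolean = {
  val src = Source.fromFile(path)
  try {
    val lines = src.getLines().map(_.trim).filter(_.nonEmpty).toList
    lines.lastOption.contains("</posts>")
  } finally src.close()
}
```

Running it over the concatenated directory before featurizing would flag the files that need a manual fix.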
To get jcuda to work properly on Mercury, edit two lines in the `./bidmach` script as follows:

```
export DYLD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda/lib:${DYLD_LIBRARY_PATH}"
---> export DYLD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda-6.5/lib:${DYLD_LIBRARY_PATH}"

export LD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda/lib:${LD_LIBRARY_PATH}"
---> export LD_LIBRARY_PATH="${LIBDIR}:/usr/local/cuda-6.5/lib64:${LD_LIBRARY_PATH}"
```
In the last week before break @anasrferreira and @coryschillaci reprocessed the raw data into smaller chunks (max ~100Mb), tokenized without any of the `<base64>` stuff included (thanks to @lambdaloop).

The raw concatenated xml files are currently in `/var/local/destress/combined2/events/`. These are concatenated using the script `/path/to/destress/clean_data/combine_data.sh`. For some reason, a few of the biggest ones are over 100Mb, but I'm not sure the bug is worth chasing down. Here are the largest:

After tokenizing with `/path/to/destress/process_data/tokenize.sh`, the various output files are all stored in `/var/local/destress/tokenized2/`. The biggest IMat files after tokenizing are:

However, when @anasrferreira and I have tried running the `dataparser.scala` code to generate features from the tokenized data, Mercury runs out of memory and then of course slows way down. Does anybody have an idea of why this might be happening? I don't see why it should be using almost 16Gb of memory to process 100Mb or smaller files into 25Mb output chunks.
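One plausible culprit is any remaining grow-by-concatenation: appending one element at a time copies quadratically many elements in total, so memory use scales with the square of the output size, not the input size. A rough sketch of the arithmetic (illustrative numbers only, not measurements):

```scala
// Growing a matrix by concatenating one element at a time copies
// 1 + 2 + ... + n = n(n+1)/2 elements overall.
def elementsCopied(n: Long): Long = n * (n + 1) / 2

// e.g. building a 1M-element vector this way copies ~5e11 elements;
// even transient 4-byte copies at that scale, plus the GC churn from
// discarding every intermediate matrix, can dwarf a 100Mb input.
val copied = elementsCopied(1000000L)
```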