BIDData / BIDMach

CPU and GPU-accelerated Machine Learning Library
BSD 3-Clause "New" or "Revised" License

GPU memory guidelines #69

Closed danielunderwood closed 8 years ago

danielunderwood commented 8 years ago

Are there any guidelines on the amount of GPU memory necessary? I am currently trying the example given in the quickstart with the rcv1 data set and running into an out-of-memory error. My GPU is an older GTX 570 with 1.25 GB of memory and the data set is ~0.8 GB. Are there any rules of thumb for estimating the largest data set I could process on this card, or more generally on a GPU with a given amount of memory? Also, can a single data set be spread across multiple GPUs?

jcanny commented 8 years ago

Yes, the basic model memory needed is numfeatures * dims * sizeof(coefficient), so for RCV1-V2 it's about 270k * 104 * 4 for float coeffs, or about 250 MB. There is an additional 250 MB for the gradient, and then 250 MB more if you're using ADAGRAD. So that's 750 MB. There may be another 250 MB in temporaries during ADAGRAD processing. That's cutting it very close. You can reduce the minibatch size to reduce the size of non-model temporary variables, but it's a lot to ask of that GPU. Newer Titans have 12 GB and far more power.
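
For back-of-the-envelope planning, the arithmetic above boils down to something like the following. This is only a sketch, not a BIDMach API; the 4x multiplier (model + gradient + ADAGRAD state + temporaries) is the rough accounting from the previous paragraph, and real usage also depends on minibatch size and whatever else is resident on the GPU:

    // Rough sizing sketch (REPL-style), not part of BIDMach.
    // One model-sized matrix is numfeatures * dims * sizeof(coefficient) bytes.
    def modelBytes(nfeatures: Long, dims: Long, bytesPerCoeff: Long = 4L): Long =
      nfeatures * dims * bytesPerCoeff

    // With ADAGRAD, plan for roughly four matrices of that size:
    // the model, the gradient, the ADAGRAD state, and temporaries.
    def adagradFootprintBytes(oneMatrixBytes: Long): Long = 4L * oneMatrixBytes

    // Using the ~250 MB per-matrix figure quoted above for RCV1-V2:
    adagradFootprintBytes(250L << 20) >> 20   // ~1000 MB peak, vs 1.25 GB on a GTX 570

Reducing the minibatch size (opts.batchSize on the learner options, if I remember the field name right) only shrinks the non-model temporaries, so it helps at the margin but doesn't change the model-sized matrices.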

danielunderwood commented 8 years ago

I'm messing around with it a bit more and I seem to be getting a couple of different errors. Using the basic classification tutorial, I get:

Error 1

corpus perplexity=5582.125391
java.lang.RuntimeException: GMult: CUDA kernel error in CUMAT.applyop
  at BIDMat.GMat.gOp(GMat.scala:750)
  at BIDMat.GMat.$times$at(GMat.scala:1157)
  at BIDMat.GMat.$times$at(GMat.scala:16)
  at BIDMach.updaters.ADAGrad$$anonfun$init$1.apply$mcVI$sp(ADAGrad.scala:34)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
  at BIDMach.updaters.ADAGrad.init(ADAGrad.scala:33)
  at BIDMach.Learner.init(Learner.scala:47)
  at BIDMach.Learner.train(Learner.scala:54)
  ... 33 elided

when calling mm.train for the first time. Subsequent calls will yield

Error 2

corpus perplexity=5582.125391
java.lang.RuntimeException: CUDA alloc failed out of memory
  at BIDMat.GMat$.apply(GMat.scala:1667)
  at BIDMat.GMat$.newOrCheckGMat(GMat.scala:2411)
  at BIDMat.GMat$.newOrCheckGMat(GMat.scala:2452)
  at BIDMat.GMat$.apply(GMat.scala:1673)
  at BIDMach.models.Model.convertMat(Model.scala:161)
  at BIDMach.models.GLM.init(GLM.scala:96)
  at BIDMach.Learner.init(Learner.scala:45)
  at BIDMach.Learner.train(Learner.scala:54)
  ... 33 elided

I then call resetGPU to clear the memory; calling mm.train after that gives the second error again, but another mm.train call then causes the first error followed by the second error.

As for memory usage: my GPU starts out using ~300 MB. The first call to mm.train raises this to ~900 MB and the second call raises it to ~1200 MB, where I get the CUDA allocation failure; usage no longer rises with subsequent calls. Calling resetGPU immediately lowers usage to ~400 MB. After the reset, the first call to mm.train raises usage to ~500 MB and then the CUDA alloc fails with out of memory. A second call to mm.train after the reset results in the CUDA kernel error and raises memory usage to ~950 MB. Another call then raises usage to ~1200 MB, where an out-of-memory error occurs again, and usage doesn't rise on subsequent calls.

I'm not sure why the memory allocation errors occur while there is still memory free, unless a larger chunk is being allocated than I expect.

Could this have anything to do with using a 500 series GPU rather than 600+ series?
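
For reference, here's roughly the sequence I'm running in the bidmach REPL to watch memory between attempts. GPUmem is the BIDMat helper that, as far as I can tell, reports how much device memory is free, and I'm assuming the standard imports that the bidmach script sets up:

    GPUmem      // device memory before training (~300MB already in use here)
    mm.train    // first attempt: the kernel error above; usage climbs to ~900MB
    GPUmem      // check how much the failed run left allocated
    resetGPU    // reset the device
    GPUmem      // usage drops to ~400MB, not quite back to the baseline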

acrossen commented 8 years ago

Similar question about the relationship between maximum LDA model size and GPU RAM requirements. I have a topic model I built using Y! LDA based on 1M documents (3M features) with k=1500. The most I seem to be able to achieve with BIDMach using my GTX 980 with 4 GB of RAM is k=500 before running out of memory.

In your LDA benchmarks against Y! LDA you mention a 2B-article data set with 256 topics (you didn't list the number of features, but the sample nytimes data has 102.6K). How can I train a model from billions of documents, millions of features, and tens of thousands of topics using GPUs? Do the memory requirements scale across GPUs, or must each individual NVIDIA card have enough RAM to fit the whole model?
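
To put numbers on it, here's the back-of-the-envelope I've been doing, reusing the numfeatures * dims * sizeof(coefficient) formula from earlier in the thread and assuming the topic matrix is stored as a dense nfeatures x k float matrix, which may well not match BIDMach's actual layout:

    // Rough LDA model sizing -- a sketch built on the dense-layout assumption
    // above, not anything from the BIDMach docs. Gradients and temporaries
    // would add further model-sized copies on top of this.
    def ldaModelGB(nfeatures: Long, k: Long, bytesPerCoeff: Long = 4L): Double =
      nfeatures * k * bytesPerCoeff / math.pow(1024, 3)

    // 3M features x 1500 topics: one dense copy alone would be ~16.8 GB, far
    // beyond a single 4 GB card -- hence the question about partitioning.
    ldaModelGB(3000000L, 1500L)   // ~16.8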

mcelvg commented 8 years ago

Experiencing similar errors running on Mac OS 10.9.5 with an NVIDIA GeForce GTX 675MX (1024 MB) and CUDA version 7.0.

Following BIDMach_basic_classification.ipynb, mm.train fails with:

[screenshot: attempt-1]

Subsequent attempts, after calling resetGPU; Mat.clearCaches and repeating all preceding steps in the notebook, result in:

[screenshot: attempt2]

I gradually increased the size of the training set and found the maximum input size to be ~200k features (using all categories).

Sometimes I can successfully train on 230k features, but not consistently.

[screenshot: 230k-feats-fail1]

[screenshot: 230k-feats-pass1]

On another attempt:

[screenshot: 230k-feats-pass2]

[screenshot: 230k-feats-fail2]

jcanny commented 8 years ago

It looks like your GPU is being used to drive the graphics display as well as for computing (the ~30% of memory in use before doing anything must be display memory). That reduces the memory available for compute and also makes the calculation more likely to fail during peaks in video memory use.

Re: much larger models, we're working on model partitioning across a cluster and should have something up in a couple of months.