Closed danielunderwood closed 8 years ago
Yes, the basic model memory needed is numfeatures * dims * sizeof(coefficient), so for RCV1-V2 it's about 270k * 104 * 4 for float coefficients, or about 250 MB. There is an additional 250 MB for the gradient, and then 250 MB more if you're using ADAGRAD. So that's 750 MB. There may be another 250 MB in tmp during ADAGRAD processing. That's cutting it very close. You can reduce the minibatch size to reduce the size of non-model temporary variables, but it's a lot to ask from that GPU. Newer Titans have 12 GB and far more power.
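The arithmetic above can be mechanized in a small sketch. This is illustrative Scala, not part of the BIDMach API; the object and method names are made up, and the buffer counts (model + gradient + two ADAGRAD buffers) are assumptions taken from the description above:

```scala
// Back-of-envelope GPU memory estimate for a dense GLM model, following the
// formula above: numfeatures * dims * sizeof(coefficient) per buffer, with
// one buffer each for the model and the gradient, plus (optionally) the
// ADAGRAD accumulators and a temporary of the same size during processing.
// Illustrative only -- names and buffer counts are assumptions, not BIDMach API.
object GlmMemoryEstimate {
  def bufferBytes(numFeatures: Long, dims: Long, bytesPerCoeff: Long = 4L): Long =
    numFeatures * dims * bytesPerCoeff

  def peakBytes(numFeatures: Long, dims: Long, useADAGrad: Boolean): Long = {
    val model = bufferBytes(numFeatures, dims)
    val gradient = model                              // same shape as the model
    val adagrad = if (useADAGrad) 2L * model else 0L  // accumulators + temporary
    model + gradient + adagrad
  }

  def main(args: Array[String]): Unit = {
    // RCV1-V2 as described above: ~270k features, 104 categories, float coefficients
    val gb = peakBytes(270000L, 104L, useADAGrad = true).toDouble / (1L << 30)
    println(f"Estimated peak model-related GPU memory: $gb%.2f GB")
  }
}
```

With the rounded per-buffer figure of ~250 MB used above, the four buffers come to roughly 1 GB at peak, which is why a 1.25 GB card is marginal.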
I'm messing around with it a bit more and I seem to be getting a couple different errors. Using the basic classification tutorial, I get
Error 1
corpus perplexity=5582.125391
java.lang.RuntimeException: GMult: CUDA kernel error in CUMAT.applyop
at BIDMat.GMat.gOp(GMat.scala:750)
at BIDMat.GMat.$times$at(GMat.scala:1157)
at BIDMat.GMat.$times$at(GMat.scala:16)
at BIDMach.updaters.ADAGrad$$anonfun$init$1.apply$mcVI$sp(ADAGrad.scala:34)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at BIDMach.updaters.ADAGrad.init(ADAGrad.scala:33)
at BIDMach.Learner.init(Learner.scala:47)
at BIDMach.Learner.train(Learner.scala:54)
... 33 elided
when calling `mm.train` for the first time. Subsequent calls will yield
Error 2
corpus perplexity=5582.125391
java.lang.RuntimeException: CUDA alloc failed out of memory
at BIDMat.GMat$.apply(GMat.scala:1667)
at BIDMat.GMat$.newOrCheckGMat(GMat.scala:2411)
at BIDMat.GMat$.newOrCheckGMat(GMat.scala:2452)
at BIDMat.GMat$.apply(GMat.scala:1673)
at BIDMach.models.Model.convertMat(Model.scala:161)
at BIDMach.models.GLM.init(GLM.scala:96)
at BIDMach.Learner.init(Learner.scala:45)
at BIDMach.Learner.train(Learner.scala:54)
... 33 elided
I then call `resetGPU` to clear the memory and call `mm.train`, where I get the second error again; another `mm.train` call will then cause the first error followed by the second error.
As far as memory usage, my GPU starts out using ~300 MB. The first call to `mm.train` raises this to ~900 MB and the second call raises it to ~1200 MB, where I get the CUDA allocation failure; usage no longer rises with subsequent calls. The call to `resetGPU` immediately lowers usage to ~400 MB. The first call to `mm.train` then raises memory usage to ~500 MB, and the CUDA alloc fails due to being out of memory. A second call to `mm.train` after resetting the GPU results in the CUDA kernel error and raises memory usage to ~950 MB. Another call then raises usage to ~1200 MB, where an out-of-memory error occurs again and usage doesn't rise on subsequent calls.
I'm not sure what's going on with the memory allocation errors while there is still memory remaining, unless a larger chunk is being allocated than expected.
Could this have anything to do with using a 500 series GPU rather than 600+ series?
Similar question around the relationship between max LDA model size and GPU RAM requirements. I have a topic model I built using Y! LDA based on 1M documents (3M features) with k=1500. The most I seem to be able to achieve with BIDMach using my GTX-980 w/4GB RAM is k=500 before running out of RAM.
In your LDA benchmarks vs Y! you talk about a 2B article data set w/256 topics (you didn't list the # of features, but the sample nytimes data has 102.6K). How can I train a model from billions of documents/millions of features and tens of thousands of topics using GPU(s)? Do the memory requirements scale across GPUs or must each single NVIDIA card have the minimum RAM to fit the whole model?
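For rough sizing, the dominant LDA term is the topic-word matrix, numFeatures * numTopics floats, usually with at least one extra buffer of the same shape for updates. A hedged sketch of that estimate (illustrative Scala, not BIDMach API; the factor of two for update buffers is an assumption):

```scala
// Rough single-GPU sizing for an LDA topic-word model: the dominant cost is
// a numFeatures x numTopics float matrix, and training typically needs at
// least one more buffer of the same shape for updates. Illustrative only;
// the buffer count is an assumption, not a statement about BIDMach internals.
object LdaModelSize {
  def modelGB(numFeatures: Long, numTopics: Long, buffers: Int = 2): Double =
    numFeatures.toDouble * numTopics * 4 * buffers / (1L << 30)

  def main(args: Array[String]): Unit = {
    // 3M features at k=1500, as in the Y! LDA model described above
    println(f"3M features,    k=1500: ${modelGB(3000000L, 1500L)}%.1f GB")
    // the sample nytimes data: 102.6K features at k=256
    println(f"102.6K features, k=256: ${modelGB(102600L, 256L)}%.2f GB")
  }
}
```

At 3M features and k=1500 the model alone runs to tens of GB, so on a single 4 GB card only a much smaller k or a reduced feature set can fit; anything larger needs the model split across devices.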
Experiencing similar errors running on Mac OS 10.9.5 with an NVIDIA GeForce GTX 675MX (1024 MB) and CUDA version 7.0.
Following BIDMach_basic_classification.ipynb, `mm.train` fails with:
Subsequent attempts, after calling `resetGPU; Mat.clearCaches` and repeating all preceding steps in the workbook, result in:
I gradually increased the size of the training set and found the maximum input size to be ~200k features (using all categories).
Sometimes I can successfully train on 230k features, but not consistently.
On another attempt:
It looks like your GPU is being used to drive the graphics display as well as for computing (the ~30% of memory in use before doing anything must be display memory). That reduces the memory available for compute and also makes the calculation more likely to fail during peaks in video memory use.
re: much larger models, we're working on model partitioning across a cluster and should have something up in a couple of months.
Are there any guidelines on the amount of GPU memory necessary? I am currently trying the example given in the quickstart with the RCV1 data set and running into an out-of-memory error. My GPU is an older GTX 570 with 1.25 GB of memory and the data set is ~0.8 GB. Are there any estimates that would let me gauge the size of data set I could process with this card, or more generally on a GPU with a given amount of memory? Furthermore, could a single data set be spread across multiple GPUs?