NVIDIA / gpu-rest-engine

A REST API for Caffe using Docker and Go
BSD 3-Clause "New" or "Revised" License

Running multiple inference servers fails #10

Closed by kraigrs 7 years ago

kraigrs commented 7 years ago

Referencing #9, I thought I had worked around this, and in practice it works fine for classifying images one at a time with multiple models. But when I deployed it in my application, where both models are called multiple times per second, I received the following error:

2017/04/10 19:30:13 Initializing Caffe classifiers
2017/04/10 19:30:14 Adding REST endpoint /api/classify
2017/04/10 19:30:14 Starting server listening on :8000
F0410 19:30:37.227159     8 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f81341c35cd  google::LogMessage::Fail()
    @     0x7f81341c5433  google::LogMessage::SendToLog()
    @     0x7f81341c315b  google::LogMessage::Flush()
    @     0x7f81341c5e1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f81346d5010  caffe::SyncedMemory::to_gpu()
    @     0x7f81346d3fd9  caffe::SyncedMemory::mutable_gpu_data()
    @     0x7f81346f02f2  caffe::Blob<>::mutable_gpu_data()
    @     0x7f8134714f66  caffe::PoolingLayer<>::Forward_gpu()
    @     0x7f81346da7c2  caffe::Net<>::ForwardFromTo()
    @     0x7f81346da916  caffe::Net<>::Forward()
    @           0x5e8884  Classifier::Predict()
    @           0x5e8af6  _ZN10Classifier8ClassifyB5cxx11ERKN2cv3MatEi
    @           0x5e925f  classifier_classify
    @           0x5e73f3  _cgo_53771bef0329_C2func_classifier_classify
    @           0x45eba0  (unknown)
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x7f812ff14196]

runtime stack:
runtime.throw(0x678782, 0x2a)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x2cc

goroutine 6 [syscall, locked to thread]:
runtime.cgocall(0x5e73c0, 0xc42003fc18, 0x0)
    /usr/local/go/src/runtime/cgocall.go:131 +0x110 fp=0xc42003fbc8 sp=0xc42003fb88
main._C2func_classifier_classify(0x1a194860, 0xc420134000, 0x179d2, 0x0, 0x0, 0x0)
    inference/_obj/_cgo_gotypes.go:76 +0x69 fp=0xc42003fc18 sp=0xc42003fbc8
main.classify(0x972340, 0xc420067860, 0xc4200c62d0)
    /go/src/inference/main.go:33 +0x1da fp=0xc42003fc88 sp=0xc42003fc18
net/http.HandlerFunc.ServeHTTP(0x68da20, 0x972340, 0xc420067860, 0xc4200c62d0)
    /usr/local/go/src/net/http/server.go:1726 +0x44 fp=0xc42003fcb0 sp=0xc42003fc88
net/http.(*ServeMux).ServeHTTP(0x984d20, 0x972340, 0xc420067860, 0xc4200c62d0)
    /usr/local/go/src/net/http/server.go:2022 +0x7f fp=0xc42003fcf0 sp=0xc42003fcb0
net/http.serverHandler.ServeHTTP(0xc42007e280, 0x972340, 0xc420067860, 0xc4200c62d0)
    /usr/local/go/src/net/http/server.go:2202 +0x7d fp=0xc42003fd38 sp=0xc42003fcf0
net/http.(*conn).serve(0xc42007e400, 0x9727c0, 0xc420016740)
    /usr/local/go/src/net/http/server.go:1579 +0x4b7 fp=0xc42003ff98 sp=0xc42003fd38
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42003ffa0 sp=0xc42003ff98
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:2293 +0x44d

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7f8134ce9f28, 0x72, 0x0)
    /usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc42005a0d0, 0x72, 0xc42004fbd8, 0xc420014100)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc42005a0d0, 0x970140, 0xc420014100)
    /usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).accept(0xc42005a070, 0x0, 0x96ee00, 0xc4200c28c0)
    /usr/local/go/src/net/fd_unix.go:419 +0x238
net.(*TCPListener).accept(0xc42002e030, 0x29e8d60800, 0x0, 0x0)
    /usr/local/go/src/net/tcpsock_posix.go:132 +0x2e
net.(*TCPListener).AcceptTCP(0xc42002e030, 0xc42004fd00, 0xc42004fd08, 0xc42004fcf8)
    /usr/local/go/src/net/tcpsock.go:209 +0x49
net/http.tcpKeepAliveListener.Accept(0xc42002e030, 0x68dc08, 0xc42007e400, 0x972880, 0xc420018f30)
    /usr/local/go/src/net/http/server.go:2608 +0x2f
net/http.(*Server).Serve(0xc42007e280, 0x972440, 0xc42002e030, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:2273 +0x1ce
net/http.(*Server).ListenAndServe(0xc42007e280, 0xc42007e280, 0x0)
    /usr/local/go/src/net/http/server.go:2219 +0xb4
net/http.ListenAndServe(0x66de6b, 0x5, 0x0, 0x0, 0xc420015010, 0x0)
    /usr/local/go/src/net/http/server.go:2351 +0xa0
main.main()
    /go/src/inference/main.go:60 +0x4a0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

The images that I'm passing to the listener are pretty small, so I don't know how I could run out of memory.

Any help on this would be greatly appreciated.

flx42 commented 7 years ago

Which GPU are you using? And which neural network model?

kraigrs commented 7 years ago

Both models are GoogleNet but using different training sets, and I have a single mobile GPU, NVIDIA Quadro K1100M (384 cores, 2GB memory).

Are you thinking that the models themselves are too big?

flx42 commented 7 years ago

Yes, that's probably the case. Launch each model separately, send a few images in (to make sure everything is initialized), then check the memory usage of each process with nvidia-smi on the host.
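For the per-process numbers, something like this should work (recent drivers support query mode; the exact field names can vary slightly between driver versions):

```shell
# Per-process GPU memory usage; prints a fallback message on machines
# where the NVIDIA driver / nvidia-smi is not installed
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv \
  || echo "nvidia-smi not available"
```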

kraigrs commented 7 years ago

Here is the output of nvidia-smi right before I receive the error message from the second deployed model:

Mon Apr 10 16:17:13 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.93.02              Driver Version: 361.93.02                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K1100M       On   | 0000:01:00.0      On |                  N/A |
| N/A   51C    P0    N/A /  N/A |   1996MiB /  1998MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1606    G   /usr/lib/xorg/Xorg                             137MiB |
|    0      1611    C   /usr/bin/python                                 10MiB |
|    0      2622    G   compiz                                         108MiB |
|    0      7072    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    69MiB |
|    0      9622    C   inference                                      862MiB |
|    0     21166    C   inference                                      802MiB |
+-----------------------------------------------------------------------------+

So it looks like it is in fact getting very close to the limit. Would you recommend training a more lightweight model?
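Adding up the per-process column from the table above as a sanity check:

```shell
# Sum of the GPU memory column from the nvidia-smi output above (MiB):
# Xorg + python + compiz + browser + inference #1 + inference #2
echo $((137 + 10 + 108 + 69 + 862 + 802))
```

That comes to 1988 MiB of the card's 1998 MiB, and the two inference servers alone account for 1664 MiB of it.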

flx42 commented 7 years ago

Ouch, yes, you're very close to the limit. I think GoogleNet is fine, but you either need a GPU with more memory or need to kill your X server to free some space.
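For scale, here is what the display stack (Xorg, compiz, and the GL browser process) is holding according to your table:

```shell
# MiB used by Xorg + compiz + the GL browser process in the table above
echo $((137 + 108 + 69))
```

That's about 314 MiB, which is roughly the headroom the second model seems to be missing.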