deepjavalibrary / djl-spring-boot-starter

DJL Spring Boot Starter
Apache License 2.0

do you support to concurrent predict? #23

Closed moowcharnfu closed 2 years ago

moowcharnfu commented 2 years ago

Case: a single request takes about 60 ms.

Under JMeter, 50 concurrent users complete in about 1300 ms, and so do 60; but at 70 concurrent users I get an error: CUDA out of memory. On CPU: 25000 ms single-threaded and 72000 ms multithreaded. On GPU: 60 ms for one request, 1300 ms for 50 concurrent, error at 70 concurrent. Do you have experience with multithreaded concurrent inference in Spring Boot? How can I avoid this error and improve throughput?

frankfliu commented 2 years ago

@moowcharnfu

Which engine are you using? We can take a look if you have a project that reproduces your issue.

We do support multithreaded inference. If you want to measure your model's performance under multithreading, you can try djl-bench to get an idea: https://github.com/deepjavalibrary/djl/tree/master/extensions/benchmark
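For example, a multithreaded run against a local TorchScript model might be invoked roughly like this (the model path and input shape are placeholders; check the benchmark README above for the authoritative flags):

```shell
# Illustrative djl-bench invocation -- flag names per the benchmark README,
# but the model path and input shape here are placeholders:
#   -e engine    -p local model path    -s input shape
#   -c number of iterations    -t number of inference threads
djl-bench -e PyTorch -p /path/to/model -s 1,3,224,224 -c 1000 -t 2
```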

moowcharnfu commented 2 years ago

> @moowcharnfu
>
> Which engine are you using? We can take a look if you have a project that reproduces your issue.
>
> We do support multithreaded inference. If you want to measure your model's performance under multithreading, you can try djl-bench to get an idea: https://github.com/deepjavalibrary/djl/tree/master/extensions/benchmark

Benchmark output for my model:

    [INFO ] - Number of inter-op threads is 1
    [INFO ] - Number of intra-op threads is 1
    [INFO ] - Load PyTorch (1.11.0) in 0.012 ms.
    [INFO ] - Running MultithreadedBenchmark on: [gpu(0)].
    [INFO ] - Multithreading inference with 1 threads.
    [WARN ] - Simple repository pointing to a non-archive file.
    Loading: 100% |========================================|
    [INFO ] - Model zhanshi5.torchscript loaded in: 1345.695 ms.
    [INFO ] - Completed 1000 requests
    [INFO ] - Completed 2000 requests
    [INFO ] - Inference result: [7.270399, 5.782502, 17.617685 ...]
    [INFO ] - Throughput: 38.64, completed 2000 iteration in 51763 ms.
    [INFO ] - Model loading time: 1345.695 ms.
    [INFO ] - total P50: 22.321 ms, P90: 33.854 ms, P99: 42.142 ms
    [INFO ] - inference P50: 5.340 ms, P90: 5.834 ms, P99: 7.140 ms
    [INFO ] - preprocess P50: 0.051 ms, P90: 0.075 ms, P99: 0.130 ms
    [INFO ] - postprocess P50: 16.835 ms, P90: 28.236 ms, P99: 36.717 ms

Code (Spring bean & PyTorch):

        long start = System.currentTimeMillis();
        PredictResultDTO predict = new PredictResultDTO();
        try {
            Image img = ImageFactory.getInstance().fromInputStream(files.getInputStream());
            var results = xiaoxian.predict(img);

            if (results.getNumberOfObjects() > 0) {
                predict = convertResult(results, img.getWidth(), img.getHeight());// just return original xywh,ignore
                return ResponseEntity.ok(predict);
            }
            predict.setMsg("**");
            return ResponseEntity.ok(predict);
        } catch (Exception e) {
            logger.error("{}", e.getMessage());
            predict.setMsg("system error...");
            return ResponseEntity.ok(predict);
        } finally {
            long end = System.currentTimeMillis();
            logger.info("time: {}", (end - start));
        }
frankfliu commented 2 years ago

OK, your system only has 1 GPU. By default, we use 1 thread per GPU in the benchmark. Based on your model, 2 or 3 threads per GPU may give you better throughput.

The memory leak you observed may be caused by your code: you didn't close your Predictor. If you are using the PyTorch engine, you can either:

  1. create and close a Predictor in every inference call, or
  2. create a static Predictor, or create one as a bean.
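Option 1 might look roughly like the following sketch. It uses DJL's standard `ZooModel`/`Predictor` API; the `PredictService` class and the way the model is injected are assumptions for illustration, not code from this project:

```java
import ai.djl.inference.Predictor;
import ai.djl.modality.cv.Image;
import ai.djl.modality.cv.output.DetectedObjects;
import ai.djl.repository.zoo.ZooModel;

// Hypothetical service wrapper, for illustration only.
public class PredictService {

    // The ZooModel is loaded once (e.g. as a Spring bean) and is thread-safe to share.
    private final ZooModel<Image, DetectedObjects> model;

    public PredictService(ZooModel<Image, DetectedObjects> model) {
        this.model = model;
    }

    public DetectedObjects predict(Image img) throws Exception {
        // Option 1: create a Predictor per call and close it via try-with-resources,
        // so native (GPU) resources are released after every inference instead of leaking.
        try (Predictor<Image, DetectedObjects> predictor = model.newPredictor()) {
            return predictor.predict(img);
        }
    }
}
```

Creating the predictor per call trades a little per-request overhead for bounded GPU memory; the alternative (a shared singleton Predictor) avoids that overhead but must then live for the application's lifetime.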
moowcharnfu commented 2 years ago

How do I load a local model? Is there a property like the benchmark's "-p" option, or does the code only support remote URLs? Version: springboot-autoconfigure-0.15

    @ConfigurationProperties("djl")
    public class DjlConfigurationProperties {
        private ApplicationType applicationType;
        private Class<?> inputClass;
        private Class<?> outputClass;
        private String modelArtifactId;
        private String[] urls;
        private Map<String, Object> arguments;
        private Map<String, String> modelFilter;
        .....
    }
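Assuming the `urls` property above feeds straight into DJL's model-URL list, a local model can often be referenced with a `file:` URL; the path below is a placeholder, not from this project:

```properties
# application.properties -- hypothetical local-model configuration.
# DJL model URLs generally accept the file: scheme in addition to http(s):.
djl.urls=file:///opt/models/zhanshi5/
```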

moowcharnfu commented 2 years ago

> OK, your system only has 1 GPU. By default, we use 1 thread per GPU in the benchmark. Based on your model, 2 or 3 threads per GPU may give you better throughput.
>
> The memory leak you observed may be caused by your code: you didn't close your Predictor. If you are using the PyTorch engine, you can either:
>
>   1. create and close a Predictor in every inference call, or
>   2. create a static Predictor, or create one as a bean.

This is my project; you can test it:

  1. see readme.md
  2. edit application.properties (djl.gpu and pytorch.model_dir) to run it: pytorch-spring-boot.zip
frankfliu commented 2 years ago

@moowcharnfu

I tested your code on a GPU instance (g4dn.2x); everything looks fine. I used Apache Bench to test the performance:

ab -c 10 -n 20000 -k -T "multipart/form-data; boundary=ND4FBE3nV755y-pkfNT4QOhN-1y1m0o8Gn" -p multipart_224.txt http://127.0.0.1:8080/dmp/ai/predict

Server Software:        
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /dmp/ai/predict
Document Length:        79 bytes

Concurrency Level:      10
Time taken for tests:   180.058 seconds
Complete requests:      20000
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      3680000 bytes
Total body sent:        186160000
HTML transferred:       1580000 bytes
Requests per second:    111.08 [#/sec] (mean)
Time per request:       90.029 [ms] (mean)
Time per request:       9.003 [ms] (mean, across all concurrent requests)

It achieves 111.08 requests per second, and GPU memory stays at 2425 MiB.

I also tested with 80 concurrent connections; it used 4185 MiB of GPU memory. What exactly is your issue?

moowcharnfu commented 2 years ago

Maybe it's the machine; I'll try another one.