deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution

rolling batch does not work #1173

Open prgawade opened 10 months ago

prgawade commented 10 months ago

Description

We have deployed a Salesforce codegen-2b-multi model on an NVIDIA GPU infrastructure with the following serving.properties:

engine=MPI
# tested with both lmi-dist and auto
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=8
option.max_rolling_batch_prefill_tokens=1088
option.paged_attention=false
option.model_loading_timeout=3600
option.entryPoint=djl_python.deepspeed
chunked_read_timeout=3
option.tensor_parallel_degree=1
option.task=text-generation
option.dtype=fp16
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
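A request against this deployment might look like the following. This is a minimal sketch, not our exact client code: the model name codegen and the payload shape are assumptions based on the standard DJL Serving predictions API and LMI text-generation schema.

import requests

# Minimal smoke test against the DJL Serving predictions endpoint.
# The model name "codegen" and the payload fields are assumptions;
# adjust them to match the actual deployment.
response = requests.post(
    "http://localhost:8080/predictions/codegen",
    json={
        "inputs": "def fibonacci(n):",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
print(response.status_code)
print(response.text)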

Expected Behavior

Rolling batch should be supported by DJL Serving.

Error Message

INFO ModelServer BOTH API bind to: http://0.0.0.0:8080
WARN PyProcess W-88-models-stderr: [1,0]:Setting pad_token_id to eos_token_id:50256 for open-end generation.
WARN InferenceRequestHandler Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
    at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.23.0.jar:?]
    at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.23.0.jar:?]
    at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:380) ~[serving-0.23.0.jar:?]
    at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:286) ~[serving-0.23.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479) [?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
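The exception is thrown when the frontend does not receive the next response chunk from the worker within the configured window; note that the serving.properties above sets chunked_read_timeout=3, so the frontend waits only a few seconds per chunk before giving up. A rough sketch of the polling pattern behind this failure mode (illustrative only, not DJL's internal code):

import queue

# Illustrative chunk-polling loop with a read timeout. The names and the
# 3-second default mirror the configuration above, but this is a sketch,
# not DJL's actual implementation.
chunks: queue.Queue = queue.Queue()

def next_chunk(timeout_s: float = 3.0) -> bytes:
    try:
        # Block until the worker pushes the next chunk, or time out.
        return chunks.get(timeout=timeout_s)
    except queue.Empty:
        # Same failure mode as the "Read chunk timeout." in the log above.
        raise RuntimeError("Read chunk timeout.")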

lanking520 commented 10 months ago

The CodeGen model is not among the supported models for rolling batch with the MPI engine.

engine=Python
option.rolling_batch=auto
option.max_rolling_batch_size=8
option.tensor_parallel_degree=1
option.task=text-generation
option.dtype=fp16

Please try these settings with your model and see if they work.
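Once the settings are applied, a streaming request can confirm that chunks arrive before the read timeout fires. Again a sketch, assuming the hypothetical model name codegen:

import requests

# Hypothetical verification: stream the generation and print chunks as
# they arrive. Model name "codegen" and payload shape are assumptions.
with requests.post(
    "http://localhost:8080/predictions/codegen",
    json={"inputs": "def quicksort(arr):", "parameters": {"max_new_tokens": 64}},
    stream=True,
    timeout=300,
) as response:
    for chunk in response.iter_content(chunk_size=None):
        print(chunk.decode("utf-8", errors="replace"), end="", flush=True)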