deepjavalibrary / djl-serving

A universal, scalable machine learning model deployment solution
Apache License 2.0

Streaming with rolling batch for starcoderbase model not working #1352

Open prgawade opened 9 months ago

prgawade commented 9 months ago

Description

Token streaming does not work when rolling batch is enabled.

Expected Behavior

Tokens should be streamed back to the client incrementally as they are generated.

Error Message

How to Reproduce?


Steps to reproduce

serving.properties:

```
engine=Python
option.entryPoint=djl_python.deepspeed
option.rolling_batch=deepspeed
option.max_rolling_batch_size=32
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
option.enable_streaming=deepspeed
```

What have you tried to solve it?


```
INFO  WorkerPool loading model starcoderbase (PENDING) on gpu(0) ...
INFO  ModelInfo Available CPU memory: 2012568 MB, required: 0 MB, reserved: 500 MB
INFO  ModelInfo Available GPU memory: 80636 MB, required: 0 MB, reserved: 500 MB
DEBUG DefaultModelZoo Scanning models in repo: class ai.djl.repository.SimpleRepository, /data/starcoderbase
INFO  ModelInfo Loading model starcoderbase on gpu(0)
DEBUG ModelZoo Loading model with Criteria:
	Application: UNDEFINED
	Input: class ai.djl.modality.Input
	Output: class ai.djl.modality.Output
	Engine: Python
	ModelZoo: ai.djl.localmodelzoo
	Arguments: {"job_queue_size":"1000","engine":"Python","gpu.minWorkers":"1","log_model_metric":"true","gpu.maxWorkers":"1","metrics_aggregation":"10"}
	Options: {"task":"text-generation","enable_streaming":"deepspeed","rolling_batch":"deepspeed","entryPoint":"djl_python.deepspeed","max_rolling_batch_size":"32"}
	No translator supplied
DEBUG ModelZoo Searching model in specified model zoo: ai.djl.localmodelzoo
DEBUG ModelZoo Checking ModelLoader: ai.djl.localmodelzoo:starcoderbase UNDEFINED [ ai.djl.localmodelzoo/starcoderbase/starcoderbase {} ]
DEBUG MRL Preparing artifact: /data/starcoderbase, ai.djl.localmodelzoo/starcoderbase/starcoderbase {}
DEBUG SimpleRepository Skip prepare for local repository.
DEBUG PyModel options in serving.properties for model: starcoderbase
DEBUG PyModel 	task=text-generation
DEBUG PyModel 	enable_streaming=deepspeed
DEBUG PyModel 	rolling_batch=deepspeed
DEBUG PyModel 	entryPoint=djl_python.deepspeed
DEBUG PyModel 	max_rolling_batch_size=32
INFO  WorkerPool scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
INFO  PyProcess Start process: 19000 - retry: 0
DEBUG Connection cmd: [python3, /tmp/.djl.ai/python/0.24.0/djl_python_engine.py, --sock-type, unix, --sock-name, /tmp/djl_sock.19000, --model-dir, /data/starcoderbase, --entry-point, djl_python.deepspeed, --device-id, 0]
INFO  PyProcess W-83-starcoderbase-stdout: 83 - djl_python_engine started with args: ['--sock-type', 'unix', '--sock-name', '/tmp/djl_sock.19000', '--model-dir', '/data/starcoderbase', '--entry-point', 'djl_python.deepspeed', '--device-id', '0']
INFO  PyProcess W-83-starcoderbase-stdout: Created a temporary directory at /tmp/tmphq3sejzb
INFO  PyProcess W-83-starcoderbase-stdout: Writing /tmp/tmphq3sejzb/_remote_module_non_scriptable.py
INFO  PyProcess W-83-starcoderbase-stdout: [2023-11-30 12:04:42,836] [INFO] [real_accelerator.py:42:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO  PyProcess W-83-starcoderbase-stdout: Python engine started.
INFO  PyProcess W-83-starcoderbase-stdout: DeepSpeed does not currently support optimized CUDA kernels for the model type gpt_bigcode, and may not support this model for inference. Please check the DeepSpeed documentation to verify. Attempting to load model with DeepSpeed.
WARN  PyProcess W-83-starcoderbase-stderr:
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  14%|          | 1/7 [00:10<01:04, 10.76s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  29%|          | 2/7 [00:21<00:53, 10.66s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  43%|          | 3/7 [00:31<00:41, 10.42s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  57%|          | 4/7 [00:41<00:31, 10.34s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  71%|          | 5/7 [00:52<00:20, 10.35s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  86%|          | 6/7 [01:02<00:10, 10.44s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards: 100%|          | 7/7 [01:06<00:00,  8.35s/it]
WARN  PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards: 100%|          | 7/7 [01:06<00:00,  9.53s/it]
INFO  PyProcess W-83-starcoderbase-stdout: [2023-11-30 12:05:51,163] [INFO] [logging.py:18:log_dist] [Rank -1] DeepSpeed info: version=0.10.0+6fe724c, git-hash=6fe724c, git-branch=HEAD
WARN  PyProcess W-83-starcoderbase-stderr: Using pad_token, but it is not set yet.
INFO  PyProcess W-83-starcoderbase-stdout: [2023-11-30 12:05:51,164] [INFO] [logging.py:18:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
INFO  PyProcess W-83-starcoderbase-stdout: Initialized DeepSpeed model with the following configurations
INFO  PyProcess W-83-starcoderbase-stdout: model: /data/starcoderbase
INFO  PyProcess W-83-starcoderbase-stdout: task: text-generation
INFO  PyProcess W-83-starcoderbase-stdout: data_type: torch.bfloat16
INFO  PyProcess W-83-starcoderbase-stdout: tensor_parallel_degree: 1
INFO  PyProcess Model [starcoderbase] initialized.
DEBUG WorkerPool worker pool for model starcoderbase (READY): 1-fixedPool
INFO  ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO  ModelServer BOTH API bind to: http://0.0.0.0:8080
WARN  PyProcess W-83-starcoderbase-stderr: /tmp/.djl.ai/python/0.24.0/djl_python/streaming_utils.py:313: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
WARN  PyProcess W-83-starcoderbase-stderr: probs = torch.nn.functional.softmax(logits[-1])
WARN  InferenceRequestHandler Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
	at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.24.0.jar:?]
	at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.24.0.jar:?]
	at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:380) ~[serving-0.24.0.jar:?]
	at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:286) ~[serving-0.24.0.jar:?]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
	at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479) [?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
	at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
```
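The "Read chunk timeout" above means the HTTP handler polled `ChunkedBytesSupplier` for the next output chunk and the backend never produced one in time. As a rough Python analogy (not the DJL implementation; the class and method names here are hypothetical), the mechanism looks like a queue poll with a deadline:

```python
import queue


class ChunkedSupplier:
    """Rough analogy of DJL's ChunkedBytesSupplier: the serving frontend
    polls for the next chunk and gives up after a timeout."""

    def __init__(self, timeout=1.0):
        self._q = queue.Queue()
        self.timeout = timeout

    def append(self, chunk):
        # Called by the backend as tokens are generated.
        self._q.put(chunk)

    def next_chunk(self):
        # Called by the HTTP handler; raises if the backend stalls,
        # analogous to the IllegalStateException in the log above.
        try:
            return self._q.get(timeout=self.timeout)
        except queue.Empty:
            raise TimeoutError("Read chunk timeout.") from None


supplier = ChunkedSupplier(timeout=0.1)
supplier.append(b"Hello")
print(supplier.next_chunk())  # b'Hello'
try:
    supplier.next_chunk()  # backend never produced another chunk
except TimeoutError as e:
    print(e)  # Read chunk timeout.
```

In this issue the backend stalls because the DeepSpeed rolling batch never emits chunks for this model, so the frontend's poll times out.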

lanking520 commented 9 months ago

StarCoder does not work with the DeepSpeed backend.

```
engine=MPI
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
```

Please try this instead; StarCoder is supported by LMI-Dist.

prgawade commented 9 months ago

Thanks. We tried the suggested settings:

```
engine=MPI
option.entryPoint=djl_python.deepspeed
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
option.enable_streaming=true
```

We still received the chunk read timeout error. In addition, we also tried the following settings:

```
engine=MPI
option.entryPoint=djl_python.huggingface
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
option.enable_streaming=true
```

But when we call the inference API, we get the full response at once rather than streamed (chunked) tokens. The curl request is as follows:

```
curl --location --request POST '<>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "inputs": [
      "//Generate RAML for the given text, ensure to include all mentioned fields with datatype, examples, methods, resources, title, schema, and so on. Follow standard RAML format.\nInput text-1:\n1- Create API definition in RAML format and MuleSoft implementation for order management/entity, API accepts and responds with JSON payload\n2 ...... << note: part of the request is redacted >>"
    ],
    "parameters": {
      "max_new_tokens": 400,
      "temperature": 0.1,
      "top_k": 1,
      "top_p": 0.9
    }
  }'
```
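When streaming does work, the server typically emits newline-delimited JSON objects, and HTTP chunk boundaries need not align with line boundaries. Below is a minimal client-side sketch of reassembling such a stream; `iter_jsonlines` and the simulated chunks are hypothetical, and a real client would iterate over an HTTP response opened with `requests.post(url, json=payload, stream=True)`:

```python
import json


def iter_jsonlines(chunks):
    """Yield JSON objects from an iterable of byte chunks whose boundaries
    may fall in the middle of a JSON line (hypothetical helper)."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)


# Simulated chunked response: the second object is split across two chunks.
chunks = [b'{"generated_text": "def"}\n{"genera', b'ted_text": " foo"}\n']
for obj in iter_jsonlines(chunks):
    print(obj["generated_text"])  # prints "def", then " foo"
```

If the client instead buffers the whole body before printing, the output looks like a single full response even though the server streamed it, which can also explain seeing "the full response" with default curl.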

frankfliu commented 9 months ago

@prgawade

1. If you want `application/jsonlines` streaming output, you need to set the following:

```
engine=MPI
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.output_formatter=jsonlines
```

2. If you use `curl -N`, you will see that the full JSON payload is sent with chunked transfer encoding.
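For reference, a request with curl's buffering disabled might look like the following; the endpoint path and payload are assumptions based on the model name and request shape earlier in this thread:

```shell
# -N / --no-buffer makes curl print each jsonlines chunk as it arrives
# instead of buffering the whole response body.
curl -N --request POST 'http://127.0.0.1:8080/predictions/starcoderbase' \
  --header 'Content-Type: application/json' \
  --data-raw '{"inputs": ["def hello():"], "parameters": {"max_new_tokens": 64}}'
```

Without `-N`, curl may buffer output, so a correctly streamed response can still appear to arrive all at once.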