prgawade opened 9 months ago
StarCoder is not working with DeepSpeed backend
StarCoder is supported by LMI-Dist; please try this instead:

engine=MPI
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
thanks,
We tried the above settings:

engine=MPI
option.entryPoint=djl_python.deepspeed
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
option.enable_streaming=true
We received a chunk read timeout error. In addition, we also tried the following settings:

engine=MPI
option.entryPoint=djl_python.huggingface
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
option.enable_streaming=true
but when we try an inference API request, we get the full response rather than the chunked tokens. The curl request is as follows:
curl --location --request POST '<
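To confirm whether the endpoint is actually streaming, it helps to time the chunks on the client side. Below is a minimal Python sketch for doing that; the URL, model name, and payload are placeholder assumptions, not taken from this issue, so substitute your own endpoint and prompt:

import time

import requests

# Hypothetical endpoint and payload; adjust to your deployment.
url = "http://localhost:8080/predictions/starcoderbase"
payload = {"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 64}}

start = time.time()
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    # If the server streams, chunks arrive spread across the generation
    # time; a single burst at the very end means the response was buffered
    # somewhere and sent as one payload.
    for chunk in resp.iter_content(chunk_size=None):
        print(f"t={time.time() - start:6.2f}s  received {len(chunk)} bytes")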
@prgawade for streaming output (content type application/jsonlines), you need to set the following:

engine=MPI
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.output_formatter=jsonlines

and test with curl -N so that curl does not buffer the output; you will see the full json payload is sent with chunked encoding.
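For reference, a client consuming the jsonlines stream must read the body incrementally, which is exactly what curl -N does. A minimal Python sketch, again with a placeholder endpoint and payload; the exact per-line JSON schema depends on the container version, so each line is parsed and printed as-is:

import json

import requests

# Hypothetical endpoint and payload; adjust to your deployment.
url = "http://localhost:8080/predictions/starcoderbase"
payload = {"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 64}}

# stream=True is the requests counterpart of `curl -N`: without it the
# client buffers the whole chunked response before handing it back.
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            print(json.loads(line))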
Description
Token streaming is not working with rolling batch.
Expected Behavior
Generated tokens should be streamed back to the client chunk by chunk as they are produced, instead of the full response arriving at once.
Error Message

java.lang.IllegalStateException: Read chunk timeout (see the full stack trace in the logs below).
How to Reproduce?
Steps to reproduce
serving.properties:

engine=Python
option.entryPoint=djl_python.deepspeed
option.rolling_batch=deepspeed
option.max_rolling_batch_size=32
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
option.enable_streaming=deepspeed
What have you tried to solve it?
INFO WorkerPool loading model starcoderbase (PENDING) on gpu(0) ...
INFO ModelInfo Available CPU memory: 2012568 MB, required: 0 MB, reserved: 500 MB
INFO ModelInfo Available GPU memory: 80636 MB, required: 0 MB, reserved: 500 MB
DEBUG DefaultModelZoo Scanning models in repo: class ai.djl.repository.SimpleRepository, /data/starcoderbase
INFO ModelInfo Loading model starcoderbase on gpu(0)
DEBUG ModelZoo Loading model with Criteria:
    Application: UNDEFINED
    Input: class ai.djl.modality.Input
    Output: class ai.djl.modality.Output
    Engine: Python
    ModelZoo: ai.djl.localmodelzoo
    Arguments: {"job_queue_size":"1000","engine":"Python","gpu.minWorkers":"1","log_model_metric":"true","gpu.maxWorkers":"1","metrics_aggregation":"10"}
    Options: {"task":"text-generation","enable_streaming":"deepspeed","rolling_batch":"deepspeed","entryPoint":"djl_python.deepspeed","max_rolling_batch_size":"32"}
    No translator supplied
DEBUG ModelZoo Searching model in specified model zoo: ai.djl.localmodelzoo
DEBUG ModelZoo Checking ModelLoader: ai.djl.localmodelzoo:starcoderbase UNDEFINED [ ai.djl.localmodelzoo/starcoderbase/starcoderbase {} ]
DEBUG MRL Preparing artifact: /data/starcoderbase, ai.djl.localmodelzoo/starcoderbase/starcoderbase {}
DEBUG SimpleRepository Skip prepare for local repository.
DEBUG PyModel options in serving.properties for model: starcoderbase
DEBUG PyModel task=text-generation
DEBUG PyModel enable_streaming=deepspeed
DEBUG PyModel rolling_batch=deepspeed
DEBUG PyModel entryPoint=djl_python.deepspeed
DEBUG PyModel max_rolling_batch_size=32
INFO WorkerPool scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
INFO PyProcess Start process: 19000 - retry: 0
DEBUG Connection cmd: [python3, /tmp/.djl.ai/python/0.24.0/djl_python_engine.py, --sock-type, unix, --sock-name, /tmp/djl_sock.19000, --model-dir, /data/starcoderbase, --entry-point, djl_python.deepspeed, --device-id, 0]
INFO PyProcess W-83-starcoderbase-stdout: 83 - djl_python_engine started with args: ['--sock-type', 'unix', '--sock-name', '/tmp/djl_sock.19000', '--model-dir', '/data/starcoderbase', '--entry-point', 'djl_python.deepspeed', '--device-id', '0']
INFO PyProcess W-83-starcoderbase-stdout: Created a temporary directory at /tmp/tmphq3sejzb
INFO PyProcess W-83-starcoderbase-stdout: Writing /tmp/tmphq3sejzb/_remote_module_non_scriptable.py
INFO PyProcess W-83-starcoderbase-stdout: [2023-11-30 12:04:42,836] [INFO] [real_accelerator.py:42:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO PyProcess W-83-starcoderbase-stdout: Python engine started.
INFO PyProcess W-83-starcoderbase-stdout: DeepSpeed does not currently support optimized CUDA kernels for the model type gpt_bigcode, and may not support this model for inference. Please check the DeepSpeed documentation to verify. Attempting to load model with DeepSpeed.
WARN PyProcess W-83-starcoderbase-stderr:
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  14%|█▍        | 1/7 [00:10<01:04, 10.76s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  29%|██▉       | 2/7 [00:21<00:53, 10.66s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  43%|████▎     | 3/7 [00:31<00:41, 10.42s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  57%|█████▋    | 4/7 [00:41<00:31, 10.34s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  71%|███████▏  | 5/7 [00:52<00:20, 10.35s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards:  86%|████████▌ | 6/7 [01:02<00:10, 10.44s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards: 100%|██████████| 7/7 [01:06<00:00,  8.35s/it]
WARN PyProcess W-83-starcoderbase-stderr: Loading checkpoint shards: 100%|██████████| 7/7 [01:06<00:00,  9.53s/it]
INFO PyProcess W-83-starcoderbase-stdout: [2023-11-30 12:05:51,163] [INFO] [logging.py:18:log_dist] [Rank -1] DeepSpeed info: version=0.10.0+6fe724c, git-hash=6fe724c, git-branch=HEAD
WARN PyProcess W-83-starcoderbase-stderr: Using pad_token, but it is not set yet.
INFO PyProcess W-83-starcoderbase-stdout: [2023-11-30 12:05:51,164] [INFO] [logging.py:18:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
INFO PyProcess W-83-starcoderbase-stdout: Initialized DeepSpeed model with the following configurations
INFO PyProcess W-83-starcoderbase-stdout: model: /data/starcoderbase
INFO PyProcess W-83-starcoderbase-stdout: task: text-generation
INFO PyProcess W-83-starcoderbase-stdout: data_type: torch.bfloat16
INFO PyProcess W-83-starcoderbase-stdout: tensor_parallel_degree: 1
INFO PyProcess Model [starcoderbase] initialized.
DEBUG WorkerPool worker pool for model starcoderbase (READY): 1-fixedPool
INFO ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO ModelServer BOTH API bind to: http://0.0.0.0:8080
WARN PyProcess W-83-starcoderbase-stderr: /tmp/.djl.ai/python/0.24.0/djl_python/streaming_utils.py:313: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
WARN PyProcess W-83-starcoderbase-stderr:   probs = torch.nn.functional.softmax(logits[-1])
WARN InferenceRequestHandler Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
    at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.24.0.jar:?]
    at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.24.0.jar:?]
    at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:380) ~[serving-0.24.0.jar:?]
    at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:286) ~[serving-0.24.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479) [?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]