deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

Serving failed for model "microsoft/Phi-3-vision-128k-instruct" #2502

Closed. n0thing233 closed this issue 1 month ago.

n0thing233 commented 1 month ago

Description

Tried to serve the model "microsoft/Phi-3-vision-128k-instruct" with several LMI images and deploy it to SageMaker, but the deployment failed with errors.

Expected Behavior

Expected the SageMaker endpoint to reach the InService (running) state.

Error Message

CUDA compat package requires Nvidia driver ⩽550.90.12
Current installed Nvidia driver version is 535.216.01
Setup CUDA compatibility libs path to LD_LIBRARY_PATH /usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
[INFO ] ModelServer - Starting model server ...
[INFO ] ModelServer - Starting djl-serving: 0.29.0 ...
[INFO ] ModelServer - Model server home: /opt/djl
Current directory: /opt/djl
Temp directory: /tmp
Command line: -Dlog4j.configurationFile=/usr/local/djl-serving-0.29.0/conf/log4j2-plain.xml -Xmx1g -Xms1g -XX:+ExitOnOutOfMemoryError -Dai.djl.util.cuda.fork=true -XX:-UseContainerSupport
Number of CPUs: 96
CUDA version: 124 / 80
Number of GPUs: 8
Max heap size: 1024
Config file: /opt/djl/conf/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Default job_queue_size: 1000
Default batch_size: 1
Default max_batch_delay: 100
Default max_idle_time: 60
Model Store: /opt/ml/model
Initial Models: ALL
Netty threads: 0
Maximum Request Size: 67108864
Environment variables:
    HF_HUB_ENABLE_HF_TRANSFER: 1
    TENSOR_PARALLEL_DEGREE: max
    OPTION_ENGINE: Python
    HF_HOME: /tmp/.cache/huggingface
    OPTION_TRUST_REMOTE_CODE: true
    OPTION_ENFORCE_EAGER: true
    OMP_NUM_THREADS: 1
    OPTION_GPU_MEMORY_UTILIZATION: 0.5
    SAGEMAKER_SAFE_PORT_RANGE: 28000-28999
    HF_MODEL_ID: microsoft/Phi-3-vision-128k-instruct
    OPTION_ROLLING_BATCH: vllm
    SERVING_FEATURES: vllm,lmi-dist
    OPTION_MAX_MODEL_LEN: 4096
    DJL_CACHE_DIR: /tmp/.djl.ai
[INFO ] FolderScanPluginManager - scanning for plugins...
[INFO ] FolderScanPluginManager - scanning in plug-in folder :/opt/djl/plugins
[INFO ] FolderScanPluginManager - scanning in plug-in folder :/usr/local/djl-serving-0.29.0/plugins
[INFO ] PropertyFilePluginMetaDataReader - Plugin found: console/jar:file:/usr/local/djl-serving-0.29.0/plugins/management-console-0.29.0.jar!/META-INF/plugin.definition
[INFO ] PropertyFilePluginMetaDataReader - Plugin found: plugin-management/jar:file:/usr/local/djl-serving-0.29.0/plugins/plugin-management-plugin-0.29.0.jar!/META-INF/plugin.definition
[INFO ] PropertyFilePluginMetaDataReader - Plugin found: static-file-plugin/jar:file:/usr/local/djl-serving-0.29.0/plugins/static-file-plugin-0.29.0.jar!/META-INF/plugin.definition
[INFO ] PropertyFilePluginMetaDataReader - Plugin found: secure-mode/jar:file:/usr/local/djl-serving-0.29.0/plugins/secure-mode-0.29.0.jar!/META-INF/plugin.definition
[INFO ] PropertyFilePluginMetaDataReader - Plugin found: cache-engines/jar:file:/usr/local/djl-serving-0.29.0/plugins/cache-0.29.0.jar!/META-INF/plugin.definition
[INFO ] PropertyFilePluginMetaDataReader - Plugin found: kserve/jar:file:/usr/local/djl-serving-0.29.0/plugins/kserve-0.29.0.jar!/META-INF/plugin.definition
[INFO ] FolderScanPluginManager - Loading plugin: {console/jar:file:/usr/local/djl-serving-0.29.0/plugins/management-console-0.29.0.jar!/META-INF/plugin.definition}
[INFO ] PluginMetaData - plugin console changed state to INITIALIZED
[INFO ] FolderScanPluginManager - Loading plugin: {plugin-management/jar:file:/usr/local/djl-serving-0.29.0/plugins/plugin-management-plugin-0.29.0.jar!/META-INF/plugin.definition}
[INFO ] PluginMetaData - plugin plugin-management changed state to INITIALIZED
[INFO ] FolderScanPluginManager - Loading plugin: {static-file-plugin/jar:file:/usr/local/djl-serving-0.29.0/plugins/static-file-plugin-0.29.0.jar!/META-INF/plugin.definition}
[INFO ] PluginMetaData - plugin static-file-plugin changed state to INITIALIZED
[INFO ] FolderScanPluginManager - Loading plugin: {cache-engines/jar:file:/usr/local/djl-serving-0.29.0/plugins/cache-0.29.0.jar!/META-INF/plugin.definition}
[INFO ] PluginMetaData - plugin cache-engines changed state to INITIALIZED
[INFO ] FolderScanPluginManager - Loading plugin: {secure-mode/jar:file:/usr/local/djl-serving-0.29.0/plugins/secure-mode-0.29.0.jar!/META-INF/plugin.definition}
[INFO ] PluginMetaData - plugin secure-mode changed state to INITIALIZED
[INFO ] FolderScanPluginManager - Loading plugin: {kserve/jar:file:/usr/local/djl-serving-0.29.0/plugins/kserve-0.29.0.jar!/META-INF/plugin.definition}
[INFO ] PluginMetaData - plugin kserve changed state to INITIALIZED
[INFO ] PluginMetaData - plugin console changed state to ACTIVE reason: plugin ready
[INFO ] PluginMetaData - plugin plugin-management changed state to ACTIVE reason: plugin ready
[INFO ] PluginMetaData - plugin static-file-plugin changed state to ACTIVE reason: plugin ready
[INFO ] PluginMetaData - plugin cache-engines changed state to ACTIVE reason: plugin ready
[INFO ] PluginMetaData - plugin secure-mode changed state to ACTIVE reason: plugin ready
[INFO ] PluginMetaData - plugin kserve changed state to ACTIVE reason: plugin ready
[INFO ] FolderScanPluginManager - 6 plug-ins found and loaded.
[INFO ] ModelServer - Initializing model: microsoft_Phi_3_vision_128k_instruct=/tmp/.djl.ai/0dc4f4f9c09dd8aeb9cba2c6169154b4d2cb1576
[WARN ] LmiUtils - Hub config file config.json does not exist for model microsoft/Phi-3-vision-128k-instruct.
[WARN ] LmiUtils - Hub config file model_index.json does not exist for model microsoft/Phi-3-vision-128k-instruct.
[INFO ] ModelInfo - M-0001: Apply per model settings:
    job_queue_size: 1000
    max_dynamic_batch_size: 1
    max_batch_delay: 100
    max_idle_time: 60
    load_on_devices: *
    engine: Python
    mpi_mode: null
    option.entryPoint: null
    option.max_model_len: 4096
    option.trust_remote_code: true
    option.enforce_eager: true
    option.gpu_memory_utilization: 0.5
    option.model_id: microsoft/Phi-3-vision-128k-instruct
    option.rolling_batch: vllm
[INFO ] Platform - Found matching platform from: jar:file:/usr/local/djl-serving-0.29.0/lib/python-0.29.0.jar!/native/lib/python.properties
[INFO ] PyEnv - Extracting /djl_python_engine.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/arg_parser.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/aws/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/aws/cloud_watch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/chat_completions/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/chat_completions/chat_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/chat_completions/chat_utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/encode_decode.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/huggingface.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/input_parser.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/inputs.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/multimodal/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/multimodal/utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/neuron_utils/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/neuron_utils/model_loader.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/neuron_utils/utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/np_util.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/output_formatter.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/outputs.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/pair_list.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/README.md to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/hf_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/lmi_dist_rb_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/scheduler_rb_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/sd_inf2_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/tnx_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/trt_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/properties_manager/vllm_rb_properties.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/request.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/request_io.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/lmi_dist_rolling_batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/neuron_rolling_batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/rolling_batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/rolling_batch_vllm_utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/scheduler_rolling_batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/trtllm_rolling_batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/rolling_batch/vllm_rolling_batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/sagemaker.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/batch.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/lm_block.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/search_config.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/seq_batch_scheduler.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/seq_batcher.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/seq_batcher_impl.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/step_generation.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/seq_scheduler/utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/service_loader.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/session_manager.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/sm_log_filter.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/stable_diffusion_inf2.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/streaming_utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/telemetry.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/tensorrt_llm.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/tensorrt_llm_python.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/test_model.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/three_p/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/three_p/three_p_utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/__init__.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/optimum_modeling.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/optimum_neuron_scheduler.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/optimum_token_selector.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/slot.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/speculation.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/token_selector.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/transformers_neuronx_scheduler/utils.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/ts_service_loader.py to cache ...
[INFO ] PyEnv - Extracting /djl_python/utils.py to cache ...
[INFO ] ModelManager - Loading model on Python:[0]
[INFO ] WorkerPool - loading model microsoft_Phi_3_vision_128k_instruct (M-0001, PENDING) on gpu(0) ...
[INFO ] ModelInfo - M-0001: Available CPU memory: 1139457 MB, required: 0 MB, reserved: 500 MB
[INFO ] ModelInfo - M-0001: Available GPU memory: 39916 MB, required: 0 MB, reserved: 500 MB
[INFO ] ModelInfo - Loading model microsoft_Phi_3_vision_128k_instruct M-0001 on gpu(0)
[INFO ] WorkerPool - scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
[INFO ] PyProcess - Start process: 19000 - retry: 0
[INFO ] Connection - Set CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: 222 - djl_python_engine started with args: ['--sock-type', 'unix', '--sock-name', '/tmp/djl_sock.19000', '--model-dir', '/tmp/.djl.ai/0dc4f4f9c09dd8aeb9cba2c6169154b4d2cb1576', '--entry-point', '', '--device-id', '0', '--cluster-size', '1', '--recommended-entry-point', 'djl_python.huggingface']
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: PyTorch version 2.3.1+cu121 available.
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: Python engine started.
[WARN ] PyProcess - W-222-0dc4f4f9c09dd8a-stderr: A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-vision-128k-instruct:
[WARN ] PyProcess - W-222-0dc4f4f9c09dd8a-stderr: - configuration_phi3_v.py
[WARN ] PyProcess - W-222-0dc4f4f9c09dd8a-stderr: . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: INFO 10-29 22:26:28 config.py:715] Defaulting to use mp for distributed inference
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: INFO 10-29 22:26:28 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-vision-128k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: INFO 10-29 22:26:29 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: Failed invoke service.invoke_handler()
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: Traceback (most recent call last):
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 161, in run_server
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     outputs = self.service.invoke_handler(function_name, inputs)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     return getattr(self.module, function_name)(inputs)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 538, in handle
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     _service.initialize(inputs.get_properties())
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 135, in initialize
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self.rolling_batch = _rolling_batch_cls(
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/rolling_batch/vllm_rolling_batch.py", line 48, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self.engine = LLMEngine.from_engine_args(args)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     engine = cls(
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self.model_executor = executor_class(
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     super().__init__(*args, **kwargs)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self._init_executor()
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 89, in _init_executor
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     worker = ProcessWorkerWrapper(
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 162, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self.process.start()
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self._popen = self._Popen(self)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     return Popen(process_obj)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     super().__init__(process_obj)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     self._launch(process_obj)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     reduction.dump(process_obj, fp)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     ForkingPickler(file, protocol).dump(obj)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: _pickle.PicklingError: Can't pickle <class 'transformers_modules.microsoft.Phi-3-vision-128k-instruct.c45209e90a4c4f7d16b2e9d48503c7f3e83623ed.configuration_phi3_v.Phi3VConfig'>: it's not the same object as transformers_modules.microsoft.Phi-3-vision-128k-instruct.c45209e90a4c4f7d16b2e9d48503c7f3e83623ed.configuration_phi3_v.Phi3VConfig
[INFO ] PyProcess - Stop process: -1:222, failure=false
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: Python engine process died
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout: Traceback (most recent call last):
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 207, in main
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     engine.run_server()
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python_engine.py", line 125, in run_server
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     inputs.read(cl_socket)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 221, in read
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     prop_size = retrieve_short(conn)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 60, in retrieve_short
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     data = retrieve_buffer(conn, 2)
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:   File "/tmp/.djl.ai/python/0.29.0/djl_python/inputs.py", line 36, in retrieve_buffer
[INFO ] PyProcess - W-222-0dc4f4f9c09dd8a-stdout:     raise ValueError("Connection disconnected")


How to Reproduce?

import sagemaker
from sagemaker.djl_inference.model import DJLModel

role = 'your_role'
session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124'
model_id = "microsoft/Phi-3-vision-128k-instruct"
env = {
    "TENSOR_PARALLEL_DEGREE": "max",            # use all gpu
    "OPTION_ROLLING_BATCH": "vllm",           # use vllm for rolling batching
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_MAX_MODEL_LEN": "4096",
    "OPTION_ENFORCE_EAGER": "true",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.5",
}
model = DJLModel(
    model_id=model_id,
    env=env,
    role=role)
instance_type = "ml.p4d.24xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    # container_startup_health_check_timeout=3600
)

What have you tried to solve it?

I tried several LMI images, but none of them worked.

siddvenk commented 1 month ago

In your sample code, you define the 0.30.0 image_uri but never pass it to DJLModel, so it isn't actually used for the deployment.

There was an issue in vllm (https://github.com/vllm-project/vllm/issues/8288) that looks like what you are encountering. We had also observed it, and were able to work around it by setting the tensor parallel degree to 1. That applies to the 0.29.0 LMI container, which is what you are actually using here.
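For reference, here is a minimal sketch of that 0.29.0 workaround. It assumes your original setup and only changes TENSOR_PARALLEL_DEGREE: with a tensor parallel degree of 1, vLLM runs a single in-process GPU executor instead of spawning worker processes, so the remote-code Phi3VConfig class never has to be pickled across a process boundary.

env = {
    "TENSOR_PARALLEL_DEGREE": "1",            # workaround: avoids vLLM's multiprocessing executor
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_MAX_MODEL_LEN": "4096",
    "OPTION_ENFORCE_EAGER": "true",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.5",
}

Note the trade-off: this loads the model on a single GPU, so you give up the multi-GPU memory headroom of tensor parallelism.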

If you pass the 0.30.0 image_uri to the DJLModel class, this should work with tensor parallelism, since the issue was fixed in vllm. I just validated the 0.30.0 image with tensor parallelism and it works.
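For anyone else hitting this, a minimal sketch of the corrected deployment: it is the reproduction code from above, with the already-defined image_uri actually passed to DJLModel (image_uri is a standard SageMaker Model constructor argument that DJLModel accepts).

import sagemaker
from sagemaker.djl_inference.model import DJLModel

role = 'your_role'
image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124'
env = {
    "TENSOR_PARALLEL_DEGREE": "max",          # tensor parallelism works on the 0.30.0 image
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_MAX_MODEL_LEN": "4096",
    "OPTION_ENFORCE_EAGER": "true",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.5",
}
model = DJLModel(
    model_id="microsoft/Phi-3-vision-128k-instruct",
    image_uri=image_uri,  # without this, the SDK falls back to a default image instead of 0.30.0
    env=env,
    role=role)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    endpoint_name=sagemaker.utils.name_from_base("lmi-model"))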