deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Model conversion process failed when deploying Mixtral 8x22B AWQ with djl-tensorrtllm to Sagemaker #3343

Open gsjoy8888 opened 1 month ago

gsjoy8888 commented 1 month ago

Description

The model conversion process failed with the djl-tensorrtllm container and the serving.properties below:

image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=sess.boto_session.region_name,
    version="0.28.0"
)

%%writefile serving.properties
engine=MPI
option.model_id=MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ
option.tensor_parallel_degree=4
option.quantize=awq
option.max_num_tokens=8192
option.max_rolling_batch_size=8
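
For completeness, this is roughly how the two snippets above fit together. A minimal deployment sketch assuming the standard SageMaker Python SDK flow; the tarball name, bucket prefix, instance type, and endpoint name are placeholders, not values from the original setup:

import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=sess.boto_session.region_name,
    version="0.28.0",
)

# Package serving.properties into a tarball and upload it; the container reads
# the option.* settings from there at startup. (Names are placeholders.)
code_artifact = sess.upload_data("mymodel.tar.gz", sess.default_bucket(), "lmi")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",  # assumption: enough GPUs for tensor_parallel_degree=4
    endpoint_name="mixtral-awq-trtllm",
)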


Error Message

1721194930489  [INFO ] LmiUtils - Detected mpi_mode: true, rolling_batch: trtllm, tensor_parallel_degree 4, for modelType: mixtral
1721194930489  [INFO ] ModelInfo - M-0001: Apply per model settings: job_queue_size: 1000 max_dynamic_batch_size: 1 max_batch_delay: 100 max_idle_time: 60 load_on_devices: * engine: MPI mpi_mode: true option.entryPoint: null option.tensor_parallel_degree: 4 option.max_rolling_batch_size: 8 option.quantize: awq option.mpi_mode: true option.max_num_tokens: 8192 option.model_id: MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ option.rolling_batch: trtllm
1721194933027  [INFO ] LmiUtils - Converting model to TensorRT-LLM artifacts
1721194933027  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] PyTorch version 2.2.1 available.
1721194933493  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] JAX version 0.4.30 available.
1721194933493  [INFO ] LmiUtils - convert_py: [TensorRT-LLM] TensorRT-LLM version: 0.9.0
1721194933493  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Received kwargs for tensorrt_llm_toolkit.create_model_repo: dict_items([('engine', 'MPI'), ('model_id', 'MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ'), ('tensor_parallel_degree', 4), ('quantize', 'awq'), ('max_num_tokens', '8192'), ('max_rolling_batch_size', '8'), ('trt_llm_model_repo', '/tmp/.djl.ai/trtllm/c1e40db56ea23fb1ec359dff353cdb9a752a827c')])
1721194933493  [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
1721194933743  [INFO ] LmiUtils - convert_py:   warnings.warn(
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Selecting ModelBuilder
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Configuring model (will download if not available locally): MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Using llama scripts for model type: mixtral
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Compiling HuggingFace model into TensorRT engine...
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Updating TRT config...
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] The following overrides are final. Some of them are specifically set by LMI to provide the best compilation experience.
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] Model Config Override: qformat=int4_awq
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] Model Config Override: calib_size=512
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] Model Config Override: kv_cache_dtype=int8
1721194933743  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Quantizing HF checkpoint to TRT checkpoint...
1721194938596  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Running command: python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py --model_dir MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --qformat int4_awq --kv_cache_dtype int8 --calib_size 512 --batch_size 32 --tp_size 4 --awq_block_size 64
1721194939003  [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][ERROR] Exit code: 1 for command: python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py --model_dir MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --qformat int4_awq --kv_cache_dtype int8 --calib_size 512 --batch_size 32 --tp_size 4 --awq_block_size 64
1721194939003  [INFO ] LmiUtils - convert_py: [TensorRT-LLM] TensorRT-LLM version: 0.9.0
1721194939003  [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
1721194939003  [INFO ] LmiUtils - convert_py:   warnings.warn(
1721194939003  [INFO ] LmiUtils - convert_py: Initializing model from MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ
1721194939003  [INFO ] LmiUtils - convert_py: Traceback (most recent call last):
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py", line 52, in <module>
1721194939003  [INFO ] LmiUtils - convert_py:     quantize_and_export(model_dir=args.model_dir,
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 268, in quantize_and_export
1721194939003  [INFO ] LmiUtils - convert_py:     model = get_model(model_dir, dtype, device)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 163, in get_model
1721194939003  [INFO ] LmiUtils - convert_py:     model = AutoModelForCausalLM.from_pretrained(ckpt_path,
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
1721194939003  [INFO ] LmiUtils - convert_py:     return model_class.from_pretrained(
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3155, in from_pretrained
1721194939003  [INFO ] LmiUtils - convert_py:     config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/auto.py", line 149, in merge_quantization_configs
1721194939003  [INFO ] LmiUtils - convert_py:     quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/auto.py", line 79, in from_dict
1721194939003  [INFO ] LmiUtils - convert_py:     return target_cls.from_dict(quantization_config_dict)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/quantization_config.py", line 94, in from_dict
1721194939003  [INFO ] LmiUtils - convert_py:     config = cls(**config_dict)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/quantization_config.py", line 693, in __init__
1721194939003  [INFO ] LmiUtils - convert_py:     self.post_init()
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/quantization_config.py", line 746, in post_init
1721194939003  [INFO ] LmiUtils - convert_py:     raise ValueError(
1721194939003  [INFO ] LmiUtils - convert_py: ValueError: You current version of autoawq does not support module quantization skipping, please upgrade autoawq package to at least 0.1.8.
1721194939003  [INFO ] LmiUtils - convert_py: Traceback (most recent call last):
1721194939003  [INFO ] LmiUtils - convert_py:   File "/opt/djl/partition/trt_llm_partition.py", line 69, in <module>
1721194939003  [INFO ] LmiUtils - convert_py:     main()
1721194939003  [INFO ] LmiUtils - convert_py:   File "/opt/djl/partition/trt_llm_partition.py", line 65, in main
1721194939003  [INFO ] LmiUtils - convert_py:     create_trt_llm_repo(properties, args)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/opt/djl/partition/trt_llm_partition.py", line 33, in create_trt_llm_repo
1721194939003  [INFO ] LmiUtils - convert_py:     create_model_repo(model_id_or_path, **kwargs)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/__init__.py", line 61, in create_model_repo
1721194939003  [INFO ] LmiUtils - convert_py:     model.compile_model()
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmmodel/modelbuilder.py", line 128, in compile_model
1721194939003  [INFO ] LmiUtils - convert_py:     self.quantize_checkpoint_from_lmi_config()
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmcheckpoint/checkpointbuilder.py", line 382, in quantize_checkpoint_from_lmi_config
1721194939003  [INFO ] LmiUtils - convert_py:     self.quantize_checkpoint(lmi_args)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmcheckpoint/checkpointbuilder.py", line 340, in quantize_checkpoint
1721194939003  [INFO ] LmiUtils - convert_py:     exec_command(quantize_checkpoint_cmd)
1721194939003  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/utils/utils.py", line 168, in exec_command
1721194939003  [INFO ] LmiUtils - convert_py:     raise subprocess.CalledProcessError(proc.returncode, proc.args)
1721194939755  [INFO ] LmiUtils - convert_py: subprocess.CalledProcessError: Command 'python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py --model_dir MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --qformat int4_awq --kv_cache_dtype int8 --calib_size 512 --batch_size 32 --tp_size 4 --awq_block_size 64' returned non-zero exit status 1.
1721194939755  [ERROR] ModelServer - Failed register workflow
1721194939755  java.util.concurrent.CompletionException: ai.djl.engine.EngineException: Model conversion process failed!
1721194939755      at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) ~[?:?]
1721194939755      at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320) [?:?]
1721194939755      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770) [?:?]
1721194939756      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760) [?:?]
1721194939756      at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [?:?]
1721194939756      at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) [?:?]
1721194939756      at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [?:?]
1721194939756      at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [?:?]
1721194939756      at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [?:?]
1721194939756  Caused by: ai.djl.engine.EngineException: Model conversion process failed!
1721194939756      at ai.djl.serving.wlm.LmiUtils.buildTrtLlmArtifacts(LmiUtils.java:338) ~[wlm-0.28.0.jar:?]
1721194939756      at ai.djl.serving.wlm.LmiUtils.convertTrtLLM(LmiUtils.java:133) ~[wlm-0.28.0.jar:?]
1721194939756      at ai.djl.serving.wlm.ModelInfo.initialize(ModelInfo.java:538) ~[wlm-0.28.0.jar:?]
1721194939756      at ai.djl.serving.models.ModelManager.lambda$registerWorkflow$2(ModelManager.java:105) ~[serving-0.28.0.jar:?]
1721194939756      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[?:?]
1721194941759      ... 6 more
1721194941759  [INFO ] ModelServer - Model server stopped.
1721194941759  [ERROR] ModelServer - Unexpected error
1721194941759  ai.djl.serving.http.ServerStartupException: Failed to initialize startup models and workflows
1721194941759      at ai.djl.serving.ModelServer.start(ModelServer.java:210) ~[serving-0.28.0.jar:?]
1721194941759      at ai.djl.serving.ModelServer.startAndWait(ModelServer.java:174) ~[serving-0.28.0.jar:?]
1721194941759      at ai.djl.serving.ModelServer.main(ModelServer.java:143) [serving-0.28.0.jar:?]
1721194941759  Caused by: java.util.concurrent.CompletionException: ai.djl.engine.EngineException: Model conversion process failed!
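
The bottom of the first traceback is a transformers-side check, not a DJL one: the AWQ checkpoint's quantization_config requests module quantization skipping, which transformers only accepts when autoawq >= 0.1.8 is installed. A quick way to confirm what the checkpoint declares, as a sketch (assumes huggingface_hub is available and the Hub is reachable; the model id is the one from the report above):

import json
from huggingface_hub import hf_hub_download

# Download only the model's config.json and inspect its embedded quantization_config.
path = hf_hub_download("MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ", "config.json")
with open(path) as f:
    config = json.load(f)
print(json.dumps(config.get("quantization_config"), indent=2))
# If this prints a "modules_to_not_convert" entry, the container's autoawq
# (apparently < 0.1.8) cannot load the checkpoint, matching the ValueError above.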

frankfliu commented 1 month ago

@ydm-amazon Please take a look.

gsjoy8888 commented 1 month ago

It seems that djl-tensorrtllm cannot convert an already-quantized model, though I'm not sure that was the issue. So I tried the unquantized mistralai/Mixtral-8x7B-Instruct-v0.1 instead, and the conversion failed again with the message below:

model = sagemaker.Model(
    image_uri=image_uri, 
    role=role,
    # specify all environment variable configs in this map
    env={
        "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_MAX_NUM_TOKENS": "8192",
        "OPTION_QUANTIZE": "awq",
        "HF_TOKEN": "hf_xNBRqleBjkvQPxxxxxxxxxxxxxxx",
    }
)
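
If the HF_/OPTION_ environment-variable mapping works the usual LMI way, this configuration should be equivalent to a serving.properties along the lines of the sketch below; note the --tp_size 8 in the log that follows, consistent with TENSOR_PARALLEL_DEGREE=max resolving to the instance's GPU count:

option.model_id=mistralai/Mixtral-8x7B-Instruct-v0.1
option.tensor_parallel_degree=max
option.max_num_tokens=8192
option.quantize=awq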

1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 71%|███████ | 203409/287113 [00:02<00:01, 77139.42 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 74%|████████ | 211409/287113 [00:02<00:00, 76954.50 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 76%|████████ | 219409/287113 [00:03<00:00, 76684.58 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 80%|████████ | 228409/287113 [00:03<00:00, 76042.56 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 82%|█████████ | 236409/287113 [00:03<00:00, 74224.87 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 85%|█████████ | 244409/287113 [00:03<00:00, 72637.33 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 88%|█████████ | 252409/287113 [00:03<00:00, 71742.55 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 91%|█████████ | 260409/287113 [00:03<00:00, 70389.71 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 93%|██████████| 268409/287113 [00:03<00:00, 70755.62 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 96%|██████████| 276409/287113 [00:03<00:00, 71027.88 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 99%|██████████| 284409/287113 [00:03<00:00, 69815.91 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating train split: 100%|██████████| 287113/287113 [00:03<00:00, 72160.21 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py:
1721291393048  [INFO ] LmiUtils - convert_py: Generating validation split: 0%| | 0/13368 [00:00<?, ? examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating validation split: 67%|███████ | 9000/13368 [00:00<00:00, 76879.14 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 73118.40 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py:
1721291393048  [INFO ] LmiUtils - convert_py: Generating test split: 0%| | 0/11490 [00:00<?, ? examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating test split: 78%|████████ | 9000/11490 [00:00<00:00, 75469.55 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 72635.15 examples/s]
1721291393048  [INFO ] LmiUtils - convert_py: {'quant_cfg': {'weight_quantizer': {'num_bits': 4, 'block_sizes': {-1: 64}, 'enable': True}, 'input_quantizer': {'enable': False}, 'lm_head': {'enable': False}, 'output_layer': {'enable': False}, 'default': {'enable': False}, '*.query_key_value.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.Wqkv.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.W_pack.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.c_attn.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.k_proj.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '*.v_proj.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}}, 'algorithm': {'method': 'awq_lite', 'alpha_step': 0.1}}
1721291393048  [INFO ] LmiUtils - convert_py: Starting quantization...
1721291393048  [INFO ] LmiUtils - convert_py: Replaced 2787 modules to quantized modules
1721291393048  [INFO ] LmiUtils - convert_py: Caching activation statistics for awq_lite...
1721291393048  [INFO ] LmiUtils - convert_py: Calibrating batch 0
1721291393048  [INFO ] LmiUtils - convert_py: Loading extension ammo_cuda_ext...
1721291393048  [INFO ] LmiUtils - convert_py: Loading extension ammo_cuda_ext_fp8...
1721291393048  [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
1721291393048  [INFO ] LmiUtils - convert_py:   self.register_buffer("_pre_quant_scale", torch.tensor(value))
1721291393048  [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
1721291393048  [INFO ] LmiUtils - convert_py:   value = torch.tensor(value, device=self._pre_quant_scale.device)
1721291393048  [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/numpy/lib/format.py:362: UserWarning: metadata on a dtype is not saved to an npy/npz. Use another format (such as pickle) to store it.
1721291393048  [INFO ] LmiUtils - convert_py:   d['descr'] = dtype_to_descr(array.dtype)
1721291393048  [INFO ] LmiUtils - convert_py: Searching awq_lite parameters...
1721291393049  [INFO ] LmiUtils - convert_py: Calibrating batch 0
1721291393049  [INFO ] LmiUtils - convert_py: Calibrating batch 0
1721291393049  [INFO ] LmiUtils - convert_py: Quantization done. Total time used: 257.33 s.
1721291393049  [INFO ] LmiUtils - convert_py: Unknown model type MixtralForCausalLM. Continue exporting...
1721291393049  [INFO ] LmiUtils - convert_py: Warning: export_npz is going to be deprecated soon and replaced by safetensors.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: current rank: 0, tp rank: 0, pp rank: 0
1721291393049  [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon.
1721291393049  [INFO ] LmiUtils - convert_py: Traceback (most recent call last):
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py", line 52, in <module>
1721291393049  [INFO ] LmiUtils - convert_py:     quantize_and_export(model_dir=args.model_dir,
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
1721291393049  [INFO ] LmiUtils - convert_py:     with safetensors.safe_open(f"{export_path}/rank0.safetensors",
1721291393049  [INFO ] LmiUtils - convert_py: FileNotFoundError: No such file or directory: "/tmp/trtllm_llama_ckpt//rank0.safetensors"
1721291393049  [INFO ] LmiUtils - convert_py: Traceback (most recent call last):
1721291393049  [INFO ] LmiUtils - convert_py:   File "/opt/djl/partition/trt_llm_partition.py", line 69, in <module>
1721291393049  [INFO ] LmiUtils - convert_py:     main()
1721291393049  [INFO ] LmiUtils - convert_py:   File "/opt/djl/partition/trt_llm_partition.py", line 65, in main
1721291393049  [INFO ] LmiUtils - convert_py:     create_trt_llm_repo(properties, args)
1721291393049  [INFO ] LmiUtils - convert_py:   File "/opt/djl/partition/trt_llm_partition.py", line 33, in create_trt_llm_repo
1721291393049  [INFO ] LmiUtils - convert_py:     create_model_repo(model_id_or_path, **kwargs)
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/__init__.py", line 61, in create_model_repo
1721291393049  [INFO ] LmiUtils - convert_py:     model.compile_model()
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmmodel/modelbuilder.py", line 128, in compile_model
1721291393049  [INFO ] LmiUtils - convert_py:     self.quantize_checkpoint_from_lmi_config()
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmcheckpoint/checkpointbuilder.py", line 382, in quantize_checkpoint_from_lmi_config
1721291393049  [INFO ] LmiUtils - convert_py:     self.quantize_checkpoint(lmi_args)
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmcheckpoint/checkpointbuilder.py", line 340, in quantize_checkpoint
1721291393049  [INFO ] LmiUtils - convert_py:     exec_command(quantize_checkpoint_cmd)
1721291393049  [INFO ] LmiUtils - convert_py:   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/utils/utils.py", line 168, in exec_command
1721291393049  [INFO ] LmiUtils - convert_py:     raise subprocess.CalledProcessError(proc.returncode, proc.args)
1721291394303  [INFO ] LmiUtils - convert_py: subprocess.CalledProcessError: Command 'python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --qformat int4_awq --kv_cache_dtype int8 --calib_size 512 --batch_size 32 --tp_size 8 --awq_block_size 64' returned non-zero exit status 1.
1721291394303  [ERROR] ModelServer - Failed register workflow
1721291394303  java.util.concurrent.CompletionException: ai.djl.engine.EngineException: Model conversion process failed!
1721291394303      at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) ~[?:?]
1721291394303      at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320) [?:?]
1721291394303      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770) [?:?]
1721291394303      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760) [?:?]
1721291394303      at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [?:?]
1721291394303      at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) [?:?]
1721291394303      at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [?:?]
1721291394303      at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [?:?]
1721291394303      at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [?:?]
1721291394303  Caused by: ai.djl.engine.EngineException: Model conversion process failed!
1721291394303      at ai.djl.serving.wlm.LmiUtils.buildTrtLlmArtifacts(LmiUtils.java:338) ~[wlm-0.28.0.jar:?]
1721291394303      at ai.djl.serving.wlm.LmiUtils.convertTrtLLM(LmiUtils.java:133) ~[wlm-0.28.0.jar:?]
1721291394303      at ai.djl.serving.wlm.ModelInfo.initialize(ModelInfo.java:538) ~[wlm-0.28.0.jar:?]
1721291394303      at ai.djl.serving.models.ModelManager.lambda$registerWorkflow$2(ModelManager.java:105) ~[serving-0.28.0.jar:?]
1721291394303      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[?:?]
1721291396810      ... 6 more
1721291396810  [INFO ] ModelServer - Model server stopped.
1721291396810  [ERROR] ModelServer - Unexpected error
1721291396810  ai.djl.serving.http.ServerStartupException: Failed to initialize startup models and workflows
1721291396810      at ai.djl.serving.ModelServer.start(ModelServer.java:210) ~[serving-0.28.0.jar:?]
1721291396810      at ai.djl.serving.ModelServer.startAndWait(ModelServer.java:174) ~[serving-0.28.0.jar:?]
1721291396810      at ai.djl.serving.ModelServer.main(ModelServer.java:143) [serving-0.28.0.jar:?]
1721291396810  Caused by: java.util.concurrent.CompletionException: ai.djl.engine.EngineException: Model conversion process failed!
1721291396810      at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) ~[?:?]
1721291396810      at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320) ~[?:?]
1721291396810      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770) ~[?:?]
1721291396810      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) ~[?:?]
1721291396810  Caused by: ai.djl.engine.EngineException: Model conversion process failed!
1721291396810      at ai.djl.serving.wlm.LmiUtils.buildTrtLlmArtifacts(LmiUtils.java:338) ~[wlm-0.28.0.jar:?]
1721291396810      at ai.djl.serving.wlm.LmiUtils.convertTrtLLM(LmiUtils.java:133) ~[wlm-0.28.0.jar:?]
1721291396810      at ai.djl.serving.wlm.ModelInfo.initialize(ModelInfo.java:538) ~[wlm-0.28.0.jar:?]
1721291396810      at ai.djl.serving.models.ModelManager.lambda$registerWorkflow$2(ModelManager.java:105) ~[serving-0.28.0.jar:?]
1721291396810      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[?:?]
1721291396810      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1760) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) ~[?:?]
1721291396810      at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) ~[?:?]
1721291398098      at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) ~[?:?]
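
This second failure looks different from the first: quantization itself finishes ("Quantization done"), but the exporter does not recognize MixtralForCausalLM and appears to fall back to the legacy NPZ format, after which the caller still expects a rank0.safetensors file that was never written. A quick check one could run inside the container to see what the export step actually produced; only the path from the log above is used:

import os

# List the export directory from the log to confirm NPZ vs. safetensors output.
export_path = "/tmp/trtllm_llama_ckpt/"
for name in sorted(os.listdir(export_path)):
    print(name)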

ydm-amazon commented 1 month ago

Thanks for the detailed information; I will look into it more today!