Describe the bug
When a model is deployed on a MIG partition that is too small to hold its weights, TGIS crashes with a misleading PyTorch internal assertion (`NVML_SUCCESS == r INTERNAL ASSERT FAILED` in CUDACachingAllocator.cpp) instead of a clear error explaining that the device does not have enough memory.
To Reproduce
1. Create a small MIG partition and expose it as the only visible device for the serving pod.
2. Deploy a model whose weights do not fit in the partition (a Llama model in this case) with TGIS, using the hf_transformers deployment framework (see the launcher args below).
3. The shard fails on startup with the error shown under "Actual error". A standalone sketch of the same failure mode follows these steps.
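The failure can also be sketched without TGIS. This is a minimal, hypothetical reproduction assuming the only visible device is a small MIG slice; the layer sizes are illustrative, and depending on driver and allocator state it raises either the NVML internal assert from the log below or a plain `torch.cuda.OutOfMemoryError`:

```python
# Hypothetical standalone repro; assumes CUDA_VISIBLE_DEVICES points at a
# small MIG slice. Sizes below are illustrative, not from the real model.
import os

# TGIS sets this allocator config by default (see the launcher log below).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# TGIS materializes parameters directly on the GPU via a device context,
# which is why the traceback below ends in torch/utils/_device.py.
with torch.device("cuda"):
    # ~8 GiB of fp16 weights, far more than a small MIG slice can hold.
    layers = [
        torch.nn.Linear(8192, 8192, bias=False, dtype=torch.float16)
        for _ in range(64)
    ]
```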
Expected output
Either the inference service comes up, or TGIS fails with a detailed error explaining why the model cannot be served (here: the MIG partition is too small for the model weights).
Actual error
2024-06-24T10:33:54.484964Z  INFO text_generation_launcher: TGIS Commit hash:
2024-06-24T10:33:54.484984Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: None, quantize: None, num_shard: None, max_concurrent_requests: 512, max_sequence_length: Some(448), max_new_tokens: 384, max_batch_size: 64, max_prefill_padding: 0.2, batch_safety_margin: 20, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0, default_include_stop_seqs: true, otlp_endpoint: None, otlp_service_name: None }
2024-06-24T10:33:54.484997Z  INFO text_generation_launcher: Inferring num_shard = 1 from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
2024-06-24T10:33:54.485049Z  INFO text_generation_launcher: Saving fast tokenizer for `/mnt/models/` to `/tmp/74657ff2-73b1-45f2-b8d5-a7302a63f862`
/opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2024-06-24T10:33:56.397996Z  INFO text_generation_launcher: Using configured max_sequence_length: 448
2024-06-24T10:33:56.398022Z  INFO text_generation_launcher: Setting PYTORCH_CUDA_ALLOC_CONF to default value: expandable_segments:True
2024-06-24T10:33:56.398340Z  INFO text_generation_launcher: Starting shard 0
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0:   warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: supports_causal_lm = True, supports_seq2seq_lm = False
Shard 0: Traceback (most recent call last):
Shard 0:   File "/opt/tgis/bin/text-generation-server", line 8, in <module>
Shard 0:     sys.exit(app())
Shard 0:              ^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 75, in serve
Shard 0:     raise e
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 56, in serve
Shard 0:     server.serve(
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 388, in serve
Shard 0:     asyncio.run(
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 190, in run
Shard 0:     return runner.run(main)
Shard 0:            ^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 118, in run
Shard 0:     return self._loop.run_until_complete(task)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
Shard 0:     return future.result()
Shard 0:            ^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 267, in serve_inner
Shard 0:     model = get_model(
Shard 0:             ^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 126, in get_model
Shard 0:     return CausalLM(model_name, revision, deployment_framework, dtype, quantize, model_config, max_sequence_length)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/causal_lm.py", line 558, in __init__
Shard 0:     inference_engine = get_inference_engine_class(deployment_framework)(
Shard 0:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/inference_engine/hf_transformers.py", line 76, in __init__
Shard 0:     self.model = model_class.from_pretrained(**kwargs).requires_grad_(False).eval()
Shard 0:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
Shard 0:     return model_class.from_pretrained(
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
Shard 0:     model = cls(config, *model_args, **model_kwargs)
Shard 0:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
Shard 0:     self.model = LlamaModel(config)
Shard 0:                  ^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 701, in __init__
Shard 0:     self.mlp = LlamaMLP(config)
Shard 0:                ^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
Shard 0:     self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
Shard 0:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 98, in __init__
Shard 0:     self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
Shard 0:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/utils/_device.py", line 77, in __torch_function__
Shard 0:     return func(*args, **kwargs)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^
Shard 0: RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":830, please report a bug to PyTorch.
2024-06-24T10:34:00.379801Z ERROR text_generation_launcher: Shard 0 failed: ExitStatus(unix_wait_status(256))
2024-06-24T10:34:00.400918Z  INFO text_generation_launcher: Shutting down shards
Workaround
Deploy the model on a larger MIG partition (or a full GPU) so that the weights fit in device memory.
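Longer term, a preflight check along these lines could turn the internal assert into the detailed error this issue asks for. This is only a sketch under assumptions: the checkpoint layout (*.safetensors files under /mnt/models/), the `check_fits` helper name, and the 20% overhead margin are illustrative, and it ignores dtype conversion on load:

```python
# Hedged sketch of a preflight memory check, not actual TGIS behavior.
from pathlib import Path

import torch

def check_fits(model_dir: str, margin: float = 0.2) -> None:
    # Estimate the on-device footprint from the checkpoint files on disk.
    weight_bytes = sum(
        p.stat().st_size for p in Path(model_dir).glob("*.safetensors")
    )
    needed = int(weight_bytes * (1 + margin))  # margin is an assumption
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    if needed > free_bytes:
        raise RuntimeError(
            f"Model needs ~{needed / 2**30:.1f} GiB but the visible device "
            f"has only {free_bytes / 2**30:.1f} GiB free "
            f"({total_bytes / 2**30:.1f} GiB total). "
            "Use a larger MIG partition or a full GPU."
        )

check_fits("/mnt/models/")
```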