IBM / text-generation-inference

IBM development fork of https://github.com/huggingface/text-generation-inference
Apache License 2.0

Problem loading granite-3b in small MIG partitions #104

Open ccamacho opened 4 months ago

ccamacho commented 4 months ago

Describe the bug

Deploying granite-3b in a MIG partition that is too small for the model fails with a misleading low-level error (a PyTorch NVML internal assert) instead of a clear message about insufficient GPU memory.

To Reproduce

Deploy granite-3b with TGIS in a small MIG partition (deployment_framework hf_transformers, launcher args as in the log below); shard 0 crashes while loading the model. A standalone sketch of the same failure mode is below.
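A minimal sketch, assuming a pod whose only visible CUDA device is a small MIG slice (the layer sizes and device setup here are illustrative assumptions, not taken from this issue):

```python
# Minimal repro sketch: the traceback below shows transformers building
# nn.Linear weights directly on the device, layer by layer; exhausting the
# MIG slice the same way can surface the NVML internal assert instead of a
# clean torch.cuda.OutOfMemoryError.
import torch

layers = []
try:
    # Same call chain as the traceback: nn.Linear -> torch.empty on CUDA.
    # Each fp16 8192x8192 layer is 128 MiB; 1024 of them exceed any MIG slice.
    for _ in range(1024):
        layers.append(
            torch.nn.Linear(8192, 8192, bias=False,
                            device="cuda", dtype=torch.float16)
        )
except RuntimeError as e:  # torch.cuda.OutOfMemoryError subclasses RuntimeError
    print(type(e).__name__, e)
```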

Expected output

The inference service comes up, or TGIS fails with a detailed error explaining why the model cannot be loaded; a sketch of what such a check could look like follows.
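A hedged sketch of a pre-flight capacity check that could produce such a detailed error (this is not the existing TGIS code path; the weight estimate is a stand-in computed from parameter count and dtype width):

```python
# Sketch of a pre-flight check TGIS could run before from_pretrained, so an
# undersized MIG partition fails with an actionable message instead of a
# low-level allocator assert.
import torch

def check_device_capacity(estimated_weight_bytes: int) -> None:
    free, total = torch.cuda.mem_get_info()
    if estimated_weight_bytes > free:
        raise RuntimeError(
            f"Model weights need ~{estimated_weight_bytes / 2**30:.1f} GiB "
            f"but the visible device (possibly a MIG partition) has only "
            f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB; "
            f"use a larger MIG profile or a full GPU."
        )

# e.g. granite-3b in fp16: ~3e9 parameters * 2 bytes, before KV cache,
# activations, and CUDA context overhead.
check_device_capacity(int(3e9 * 2))
```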

Actual error

2024-06-24T10:33:54.484964Z  INFO text_generation_launcher: TGIS Commit hash: 
2024-06-24T10:33:54.484984Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: None, quantize: None, num_shard: None, max_concurrent_requests: 512, max_sequence_length: Some(448), max_new_tokens: 384, max_batch_size: 64, max_prefill_padding: 0.2, batch_safety_margin: 20, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0, default_include_stop_seqs: true, otlp_endpoint: None, otlp_service_name: None }
2024-06-24T10:33:54.484997Z  INFO text_generation_launcher: Inferring num_shard = 1 from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
2024-06-24T10:33:54.485049Z  INFO text_generation_launcher: Saving fast tokenizer for `/mnt/models/` to `/tmp/74657ff2-73b1-45f2-b8d5-a7302a63f862`
/opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2024-06-24T10:33:56.397996Z  INFO text_generation_launcher: Using configured max_sequence_length: 448
2024-06-24T10:33:56.398022Z  INFO text_generation_launcher: Setting PYTORCH_CUDA_ALLOC_CONF to default value: expandable_segments:True
2024-06-24T10:33:56.398340Z  INFO text_generation_launcher: Starting shard 0
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0:   warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: supports_causal_lm = True, supports_seq2seq_lm = False
Shard 0: Traceback (most recent call last):
Shard 0: 
Shard 0:   File "/opt/tgis/bin/text-generation-server", line 8, in <module>
Shard 0:     sys.exit(app())
Shard 0:              ^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 75, in serve
Shard 0:     raise e
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/cli.py", line 56, in serve
Shard 0:     server.serve(
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 388, in serve
Shard 0:     asyncio.run(
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 190, in run
Shard 0:     return runner.run(main)
Shard 0:            ^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/runners.py", line 118, in run
Shard 0:     return self._loop.run_until_complete(task)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
Shard 0:     return future.result()
Shard 0:            ^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 267, in serve_inner
Shard 0:     model = get_model(
Shard 0:             ^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 126, in get_model
Shard 0:     return CausalLM(model_name, revision, deployment_framework, dtype, quantize, model_config, max_sequence_length)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/causal_lm.py", line 558, in __init__
Shard 0:     inference_engine = get_inference_engine_class(deployment_framework)(
Shard 0:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/inference_engine/hf_transformers.py", line 76, in __init__
Shard 0:     self.model = model_class.from_pretrained(**kwargs).requires_grad_(False).eval()
Shard 0:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
Shard 0:     return model_class.from_pretrained(
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3375, in from_pretrained
Shard 0:     model = cls(config, *model_args, **model_kwargs)
Shard 0:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1103, in __init__
Shard 0:     self.model = LlamaModel(config)
Shard 0:                  ^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in __init__
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 924, in <listcomp>
Shard 0:     [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
Shard 0:      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 701, in __init__
Shard 0:     self.mlp = LlamaMLP(config)
Shard 0:                ^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
Shard 0:     self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
Shard 0:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 98, in __init__
Shard 0:     self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
Shard 0:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0:   File "/opt/tgis/lib/python3.11/site-packages/torch/utils/_device.py", line 77, in __torch_function__
Shard 0:     return func(*args, **kwargs)
Shard 0:            ^^^^^^^^^^^^^^^^^^^^^
Shard 0: 
Shard 0: RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":830, please report a bug to PyTorch. 
Shard 0: 
2024-06-24T10:34:00.379801Z ERROR text_generation_launcher: Shard 0 failed: ExitStatus(unix_wait_status(256))
2024-06-24T10:34:00.400918Z  INFO text_generation_launcher: Shutting down shards
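Note that the assert fires inside PyTorch's CUDA caching allocator (c10/cuda/CUDACachingAllocator.cpp), not in TGIS code, and the launcher log above shows TGIS applying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True only as a default value. Expandable segments have known problems on MIG devices in some PyTorch releases, so, as an unconfirmed assumption, overriding that default from the pod environment might turn the internal assert back into an ordinary out-of-memory error:

```python
# Unconfirmed assumption, not a verified fix: disable expandable segments
# before torch initializes CUDA. In a deployment this would be a container
# environment variable; the launcher log suggests TGIS only injects its own
# value when the variable is unset ("Setting ... to default value").
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"

import torch  # must be imported after the variable is set
print(torch.cuda.is_available())
```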

Workaround

Deploy the model in a larger MIG partition. A helper sketch for checking what the node offers follows.
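A rough sketch for inspecting the MIG devices and instances available on the node (nvidia-smi -L and nvidia-smi mig -lgi are standard NVIDIA CLI commands; the pod needs access to the NVIDIA tools to run them):

```python
# List the visible devices and MIG GPU instances, then pick a profile whose
# memory comfortably exceeds the model footprint.
import subprocess

for cmd in (["nvidia-smi", "-L"], ["nvidia-smi", "mig", "-lgi"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print("$", " ".join(cmd))
    print(result.stdout or result.stderr)
```

For reference, granite-3b's roughly 3B parameters need about 6 GiB in fp16 for the weights alone, before KV cache and activations, so the smallest MIG profiles are unlikely to fit it.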

sumaiya1996 commented 1 month ago

🤸‍♀️