Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

working with VLLM #53

Open kousun12 opened 1 month ago

kousun12 commented 1 month ago

I'm wondering if I can get an easier pipeline by loading the AWQ weights with vLLM:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model_id = 'Efficient-Large-Model/VILA1.5-13b-AWQ'

llm = LLM(model=model_id, quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The first issue seems to be that the config.json is trying to use a model type called llava_llama, which transformers doesn't know about.

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
    raise KeyError(key)
KeyError: 'llava_llama'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/config.py", line 121, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 38, in get_config
    raise e
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_llama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
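For reference, this is roughly how a custom model_type is normally made visible to transformers: the package that defines the architecture registers its config class with AutoConfig. The sketch below assumes the VILA package exposes a LlavaLlamaConfig under llava.model; the exact class and module names depend on the VILA version, so treat it as illustrative only.

# Hedged sketch: registering a custom model_type with transformers' Auto classes.
# Assumption: the installed VILA package provides LlavaLlamaConfig; the real
# import path and class name may differ in your checkout.
from transformers import AutoConfig
from llava.model import LlavaLlamaConfig  # assumption: shipped by the VILA repo

AutoConfig.register("llava_llama", LlavaLlamaConfig)

# After registration, AutoConfig can resolve the checkpoint's model_type
# instead of raising KeyError: 'llava_llama'.
cfg = AutoConfig.from_pretrained("Efficient-Large-Model/VILA1.5-13b-AWQ")
print(cfg.model_type)

Note that registering the config only fixes the lookup in transformers; vLLM would still need to know how to build and run the architecture itself.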

If I change the model type in config.json to just llava, I get:

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 05-09 09:38:26 config.py:205] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-09 09:38:26 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='Efficient-Large-Model/VILA1.5-13b-AWQ', speculative_config=None, tokenizer='Efficient-Large-Model/VILA1.5-13b-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Efficient-Large-Model/VILA1.5-13b-AWQ)
Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
    engine = cls(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 150, in __init__
    self._init_tokenizer()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 328, in _init_tokenizer
    self.tokenizer = get_tokenizer_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer.py", line 92, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 880, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'Efficient-Large-Model/VILA1.5-13b-AWQ'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Efficient-Large-Model/VILA1.5-13b-AWQ' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

This seems to suggest that the Llama tokenizer isn't in the llm directory? Do we need a tokenizer.json in the repo? Even if I add that, it still seems to have trouble loading the tokenizer.
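One possible workaround, sketched below under the assumption that the tokenizer files really do live in the repo's llm/ subfolder, is to download the checkpoint locally and give vLLM a separate tokenizer path. This only addresses the tokenizer lookup; further errors may still occur if the weight layout is also non-standard.

# Hedged sketch: pointing vLLM at the tokenizer inside the llm/ subfolder.
# Assumption: the AWQ repo keeps its tokenizer files under "llm".
import os
from huggingface_hub import snapshot_download
from vllm import LLM

local_dir = snapshot_download("Efficient-Large-Model/VILA1.5-13b-AWQ")
llm = LLM(
    model=local_dir,                           # config.json at the repo root
    tokenizer=os.path.join(local_dir, "llm"),  # tokenizer files in the subfolder
    quantization="awq",
    dtype="half",
)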

ys-2020 commented 1 month ago

Hi @kousun12, thanks for your interest in VILA! For the first question, which version of transformers are you using, and how did you install VILA? The model arch llava_llama should already be defined if you have installed VILA and the right version of transformers. For the second question, tokenizer.json is here, under the llm folder rather than at the root of the model repo. You may need to modify the code that loads VILA to solve the problem. Also, please make sure that the VILA model is served with the newest TinyChat backend.
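As a quick sanity check that the tokenizer is loadable from that subfolder, something along these lines should work with a recent transformers; the subfolder name is taken from the comment above and is otherwise an assumption.

# Hedged sketch: loading the tokenizer directly from the repo's llm/ subfolder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "Efficient-Large-Model/VILA1.5-13b-AWQ",
    subfolder="llm",  # assumption: tokenizer.json / tokenizer.model live here
)
print(tok.encode("Hello, my name is"))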