Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

working with VLLM #53

Open kousun12 opened 1 month ago

kousun12 commented 1 month ago

I'm wondering if I can get an easier pipeline by loading the AWQ weights with vLLM:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model_id = 'Efficient-Large-Model/VILA1.5-13b-AWQ'

llm = LLM(model=model_id, quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The first issue seems to be that the config.json is trying to use a model type called llava_llama, which transformers doesn't know about.

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
    raise KeyError(key)
KeyError: 'llava_llama'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/config.py", line 121, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 38, in get_config
    raise e
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_llama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
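For reference, this is roughly how a custom model_type is normally made visible to transformers: the package that defines the architecture registers its config class with AutoConfig. The sketch below assumes the VILA package exposes a LlavaLlamaConfig under llava.model; the exact class and module names depend on the VILA version, so treat it as illustrative only.

# Hedged sketch: registering a custom model_type with transformers' Auto classes.
# Assumption: the installed VILA package provides LlavaLlamaConfig; the real
# import path and class name may differ in your checkout.
from transformers import AutoConfig
from llava.model import LlavaLlamaConfig  # assumption: shipped by the VILA repo

AutoConfig.register("llava_llama", LlavaLlamaConfig)

# After registration, AutoConfig can resolve the checkpoint's model_type
# instead of raising KeyError: 'llava_llama'.
cfg = AutoConfig.from_pretrained("Efficient-Large-Model/VILA1.5-13b-AWQ")
print(cfg.model_type)

Note that registering the config only fixes the lookup in transformers; vLLM would still need to know how to build and run the architecture itself.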

If I change the model type in config.json to just llava, I get:

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 05-09 09:38:26 config.py:205] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-09 09:38:26 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='Efficient-Large-Model/VILA1.5-13b-AWQ', speculative_config=None, tokenizer='Efficient-Large-Model/VILA1.5-13b-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Efficient-Large-Model/VILA1.5-13b-AWQ)
Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
    engine = cls(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 150, in __init__
    self._init_tokenizer()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 328, in _init_tokenizer
    self.tokenizer = get_tokenizer_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer.py", line 92, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 880, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'Efficient-Large-Model/VILA1.5-13b-AWQ'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Efficient-Large-Model/VILA1.5-13b-AWQ' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

This seems to suggest that the Llama tokenizer isn't in the llm directory? Do we need a tokenizer.json in the repo? Even if I add that, it still seems to have trouble loading the tokenizer.
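One possible workaround, sketched below under the assumption that the tokenizer files really do live in the repo's llm/ subfolder, is to download the checkpoint locally and give vLLM a separate tokenizer path. This only addresses the tokenizer lookup; further errors may still occur if the weight layout is also non-standard.

# Hedged sketch: pointing vLLM at the tokenizer inside the llm/ subfolder.
# Assumption: the AWQ repo keeps its tokenizer files under "llm".
import os
from huggingface_hub import snapshot_download
from vllm import LLM

local_dir = snapshot_download("Efficient-Large-Model/VILA1.5-13b-AWQ")
llm = LLM(
    model=local_dir,                           # config.json at the repo root
    tokenizer=os.path.join(local_dir, "llm"),  # tokenizer files in the subfolder
    quantization="awq",
    dtype="half",
)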

ys-2020 commented 1 month ago

Hi @kousun12, thanks for your interest in VILA! For the first question, which version of transformers are you using, and how did you install VILA? The model arch llava_llama should already be defined if you have installed VILA and the right version of transformers. For the second question, tokenizer.json is here, under the llm folder rather than at the root of the model repo. You may need to modify the code that loads VILA to solve the problem. Also, please make sure that the VILA model is served with the newest TinyChat backend.
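As a quick sanity check that the tokenizer is loadable from that subfolder, something along these lines should work with a recent transformers; the subfolder name is taken from the comment above and is otherwise an assumption.

# Hedged sketch: loading the tokenizer directly from the repo's llm/ subfolder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "Efficient-Large-Model/VILA1.5-13b-AWQ",
    subfolder="llm",  # assumption: tokenizer.json / tokenizer.model live here
)
print(tok.encode("Hello, my name is"))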