InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Could not use my local internVL mini model for inference #1990

Open shiva-vardhineedi opened 4 months ago

shiva-vardhineedi commented 4 months ago

Describe the bug

It looks like lmdeploy downloads models from Hugging Face by default. The docs mention how to change the model source, but not how to use a locally available model. I have a fine-tuned InternVL mini model and am referencing it in the pipeline like below:

import nest_asyncio
nest_asyncio.apply()
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'home/joy/InternVL/my-local-model'
chat_template_config = ChatTemplateConfig('internvl-internlm2')
pipe = pipeline(model, chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    "https://templates.invoicehome.com/invoice-template-us-neat-750px.png"
]
prompts = [('fill the json values by extracting the appropriate values from the image. {"invoice number":"","ship to":"","company":"","amount":""}', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

When I try to run inference with lmdeploy, I hit this error:

UnboundLocalError: cannot access local variable 'head_num' where it is not associated with a value

Reproduction

  1. Download a model locally.
  2. Reference the local model path in pipeline (a quick sanity check of the folder contents is sketched below).
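
For reference, a minimal sketch (the directory path is a placeholder, not taken from this issue) to confirm that the local folder contains the usual Hugging Face files lmdeploy expects before passing it to pipeline:

import os

# Hypothetical path; replace with your own fine-tuned model directory.
model_dir = '/path/to/local/internvl-mini'

# A converted checkpoint normally ships a config, tokenizer files and weight shards.
for name in ['config.json', 'tokenizer_config.json']:
    print(name, 'found' if os.path.exists(os.path.join(model_dir, name)) else 'MISSING')

weights = [f for f in os.listdir(model_dir)
           if f.endswith('.safetensors') or f.endswith('.bin')]
print('weight files:', weights or 'MISSING')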

Environment

sys.platform: linux
Python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A10G
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.3
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0

Error traceback

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[1], line 8
      6 model = 'internVL-new/files/model'
      7 chat_template_config = ChatTemplateConfig('internvl-internlm2')
----> 8 pipe = pipeline(model, chat_template_config=chat_template_config,
      9                 backend_config=TurbomindEngineConfig(session_len=8192))
     11 image_urls=[
     12     "https://templates.invoicehome.com/invoice-template-us-neat-750px.png"
     13 ]
     14 prompts = [('fill the json values by extracting the appropriate values from the image. {"invoice number":"","ship to":"","company":"","amount":""}', load_image(img_url)) for img_url in image_urls]

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/api.py:89, in pipeline(model_path, model_name, backend_config, chat_template_config, log_level, **kwargs)
     86 else:
     87     tp = 1 if backend_config is None else backend_config.tp
---> 89 return pipeline_class(model_path,
     90                       model_name=model_name,
     91                       backend=backend,
     92                       backend_config=backend_config,
     93                       chat_template_config=chat_template_config,
     94                       tp=tp,
     95                       **kwargs)

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/serve/vl_async_engine.py:24, in VLAsyncEngine.__init__(self, model_path, **kwargs)
     20 backend_config = kwargs.get('backend_config', None)
     21 self.vl_encoder = ImageEncoder(model_path,
     22                                vision_config,
     23                                backend_config=backend_config)
---> 24 super().__init__(model_path, **kwargs)
     25 if self.model_name == 'base':
     26     raise RuntimeError(
     27         'please specify chat template as guided in https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#set-chat-template'  # noqa: E501
     28     )

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py:189, in AsyncEngine.__init__(self, model_path, model_name, backend, backend_config, chat_template_config, tp, **kwargs)
    187 # build backend engine
    188 if backend == 'turbomind':
--> 189     self._build_turbomind(model_path=model_path,
    190                           backend_config=backend_config,
    191                           tp=tp,
    192                           **kwargs)
    193 elif backend == 'pytorch':
    194     self._build_pytorch(model_path=model_path,
    195                         backend_config=backend_config,
    196                         **kwargs)

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/serve/async_engine.py:234, in AsyncEngine._build_turbomind(self, model_path, backend_config, tp, **kwargs)
    230 assert isinstance(backend_config, TurbomindEngineConfig), 'Please'\
    231     ' use TurbomindEngineConfig imported from lmdeploy.messages for ' \
    232     'turbomind backend'
    233 from lmdeploy import turbomind as tm
--> 234 self.engine = tm.TurboMind.from_pretrained(
    235     model_path, engine_config=backend_config, **kwargs)
    236 self.backend_config = backend_config
    237 self.hf_tm_cfg = self.engine.config

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py:342, in TurboMind.from_pretrained(cls, pretrained_model_name_or_path, engine_config, model_format, group_size, tp, **kwargs)
    340 model_source = get_model_source(pretrained_model_name_or_path)
    341 logger.info(f'model_source: {model_source}')
--> 342 return cls(model_path=pretrained_model_name_or_path,
    343            engine_config=engine_config,
    344            model_source=model_source,
    345            model_format=model_format,
    346            group_size=group_size,
    347            tp=tp,
    348            **kwargs)

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py:144, in TurboMind.__init__(self, model_path, engine_config, model_source, model_name, model_format, group_size, tp, **kwargs)
    141         model_path = get_model(model_path, engine_config.download_dir,
    142                                engine_config.revision)
    143     self.tokenizer = Tokenizer(model_path)
--> 144     self.model_comm = self._from_hf(model_source=model_source,
    145                                     model_path=model_path,
    146                                     engine_config=engine_config)
    148 self.session_len = self.config.session_len
    149 self.eos_id = self.tokenizer.eos_token_id

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/turbomind.py:237, in TurboMind._from_hf(self, model_source, model_path, engine_config)
    232 output_model_name, cfg = get_output_model_registered_name_and_config(
    233     model_path=model_path,
    234     model_format=engine_config.model_format,
    235     group_size=0)
    236 cfg.update_from_engine_config(engine_config)
--> 237 output_model = OUTPUT_MODELS.get(output_model_name)(
    238     input_model=input_model, cfg=cfg, to_file=False, out_dir='')
    240 self.config = output_model.cfg
    241 self.config.model_name = match_name \
    242     if match_name is not None else 'base'

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/fp.py:26, in TurbomindModel.__init__(self, input_model, cfg, to_file, out_dir)
     21 def __init__(self,
     22              input_model: BaseInputModel,
     23              cfg: TurbomindModelConfig,
     24              to_file: bool = True,
     25              out_dir: str = ''):
---> 26     super().__init__(input_model, cfg, to_file, out_dir)

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py:168, in BaseOutputModel.__init__(self, input_model, cfg, to_file, out_dir)
    166 self.cfg = cfg
    167 if not cfg.valid:
--> 168     self.cfg = self.get_config(cfg)
    169 assert self.cfg.valid
    170 self.to_file = to_file

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/fp.py:30, in TurbomindModel.get_config(self, cfg)
     28 def get_config(self, cfg: TurbomindModelConfig):
     29     """Get turbomind config."""
---> 30     final_cfg = super().get_config(cfg).__dict__
     32     # attn_bias, inter_size
     33     visit = False

File ~/.conda/envs/new_env/lib/python3.11/site-packages/lmdeploy/turbomind/deploy/target_model/base.py:192, in BaseOutputModel.get_config(self, cfg)
    190         head_num = dim // cfg.size_per_head
    191         break
--> 192 final_cfg.update(dict(head_num=head_num, vocab_size=_vocab_size))
    193 return TurbomindModelConfig.from_dict(final_cfg, allow_none=True)
UnboundLocalError: cannot access local variable 'head_num' where it is not associated with a value
lvhan028 commented 4 months ago

It indicates that lmdeploy didn't find the token_embeddings weights. If you can reproduce this issue with the official InternVL mini model, please kindly let us know the model's Hugging Face repo_id.

# head_num, vocab_size
for bin in self.input_model.bins():
    emb = bin.tok_embeddings()
    if emb is not None:
        _vocab_size, dim = emb.shape
        head_num = dim // cfg.size_per_head
        break
final_cfg.update(dict(head_num=head_num, vocab_size=_vocab_size))
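
As a quick check against the snippet above, one can list the tensor names in the checkpoint to see whether a token-embedding weight is present at all; if none is found, head_num is never assigned and the UnboundLocalError follows. This is only a sketch, assuming the weights are stored as safetensors shards (the path and key substrings are assumptions):

import glob
import os
from safetensors import safe_open

# Hypothetical path to the fine-tuned InternVL checkpoint.
model_dir = '/path/to/finetuned-internvl-mini'

# Print every tensor whose name looks like a token-embedding weight.
for shard in sorted(glob.glob(os.path.join(model_dir, '*.safetensors'))):
    with safe_open(shard, framework='pt') as f:
        for key in f.keys():
            if 'tok_embeddings' in key or 'embed_tokens' in key:
                print(os.path.basename(shard), key, f.get_slice(key).get_shape())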
shiva-vardhineedi commented 4 months ago

Hi @lvhan028, thanks for the reply. The locally downloaded base model works fine; the issue is with my LoRA fine-tuned model. I uploaded my locally fine-tuned model to Hugging Face. I am able to successfully load the model, but no generations are produced when using it with LMDeploy. Can you please let me know what might be wrong with this fine-tuned model? It is not generating a response; I am getting empty text as the response, as shown below:

[screenshot: the pipeline returns an empty text response]

irexyc commented 3 months ago

@shiva-vardhineedi

Have you tried running inference on your HF model with the transformers API?

I ran inference on your model with the transformers API, but the result is weird.

Below is my test code; the output is empty:

from transformers import AutoTokenizer, AutoModel

m = AutoModel.from_pretrained('shivavardhineedi/mini_internVL_eval', trust_remote_code=True).cuda().bfloat16()
tok = AutoTokenizer.from_pretrained('shivavardhineedi/mini_internVL_eval', trust_remote_code=True)

m.chat(tok, None, 'hello', dict(max_new_tokens=100))
# empty output

I printed lines 278 and 281 in https://huggingface.co/shivavardhineedi/mini_internVL_eval/blob/main/modeling_internvl_chat.py:

tensor([[    1, 92543,  9081,   364,  2770,   657,   589, 15358, 17993,  6843,
           963,   505,  4576, 11146,   451, 60628, 60384, 60721, 62442, 60752,
           699, 92542, 92543,  1008,   364, 15115, 92542, 92543,   525, 11353,
           364]], device='cuda:0')

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]], device='cuda:0')
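
An all-zero output like the second tensor would decode to an empty string once special tokens are skipped, which would match the empty response. A small sketch to confirm (it assumes id 0 maps to a special token in this tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('shivavardhineedi/mini_internVL_eval',
                                    trust_remote_code=True)

# Decode a run of token id 0, as produced by the generation above.
generated_ids = [0] * 100
print(repr(tok.decode(generated_ids, skip_special_tokens=True)))
print(tok.convert_ids_to_tokens(0))  # inspect what id 0 actually is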
shiva-vardhineedi commented 2 months ago

@irexyc I used LoRA for fine-tuning and uploaded the model there after copying some missing files from the original model. Why am I getting this?
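
The thread does not state a confirmed fix, but one common way to avoid missing or inconsistent weights in an exported LoRA fine-tune is to merge the adapter back into the base model with PEFT before uploading, rather than copying files from the original repo. A minimal sketch; the base repo id, paths, and dtype are assumptions, not details from this issue:

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Hypothetical paths: base InternVL mini model and the LoRA adapter directory.
base_path = 'OpenGVLab/Mini-InternVL-Chat-2B-V1-5'
adapter_path = '/path/to/lora-adapter'
merged_path = '/path/to/merged-model'

base = AutoModel.from_pretrained(base_path, torch_dtype=torch.bfloat16,
                                 trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)

# Merge the LoRA deltas into the base weights so the saved checkpoint contains
# every tensor (including the token embeddings) instead of adapter-only files.
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

merged.save_pretrained(merged_path)
tok.save_pretrained(merged_path)

The merged directory can then be passed to lmdeploy's pipeline as a local path or uploaded to the Hub.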