InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Support for OpenGVLab/InternVL-Chat-V1-2-Plus #1495

Closed · Iven2132 closed this issue 4 months ago

Iven2132 commented 4 months ago

### Motivation

OpenGVLab/InternVL-Chat-V1-2-Plus is an open-source alternative to GPT-4V. Can we please have support for it?

### Related resources

https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus

irexyc commented 4 months ago

We already support InternVL-Chat-V1-1, InternVL-Chat-V1-2 and InternVL-Chat-V1-2-Plus.

There are some docs about usage: https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/vl_pipeline.md https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/api_server_vl.md

All you have to do is change the model id from liuhaotian/llava-v1.6-vicuna-7b to OpenGVLab/InternVL-Chat-V1-2-Plus in the docs.

We use the folder name to map the chat template. Therefore, if you download the model from Hugging Face yourself, it's better to keep the folder name; otherwise you have to set ChatTemplateConfig correctly when you initialize the pipeline or server.
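
For example, a minimal sketch (hedged — the local path below is hypothetical, and the template name is the one used later in this thread):

from lmdeploy import pipeline, ChatTemplateConfig

# If the folder name can't be used to infer the chat template, set it explicitly.
pipe = pipeline('/path/to/your/InternVL-Chat-V1-2-Plus',  # hypothetical local folder
                chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))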

Iven2132 commented 4 months ago

@lvhan028 I'm getting this error: ArgumentError: Please set model_name for /teamspace/studios/this_studio/.cache/huggingface/hub/models--OpenGVLab--InternVL-Chat-V1-2-Plus/snapshots/973b6d6b624d87bfb7e6371680da1c4e2c51a24f

Can you please help me?

my code:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
irexyc commented 4 months ago

@Iven2132

It seems they changed their repo name recently, so our folder-name matching strategy failed.

https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2/commit/62d21d2ee98a6fabdea460d17b05e8040f217344

You can pass the chat template when initializing the pipeline, like this:

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus', chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
Iven2132 commented 4 months ago

I'm getting RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:417

irexyc commented 4 months ago

I can't reproduce the error.

Could you enable the logger with

export TM_LOG_LEVEL=DEBUG
export TM_DEBUG_LEVEL=DEBUG

and then post all the log lines printed when creating the pipeline:

from lmdeploy import pipeline, ChatTemplateConfig
pipe = pipeline('/home/chenxin/InternVL-Chat-V1-5', chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
lvhan028 commented 4 months ago

@Iven2132 what's your GPU model?

Iven2132 commented 4 months ago

I'm using A100 80 GB.

Iven2132 commented 4 months ago

@irexyc I'm getting this error now:

/usr/local/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Fetching 35 files: 100%|██████████| 35/35 [05:43<00:00,  9.83s/it]
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1510, in _LazyModule._get_module(self, module_name)
   1509 try:
-> 1510     return importlib.import_module("." + module_name, self.__name__)
   1511 except Exception as e:

File /usr/local/lib/python3.11/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1204, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1176, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1147, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:690, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:940, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File /usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:55
     54 if is_flash_attn_2_available():
---> 55     from flash_attn import flash_attn_func, flash_attn_varlen_func
     56     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa

File /usr/local/lib/python3.11/site-packages/flash_attn/__init__.py:3
      1 __version__ = "2.5.7"
----> 3 from flash_attn.flash_attn_interface import (
      4     flash_attn_func,
      5     flash_attn_kvpacked_func,
      6     flash_attn_qkvpacked_func,
      7     flash_attn_varlen_func,
      8     flash_attn_varlen_kvpacked_func,
      9     flash_attn_varlen_qkvpacked_func,
     10     flash_attn_with_kvcache,
     11 )

File /usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py:10
      8 # isort: off
      9 # We need to import the CUDA kernels after importing torch
---> 10 import flash_attn_2_cuda as flash_attn_cuda
     12 # isort: on

ImportError: /usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[11], line 4
      1 from lmdeploy import pipeline, ChatTemplateConfig
      2 from lmdeploy.vl import load_image
----> 4 pipe = pipeline(
      5     'OpenGVLab/InternVL-Chat-V1-2-Plus',
      6     chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2')
      7 )
      9 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
     10 response = pipe(('describe this image', image))

File /usr/local/lib/python3.11/site-packages/lmdeploy/api.py:94, in pipeline(model_path, model_name, backend_config, chat_template_config, log_level, **kwargs)
     91 else:
     92     tp = 1 if backend_config is None else backend_config.tp
---> 94 return pipeline_class(model_path,
     95                       model_name=model_name,
     96                       backend=backend,
     97                       backend_config=backend_config,
     98                       chat_template_config=chat_template_config,
     99                       tp=tp,
    100                       **kwargs)

File /usr/local/lib/python3.11/site-packages/lmdeploy/serve/vl_async_engine.py:16, in VLAsyncEngine.__init__(self, model_path, **kwargs)
     15 def __init__(self, model_path: str, **kwargs) -> None:
---> 16     self.vl_encoder = ImageEncoder(model_path)
     17     super().__init__(model_path, **kwargs)
     18     if self.model_name == 'base':

File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/engine.py:68, in ImageEncoder.__init__(self, model_path, max_batch_size)
     67 def __init__(self, model_path: str, max_batch_size: int = 16):
---> 68     self.model = load_vl_model(model_path)
     69     self.max_batch_size = max_batch_size
     70     self.loop = asyncio.new_event_loop()

File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/model/builder.py:42, in load_vl_model(model_path)
     40     return Xcomposer2VisionModel(model_path)
     41 if arch == 'InternVLChatModel':
---> 42     return InternVLVisionModel(model_path)
     43 if arch == 'MiniGeminiLlamaForCausalLM':
     44     return MiniGeminiVisionModel(model_path)

File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/model/internvl.py:19, in InternVLVisionModel.__init__(self, model_path, device)
     17 self.model_path = model_path
     18 self.device = device
---> 19 self.build_model()

File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/model/internvl.py:27, in InternVLVisionModel.build_model(self)
     24 with init_empty_weights():
     25     config = AutoConfig.from_pretrained(self.model_path,
     26                                         trust_remote_code=True)
---> 27     model = AutoModel.from_config(config, trust_remote_code=True)
     28     del model.language_model
     29     model.half()

File /usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:428, in _BaseAutoModelClass.from_config(cls, config, **kwargs)
    426 else:
    427     repo_id = config.name_or_path
--> 428 model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
    429 if os.path.isdir(config._name_or_path):
    430     model_class.register_for_auto_class(cls.__name__)

File /usr/local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py:501, in get_class_from_dynamic_module(class_reference, pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, code_revision, **kwargs)
    488 # And lastly we get the class inside our newly created module
    489 final_module = get_cached_module_file(
    490     repo_id,
    491     module_file + ".py",
   (...)
    499     repo_type=repo_type,
    500 )
--> 501 return get_class_in_module(class_name, final_module)

File /usr/local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py:201, in get_class_in_module(class_name, module_path)
    199 name = os.path.normpath(module_path).replace(".py", "").replace(os.path.sep, ".")
    200 module_path = str(Path(HF_MODULES_CACHE) / module_path)
--> 201 module = importlib.machinery.SourceFileLoader(name, module_path).load_module()
    202 return getattr(module, class_name)

File <frozen importlib._bootstrap_external>:605, in _check_name_wrapper(self, name, *args, **kwargs)

File <frozen importlib._bootstrap_external>:1120, in load_module(self, fullname)

File <frozen importlib._bootstrap_external>:945, in load_module(self, fullname)

File <frozen importlib._bootstrap>:290, in _load_module_shim(self, fullname)

File <frozen importlib._bootstrap>:721, in _load(spec)

File <frozen importlib._bootstrap>:690, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:940, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File ~/.cache/huggingface/modules/transformers_modules/973b6d6b624d87bfb7e6371680da1c4e2c51a24f/modeling_internvl_chat.py:13
     11 from torch import nn
     12 from torch.nn import CrossEntropyLoss
---> 13 from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer
     14 from transformers.generation.logits_process import LogitsProcessorList
     15 from transformers.generation.stopping_criteria import StoppingCriteriaList

File <frozen importlib._bootstrap>:1229, in _handle_fromlist(module, fromlist, import_, recursive)

File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1501, in _LazyModule.__getattr__(self, name)
   1499 elif name in self._class_to_module.keys():
   1500     module = self._get_module(self._class_to_module[name])
-> 1501     value = getattr(module, name)
   1502 else:
   1503     raise AttributeError(f"module {self.__name__} has no attribute {name}")

File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1500, in _LazyModule.__getattr__(self, name)
   1498     value = self._get_module(name)
   1499 elif name in self._class_to_module.keys():
-> 1500     module = self._get_module(self._class_to_module[name])
   1501     value = getattr(module, name)
   1502 else:

File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1512, in _LazyModule._get_module(self, module_name)
   1510     return importlib.import_module("." + module_name, self.__name__)
   1511 except Exception as e:
-> 1512     raise RuntimeError(
   1513         f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
   1514         f" traceback):\n{e}"
   1515     ) from e

RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
irexyc commented 4 months ago

> ImportError: /usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

I think it is an environment issue. I found this: https://github.com/Dao-AILab/flash-attention/issues/620
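
That undefined symbol usually indicates a flash-attn wheel built against a different torch version than the one installed. A minimal diagnostic sketch (the fix itself — e.g. reinstalling flash-attn with pip install flash-attn --no-build-isolation — is a hedged suggestion, not something verified here):

import torch

# Print the torch / CUDA versions the environment actually provides;
# the installed flash-attn wheel must have been built against these.
print('torch:', torch.__version__)
print('cuda (torch build):', torch.version.cuda)

# Importing flash_attn reproduces the undefined-symbol error when the
# wheel was compiled against a different torch ABI.
try:
    import flash_attn
    print('flash-attn:', flash_attn.__version__)
except ImportError as e:
    print('flash-attn import failed:', e)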

Iven2132 commented 4 months ago

But I have the latest version of flash-attention. Can you run the model in a notebook and share it with me?

I wonder if InternLM/lmdeploy uses TensorRT + Triton?

irexyc commented 4 months ago

For a vision-language model, there are two parts: vision and language.

For the language model (LLM), lmdeploy has two backends: turbomind (developed upon NVIDIA/FasterTransformer) and pytorch (developed upon PyTorch and openai/triton).

For the vision model, we reuse their code to do the inference.

From your log, the error happens in AutoModel.from_config. Therefore, I think it's an environment problem. You can check whether you can run their demo with transformers: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus#model-usage
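
For reference, a minimal sketch of loading the model with plain transformers to sanity-check the environment (this mirrors the usual trust_remote_code pattern; check the model card for the exact, up-to-date snippet):

import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-2-Plus'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# If this load fails, the problem is in the transformers / flash-attn
# environment rather than in lmdeploy itself.
model = AutoModel.from_pretrained(path,
                                  torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()
print(type(model))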

Iven2132 commented 4 months ago

I got an error from that too. How do I set up the environment? Do you have a notebook for that?

Iven2132 commented 4 months ago

@irexyc I got this error: ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You passed torch.float32, this might lead to unexpected behaviour.

Any fix?

Iven2132 commented 4 months ago

@irexyc Any idea what I can do?

irexyc commented 4 months ago

@Iven2132

It seems you fixed the flash_attn import issue.

You can upgrade transformers with pip install -U transformers. Since 4.37.0, they print a warning message instead of raising an error: https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/modeling_utils.py#L1509-L1513

Iven2132 commented 4 months ago

@irexyc Yeah, it does work now, but it's super slow: it takes 5-6 minutes to get the output. Is there any way to load the model faster, for faster inference?

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus', chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
irexyc commented 4 months ago

Does the 5-6 minutes include loading the model? If so, why do you init the pipeline multiple times? You can init it once and reuse the pipeline, or you can start a server and use the client API. If the time doesn't include the loading time, please share the prompt and image and I will check it tomorrow.

BTW, it takes about 40 seconds to init the pipeline; how much memory does your machine have?
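
A minimal sketch of the "init once, reuse" pattern mentioned above (the prompts are only illustrative):

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image

# Build the pipeline a single time; model loading dominates the first call.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus',
                chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# Reuse the same pipeline object for every request instead of recreating it.
for prompt in ['describe this image', 'what animal is shown?']:
    print(pipe((prompt, image)))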

Iven2132 commented 4 months ago

@irexyc I'm using Modal for deployment, running an A100 80 GB. Can you share your code showing how to deploy properly? Is it possible to get the output in 4-5 seconds on an A100 or H100?

irexyc commented 4 months ago

The time cost is related to image processing and to the number of input and output tokens. And you should not init the model every time you receive a request.

A reasonable approach is to initialize the model once when the program starts, and use the initialized model to handle received requests.

lmdeploy.pipeline is for offline usage and pipeline.__call__ is not thread safe. For your situation, you could use lmdeploy.api.serve and lmdeploy.api.client.

You can edit the code based on this example: https://modal.com/docs/examples/trtllm_llama. In the load function, use lmdeploy.api.serve to init the model. In the generate function, use lmdeploy.api.client to communicate with the model created by lmdeploy.api.serve.

Iven2132 commented 4 months ago

Thanks! Can I please get the docs for using lmdeploy.api.client and lmdeploy.api.serve?

irexyc commented 4 months ago

We don't have docs about it, but the usage is similar to serve and client in DeepSpeed-MII: https://github.com/microsoft/DeepSpeed-MII/blob/main/docs/source/quick-start.rst#run-a-persistent-deployment

In short, a server is started in the background (in a new process), and the client communicates with that server.

lvhan028 commented 4 months ago

We have a user guide; please check this one: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html
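
As a rough illustration of that serving workflow (the CLI invocation and port are assumptions based on the guide's general pattern — lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-2-Plus --server-port 23333 — so double-check them against the docs), the running server can then be queried through its OpenAI-compatible endpoint:

# Query the OpenAI-compatible endpoint exposed by the api_server
# (assumes the `openai` Python package is installed and the server runs locally).
from openai import OpenAI

client = OpenAI(api_key='none', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

resp = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe the image please'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(resp.choices[0].message.content)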

Iven2132 commented 4 months ago

I tried that, but I couldn't figure it out. Can you please help me set things up?

import os
import time
import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=1)

vllm_image = (
    modal.Image.micromamba()
    .apt_install("git")
    .micromamba_install(
        "cudatoolkit=11.8",
        "cudnn=8.1.0",
        "cuda-nvcc",
        channels=["conda-forge", "nvidia"],
    )
    .pip_install(
        "lmdeploy==0.4.0",
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "vllm==0.4.1",
        "peft==0.8.2",
        "lmdeploy==0.4.0",
    )  
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .env({"CUDA_HOME": "/usr/local/cuda"})
)

app = modal.App("test-app") 

@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    def start_engine(self):
        from lmdeploy import pipeline, ChatTemplateConfig
        from lmdeploy.vl import load_image
        import lmdeploy

        client = lmdeploy.api.client("OpenGVLab/InternVL-Chat-V1-2-Plus", chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
        print(client)

    @modal.method()
    async def completion_stream(self, user_question):
        from lmdeploy.vl import load_image
        import lmdeploy

        image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
        response = lmdeploy.api.serve("OpenGVLab/InternVL-Chat-V1-2-Plus", chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
        print(response)
irexyc commented 4 months ago

Init the server in the model initialization and use the client for user requests.

Something like this:

...

def start_engine(self):
    from lmdeploy import serve, ChatTemplateConfig
    self.server = serve('OpenGVLab/InternVL-Chat-V1-2-Plus',
      chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'),
      server_name='0.0.0.0',
      server_port=23333)

...

async def completion_stream(self, user_question):
    ...
    from lmdeploy import client
    handle = client(api_server_url='http://0.0.0.0:23333')
    model_name = handle.available_models[0]
    # construct message, openai format
    messages = {
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Describe the image please',
            }, 
            {
                'type': 'image_url',
                'image_url': 
                {
                        'url':
                        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
                },
            }
        ],
    }

    # handle outputs, you should process and output the outputs.
    outputs = handle.chat_completions_v1(model=model_name, messages=messages)
    for out in outputs:
      print(out)
    ...
Iven2132 commented 4 months ago

Thanks 🙏 But I'm not sure why I'm getting this "input should be a valid string" error:

Error:

   "detail": [
      {
        "type": "string_type",
        "loc": [
          "body",
          "messages",
          "str"
        ],
        "msg": "Input should be a valid string",
        "input": {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "Describe the image please"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg"
              }
            }
          ]
        }
      },
      {
        "type": "list_type",
        "loc": [
          "body",
          "messages",
          "list[dict[str,any]]"
        ],
        "msg": "Input should be a valid list",
        "input": {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "Describe the image please"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg"
              }
            }
          ]
        }
      }
    ]
irexyc commented 4 months ago

@Iven2132

change this line

outputs = handle.chat_completions_v1(model=model_name, messages=[messages])
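
In context, the corrected call looks roughly like this (messages is the single dict built above, now wrapped in a list):

# chat_completions_v1 expects a list of message dicts, not a single dict.
outputs = handle.chat_completions_v1(model=model_name, messages=[messages])
for out in outputs:
    print(out)
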
Iven2132 commented 4 months ago

@irexyc It was a format issue; I fixed that, but now the same problem is back: it's still slow and takes 2-3 minutes to give the output. Can you suggest what to do?

It's loading the model again from the start.

irexyc commented 4 months ago

I am not familiar with Modal. But according to their docs, the model should not be initialized multiple times during the container_idle_timeout:

@app.cls(
    gpu=GPU_CONFIG,
    secrets=[modal.Secret.from_name("huggingface-secret")],
    container_idle_timeout=10 * MINUTES,
)
class Model:
    @modal.enter()
    def load(self):
        """Loads the TRT-LLM engine and configures our tokenizer.

        The @enter decorator ensures that it runs only once per container, when it starts."""
Iven2132 commented 4 months ago

Yeah, thank you very much! You are so knowledgeable and helpful. Why don't you work at Modal? They are hiring (https://jobs.ashbyhq.com/modal); I can connect you with the Modal team.

lvhan028 commented 4 months ago

Let’s concentrate on discussing the issue, rather than attempting to lure away my team member right before me

Iven2132 commented 4 months ago

The issue has been fixed, thank you guys! Just one last question: does lmdeploy support all models that fall under the supported model architectures in the README?

Iven2132 commented 4 months ago

Also, I got an "OpenGVLab--InternVL-Chat-V1-5 does not appear to have a file named preprocessor_config.json" error with the OpenGVLab/InternVL-Chat-V1-5 model, so what's the fix for that? I don't get any error with OpenGVLab/InternVL-Chat-V1-2-Plus, so what's the problem with V1-5? I am using internvl-zh-hermes2 as the chat template.

irexyc commented 4 months ago

> Does lmdeploy support all models that fall under the supported model architectures in the README?

Lmdeploy should support it if the modeling_xxx.py files are the same as in the official repo. Another chat template may need to be added if a different one from the official repo was used when training or tuning the model.

InternVL-Chat-V1-5 has a different vision model, and its base LLM is internlm2, which is different from V1-2 or V1-2-Plus. Therefore, some adaptation code has to be written. This PR supports InternVL-Chat-V1-5.

To use this PR, you should build lmdeploy yourself according to this doc, or install lmdeploy 0.4.0 and replace the installed Python code with the latest main branch.
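
A hedged sketch of the second option ("install 0.4.0 and replace the installed Python code with the latest main branch") — the checkout path is hypothetical, and this only works because the compiled turbomind parts stay from the 0.4.0 wheel:

import pathlib
import shutil

import lmdeploy

installed = pathlib.Path(lmdeploy.__file__).parent            # .../site-packages/lmdeploy
checkout = pathlib.Path('/path/to/lmdeploy-main/lmdeploy')    # hypothetical git checkout of main
# Overlay the pure-Python sources from main over the installed 0.4.0 package.
shutil.copytree(checkout, installed, dirs_exist_ok=True)
print('overlaid', checkout, '->', installed)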

Iven2132 commented 4 months ago

I don't think I can run Docker inside Modal, and build_all_wheel.sh runs some Docker commands. Will you release a new version of the package soon?

irexyc commented 4 months ago

We will release 0.4.1 on May 8

Iven2132 commented 4 months ago

OK. Can you tell me: does InternVL-Chat-V1-5 require flash attention? Because on Modal I can use Flash Attention or Docker at the same time.

Or maybe you can tell me how to set export NCCL_ROOT_DIR=/path/to/nccl/build and export NCCL_LIBRARIES=/path/to/nccl/build/lib.

Iven2132 commented 4 months ago

@irexyc Just a note of clarification: Modal doesn't use Docker and I can't run Docker inside Modal. They run containers that largely comply with the OCI image format (aka "Docker image"), but the runtime Modal uses isn't Docker (they run either runc [the runtime Docker uses] or gVisor). So I will not be able to run anything that requires Docker inside of Modal. That is, I can't run anything that calls the docker program.

So what are some other ways to use the latest lmdeploy build? I tried the "Build in localhost" option, but then I got a lot of path issues.

irexyc commented 4 months ago

flash_attn does nothing for inference here, as we actually run inference for the LLM part with lmdeploy (turbomind backend) instead of transformers. But we use transformers to load the vision model, and in the config.json file the line "attn_implementation": "flash_attention_2" will ask you to install flash_attn when you load the model with transformers. You can change this line to "attn_implementation": "eager" to remove the need for flash_attn.
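
A hedged sketch of that config.json edit on a local copy of the model (the path is hypothetical, and the key may live at a different level of the config, so adjust accordingly):

import json
import pathlib

cfg_path = pathlib.Path('/path/to/local/InternVL-Chat-V1-2-Plus/config.json')  # hypothetical local copy
cfg = json.loads(cfg_path.read_text())
# Switch the attention implementation so transformers no longer requires flash_attn.
cfg['attn_implementation'] = 'eager'   # was "flash_attention_2"
cfg_path.write_text(json.dumps(cfg, indent=2))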

By 'build lmdeploy yourself' I did not mean building lmdeploy in Modal, but building the lmdeploy whl package on your local machine and installing the built whl on Modal. We suggest building the whl in docker, so you won't meet the path issues.

Or you can fork our repo and use the GitHub action to compile the latest code.

The action will fail before the publish stage (because you haven't set the PyPI token), and when it fails, you can download the whl package (artifact) like this link.

Iven2132 commented 4 months ago

I'm confused. I forked the repository, used GitHub Actions to compile the latest code, and downloaded the built wheel package (artifact) from the failed action. What do I do after that? Will I still have to build the package?

Is there a Dockerfile that will build lmdeploy? I can run a single Dockerfile in Modal: https://modal.com/docs/guide/custom-container#bring-your-own-image-definition-with-from_dockerfile

Iven2132 commented 4 months ago

@irexyc BTW, if you want to play with Modal, they are offering $30 of free credits too. Also, as I'm on a Mac, what's the NCCL_ROOT_DIR path? I can see three cuda folders in my /usr/local folder (cuda, cuda-11, cuda-11.8), but when I check nvcc --version it says Build cuda_11.8.r11.8/compiler.31833905_0ff. So what CUDA version am I running exactly?

Iven2132 commented 4 months ago

Hi, I made a whl file (here it is), but when I download that file and install it using pip it gives an error: ERROR: lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64-XlFsI2YVNbXef3NoivrwF6jnEJNJUo.whl is not a valid wheel filename.

here is that file: https://eiv2wqohq4a67zye.public.blob.vercel-storage.com/lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64-XlFsI2YVNbXef3NoivrwF6jnEJNJUo.whl

I'm on Python 3.10
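
One likely cause (an assumption, since the file itself wasn't inspected): the blob store appended a random suffix, which breaks the wheel filename pattern pip expects (name-version-pythontag-abitag-platformtag.whl). A minimal sketch of renaming it back before installing:

import pathlib

# Drop the random suffix so pip recognizes the wheel filename again,
# then install with: pip install lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64.whl
src = pathlib.Path('lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64-XlFsI2YVNbXef3NoivrwF6jnEJNJUo.whl')
src.rename('lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64.whl')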

Iven2132 commented 4 months ago

Hi @irexyc and @lvhan028, I managed to install the build. But can you tell me the proper way to do few-shot prompting (also called in-context learning)? How do I give the previous context?

Here is my code, which is currently not working:

from lmdeploy import client
handle = client(api_server_url='http://0.0.0.0:23333')
model_name = handle.available_models[0]
messages: [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Where was this image taken?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "London"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Where was this image taken?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "New York City"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Do you think all the previous conversation we had, your answer were correct or not? tell me all the correct answer from our revious conversation"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
          }
        }
      ]
    }
],
outputs = handle.chat_completions_v1(
    model=model_name, messages=messagess)
for out in outputs:
    return out
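
For reference, a minimal corrected sketch of the same idea (the main fixes: messages must be assigned with =, and the list is passed as messages=messages; only two turns are shown, and whether the model/template handles multiple images across turns is something to verify):

from lmdeploy import client

handle = client(api_server_url='http://0.0.0.0:23333')
model_name = handle.available_models[0]

# Few-shot / in-context examples are just earlier turns in the messages list.
messages = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Where was this image taken?'},
        {'type': 'image_url', 'image_url': {
            'url': 'https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg'}},
    ]},
    {'role': 'assistant', 'content': 'New York City'},
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Where was this image taken?'},
        {'type': 'image_url', 'image_url': {
            'url': 'https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg'}},
    ]},
]

outputs = handle.chat_completions_v1(model=model_name, messages=messages)
for out in outputs:
    print(out)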