Closed: Iven2132 closed this issue 4 months ago
We have already supported InternVL-Chat-V1-1, InternVL-Chat-V1-2 and InternVL-Chat-V1-2-Plus.
There are some docs about usage: https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/vl_pipeline.md https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/api_server_vl.md
All you have to do is change the model id in the docs from liuhaotian/llava-v1.6-vicuna-7b to OpenGVLab/InternVL-Chat-V1-2-Plus.
We use the folder name to map the chat template. Therefore, if you download the model yourself from Hugging Face, it's better to keep the folder name; otherwise you have to set ChatTemplateConfig correctly when you init the pipeline or server.
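For illustration, a minimal sketch of passing ChatTemplateConfig explicitly (the local path is a placeholder; the template name is the one that applies to InternVL-Chat-V1-2-Plus later in this thread):
from lmdeploy import pipeline, ChatTemplateConfig

# placeholder path: wherever you downloaded/renamed the model folder
pipe = pipeline('/path/to/InternVL-Chat-V1-2-Plus',
                chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))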
@lvhan028 I'm getting this error: ArgumentError: Please set model_name for /teamspace/studios/this_studio/.cache/huggingface/hub/models--OpenGVLab--InternVL-Chat-V1-2-Plus/snapshots/973b6d6b624d87bfb7e6371680da1c4e2c51a24f
Can you please help me?
my code:
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
@Iven2132
It seems they changed their repo name recently, so our folder-name matching strategy failed.
https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2/commit/62d21d2ee98a6fabdea460d17b05e8040f217344
You can pass the chat template when you init the pipeline, like this:
from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus', chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
I'm getting RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:417 error
I can't reproduce the error.
Could you enable the logger with
export TM_LOG_LEVEL=DEBUG
export TM_DEBUG_LEVEL=DEBUG
and then share all the log lines printed when creating the pipeline:
from lmdeploy import pipeline, ChatTemplateConfig
pipe = pipeline('/home/chenxin/InternVL-Chat-V1-5', chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
@Iven2132 what's your GPU model?
I'm using A100 80 GB.
@irexyc I'm getting this error now:
/usr/local/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Fetching 35 files: 100%|██████████| 35/35 [05:43<00:00, 9.83s/it]
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1510, in _LazyModule._get_module(self, module_name)
1509 try:
-> 1510 return importlib.import_module("." + module_name, self.__name__)
1511 except Exception as e:
File /usr/local/lib/python3.11/importlib/__init__.py:126, in import_module(name, package)
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
File <frozen importlib._bootstrap>:1204, in _gcd_import(name, package, level)
File <frozen importlib._bootstrap>:1176, in _find_and_load(name, import_)
File <frozen importlib._bootstrap>:1147, in _find_and_load_unlocked(name, import_)
File <frozen importlib._bootstrap>:690, in _load_unlocked(spec)
File <frozen importlib._bootstrap_external>:940, in exec_module(self, module)
File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)
File /usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:55
54 if is_flash_attn_2_available():
---> 55 from flash_attn import flash_attn_func, flash_attn_varlen_func
56 from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
File /usr/local/lib/python3.11/site-packages/flash_attn/__init__.py:3
1 __version__ = "2.5.7"
----> 3 from flash_attn.flash_attn_interface import (
4 flash_attn_func,
5 flash_attn_kvpacked_func,
6 flash_attn_qkvpacked_func,
7 flash_attn_varlen_func,
8 flash_attn_varlen_kvpacked_func,
9 flash_attn_varlen_qkvpacked_func,
10 flash_attn_with_kvcache,
11 )
File /usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py:10
8 # isort: off
9 # We need to import the CUDA kernels after importing torch
---> 10 import flash_attn_2_cuda as flash_attn_cuda
12 # isort: on
ImportError: /usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[11], line 4
1 from lmdeploy import pipeline, ChatTemplateConfig
2 from lmdeploy.vl import load_image
----> 4 pipe = pipeline(
5 'OpenGVLab/InternVL-Chat-V1-2-Plus',
6 chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2')
7 )
9 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
10 response = pipe(('describe this image', image))
File /usr/local/lib/python3.11/site-packages/lmdeploy/api.py:94, in pipeline(model_path, model_name, backend_config, chat_template_config, log_level, **kwargs)
91 else:
92 tp = 1 if backend_config is None else backend_config.tp
---> 94 return pipeline_class(model_path,
95 model_name=model_name,
96 backend=backend,
97 backend_config=backend_config,
98 chat_template_config=chat_template_config,
99 tp=tp,
100 **kwargs)
File /usr/local/lib/python3.11/site-packages/lmdeploy/serve/vl_async_engine.py:16, in VLAsyncEngine.__init__(self, model_path, **kwargs)
15 def __init__(self, model_path: str, **kwargs) -> None:
---> 16 self.vl_encoder = ImageEncoder(model_path)
17 super().__init__(model_path, **kwargs)
18 if self.model_name == 'base':
File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/engine.py:68, in ImageEncoder.__init__(self, model_path, max_batch_size)
67 def __init__(self, model_path: str, max_batch_size: int = 16):
---> 68 self.model = load_vl_model(model_path)
69 self.max_batch_size = max_batch_size
70 self.loop = asyncio.new_event_loop()
File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/model/builder.py:42, in load_vl_model(model_path)
40 return Xcomposer2VisionModel(model_path)
41 if arch == 'InternVLChatModel':
---> 42 return InternVLVisionModel(model_path)
43 if arch == 'MiniGeminiLlamaForCausalLM':
44 return MiniGeminiVisionModel(model_path)
File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/model/internvl.py:19, in InternVLVisionModel.__init__(self, model_path, device)
17 self.model_path = model_path
18 self.device = device
---> 19 self.build_model()
File /usr/local/lib/python3.11/site-packages/lmdeploy/vl/model/internvl.py:27, in InternVLVisionModel.build_model(self)
24 with init_empty_weights():
25 config = AutoConfig.from_pretrained(self.model_path,
26 trust_remote_code=True)
---> 27 model = AutoModel.from_config(config, trust_remote_code=True)
28 del model.language_model
29 model.half()
File /usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:428, in _BaseAutoModelClass.from_config(cls, config, **kwargs)
426 else:
427 repo_id = config.name_or_path
--> 428 model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
429 if os.path.isdir(config._name_or_path):
430 model_class.register_for_auto_class(cls.__name__)
File /usr/local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py:501, in get_class_from_dynamic_module(class_reference, pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, code_revision, **kwargs)
488 # And lastly we get the class inside our newly created module
489 final_module = get_cached_module_file(
490 repo_id,
491 module_file + ".py",
(...)
499 repo_type=repo_type,
500 )
--> 501 return get_class_in_module(class_name, final_module)
File /usr/local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py:201, in get_class_in_module(class_name, module_path)
199 name = os.path.normpath(module_path).replace(".py", "").replace(os.path.sep, ".")
200 module_path = str(Path(HF_MODULES_CACHE) / module_path)
--> 201 module = importlib.machinery.SourceFileLoader(name, module_path).load_module()
202 return getattr(module, class_name)
File <frozen importlib._bootstrap_external>:605, in _check_name_wrapper(self, name, *args, **kwargs)
File <frozen importlib._bootstrap_external>:1120, in load_module(self, fullname)
File <frozen importlib._bootstrap_external>:945, in load_module(self, fullname)
File <frozen importlib._bootstrap>:290, in _load_module_shim(self, fullname)
File <frozen importlib._bootstrap>:721, in _load(spec)
File <frozen importlib._bootstrap>:690, in _load_unlocked(spec)
File <frozen importlib._bootstrap_external>:940, in exec_module(self, module)
File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)
File ~/.cache/huggingface/modules/transformers_modules/973b6d6b624d87bfb7e6371680da1c4e2c51a24f/modeling_internvl_chat.py:13
11 from torch import nn
12 from torch.nn import CrossEntropyLoss
---> 13 from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer
14 from transformers.generation.logits_process import LogitsProcessorList
15 from transformers.generation.stopping_criteria import StoppingCriteriaList
File <frozen importlib._bootstrap>:1229, in _handle_fromlist(module, fromlist, import_, recursive)
File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1501, in _LazyModule.__getattr__(self, name)
1499 elif name in self._class_to_module.keys():
1500 module = self._get_module(self._class_to_module[name])
-> 1501 value = getattr(module, name)
1502 else:
1503 raise AttributeError(f"module {self.__name__} has no attribute {name}")
File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1500, in _LazyModule.__getattr__(self, name)
1498 value = self._get_module(name)
1499 elif name in self._class_to_module.keys():
-> 1500 module = self._get_module(self._class_to_module[name])
1501 value = getattr(module, name)
1502 else:
File /usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py:1512, in _LazyModule._get_module(self, module_name)
1510 return importlib.import_module("." + module_name, self.__name__)
1511 except Exception as e:
-> 1512 raise RuntimeError(
1513 f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
1514 f" traceback):\n{e}"
1515 ) from e
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
ImportError: /usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
I think it is an environment issue, and I found this: https://github.com/Dao-AILab/flash-attention/issues/620
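A quick diagnostic sketch (my own suggestion, not a confirmed fix): the undefined-symbol error usually means the installed flash-attn wheel was built against a different torch than the one in the environment, so checking both versions is a reasonable first step.
# if the flash_attn import itself fails with the undefined-symbol error,
# the wheel does not match the installed torch/CUDA combination
import torch
print(torch.__version__, torch.version.cuda)
import flash_attn
print(flash_attn.__version__)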
But I have the latest version of flash-attention. Can you run the model in a notebook and share it with me?
I wonder if InternLM/lmdeploy uses TensorRT + Triton?
For vision language models, there are two parts: vision and language.
For the language model (LLM), lmdeploy has two backends: turbomind (developed upon NVIDIA/FasterTransformer) and pytorch (developed upon pytorch and openai/triton).
For the vision model, we reuse their code to do the inference.
From your log, the error happens on AutoModel.from_config. Therefore, I think it's an environment problem. You can check whether you can run their demo with transformers: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus#model-usage
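A minimal sketch of such a check, assuming the usual trust_remote_code loading pattern (the exact arguments on the model card may differ):
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-2-Plus'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# if this fails, the problem is in the transformers/flash_attn environment, not in lmdeploy
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()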
I got an error from that too. How do I set up the environment? Do you have a notebook for that?
@irexyc I got this error: ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You passed torch.float32, this might lead to unexpected behaviour.
Any fix?
@irexyc Any idea what can I do?
@Iven2132
It seems you fixed the flash_attn import issue.
You can upgrade your transformers with pip install -U transformers. Since 4.37.0, they print a warning message instead of raising an error:
https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/modeling_utils.py#L1509-L1513
@irexyc Yeah, it does work now, but it's super slow: it takes 5-6 minutes to get the output. Is there any way to load the model faster for quicker inference?
from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2-Plus', chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
Does the 5-6 minutes include loading the model? If so, why do you init the pipeline multiple times? You can init it once and reuse the pipeline, or you can start a server and use the client API. If the time doesn't include loading, please share the prompt and image and I will check it tomorrow.
BTW, it takes me about 40 seconds to init the pipeline; how much memory does your machine have?
@irexyc I'm using Modal for deployment, running on an A100 80 GB. Can you share your code? How do I deploy properly? Is it possible to get output in 4-5 seconds on an A100 or H100?
The time cost is related to image processing and the number of input and output tokens. And you should not init the model every time you receive a request.
A reasonable approach is to initialize the model once when the program starts, and use the initialized model to handle received requests.
lmdeploy.pipeline is for offline usage and pipeline.__call__ is not thread safe. For your situation, you could use lmdeploy.api.serve and lmdeploy.api.client.
You can edit the code based on this example. In the load function, use lmdeploy.api.serve to init the model. In the generate function, use lmdeploy.api.client to communicate with the model created by lmdeploy.api.serve.
Thanks! Can I please get the docs for using lmdeploy.api.client and lmdeploy.api.serve?
We don't have docs about it, but the usage is similar to serve and client in DeepSpeed-MII: https://github.com/microsoft/DeepSpeed-MII/blob/main/docs/source/quick-start.rst#run-a-persistent-deployment
In short, a server is started in the background (a new process), and the client communicates with that server.
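A minimal sketch of that pattern with lmdeploy.serve / lmdeploy.client (the same API used later in this thread; a text-only request is shown for brevity):
from lmdeploy import serve, client, ChatTemplateConfig

# start the api_server in a background process
server = serve('OpenGVLab/InternVL-Chat-V1-2-Plus',
               chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'),
               server_name='0.0.0.0',
               server_port=23333)

# talk to it through the OpenAI-style client
handle = client(api_server_url='http://0.0.0.0:23333')
model_name = handle.available_models[0]
for out in handle.chat_completions_v1(model=model_name,
                                      messages=[{'role': 'user', 'content': 'Hello'}]):
    print(out)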
We have a user guide; please check this one: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html
I tried that, but I couldn't figure it out. Can you please help me set things up?
import os
import time
import modal
GPU_CONFIG = modal.gpu.A100(memory=80, count=1)
vllm_image = (
    modal.Image.micromamba()
    .apt_install("git")
    .micromamba_install(
        "cudatoolkit=11.8",
        "cudnn=8.1.0",
        "cuda-nvcc",
        channels=["conda-forge", "nvidia"],
    )
    .pip_install(
        "lmdeploy==0.4.0",
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "vllm==0.4.1",
        "peft==0.8.2",
        "lmdeploy==0.4.0",
    )
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .env({"CUDA_HOME": "/usr/local/cuda"})
)

app = modal.App("test-app")

@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    def start_engine(self):
        from lmdeploy import pipeline, ChatTemplateConfig
        from lmdeploy.vl import load_image
        import lmdeploy
        client = lmdeploy.api.client("OpenGVLab/InternVL-Chat-V1-2-Plus", chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
        print(client)

    @modal.method()
    async def completion_stream(self, user_question):
        from lmdeploy.vl import load_image
        import lmdeploy
        image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
        response = lmdeploy.api.serve("OpenGVLab/InternVL-Chat-V1-2-Plus", chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'))
        print(response)
Init the server in model initialization and use the client for user requests.
Something like this:
...
def start_engine(self):
    from lmdeploy import serve, ChatTemplateConfig
    self.server = serve('OpenGVLab/InternVL-Chat-V1-2-Plus',
                        chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'),
                        server_name='0.0.0.0',
                        server_port=23333)
...
async def completion_stream(self, user_question):
    ...
    from lmdeploy import client
    handle = client(api_server_url='http://0.0.0.0:23333')
    model_name = handle.available_models[0]
    # construct message, openai format
    messages = {
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Describe the image please',
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
                },
            }
        ],
    }
    # handle outputs, you should process and output the outputs.
    outputs = handle.chat_completions_v1(model=model_name, messages=messages)
    for out in outputs:
        print(out)
    ...
Thanks 🙏 But I'm not sure why I'm getting this "Input should be a valid string" error:
Error:
"detail": [
{
"type": "string_type",
"loc": [
"body",
"messages",
"str"
],
"msg": "Input should be a valid string",
"input": {
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image please"
},
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg"
}
}
]
}
},
{
"type": "list_type",
"loc": [
"body",
"messages",
"list[dict[str,any]]"
],
"msg": "Input should be a valid list",
"input": {
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image please"
},
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg"
}
}
]
}
}
]
@Iven2132
change this line
outputs = handle.chat_completions_v1(model=model_name, messages=[messages])
@irexyc It was a format issue; I fixed that, but now the same problem is back: it's still slow and takes 2-3 minutes to give the output. Can you suggest what I should do?
It's loading the model again from the start.
I am not familiar with Modal. But according to their docs, the model should not be initialized multiple times during the container_idle_timeout:
@app.cls(
    gpu=GPU_CONFIG,
    secrets=[modal.Secret.from_name("huggingface-secret")],
    container_idle_timeout=10 * MINUTES,
)
class Model:
    @modal.enter()
    def load(self):
        """Loads the TRT-LLM engine and configures our tokenizer.
        The @enter decorator ensures that it runs only once per container, when it starts."""
Yeah, thank you very much! You are so knowledgeable and helpful. Why don't you work at Modal? They are hiring: https://jobs.ashbyhq.com/modal. I can connect you with the Modal team.
Let’s concentrate on discussing the issue, rather than attempting to lure away my team member right before me
The issue has been fixed. Thank you, guys! Just one last question: does lmdeploy support all models that fall under the supported model architectures in the README?
Also, I got an "OpenGVLab--InternVL-Chat-V1-5 does not appear to have a file named preprocessor_config.json" error with the OpenGVLab/InternVL-Chat-V1-5 model, so what's the fix for that? I don't get any error with OpenGVLab/InternVL-Chat-V1-2-Plus, so what's the problem with V1-5? I am using internvl-zh-hermes2 as the chat template.
Does lmdeploy support all models that fall under the supported model architecture in the Readme?
lmdeploy should support it if the modeling_xxx.py files are the same as the official repo's. Another chat template may need to be added if a different one from the official repo was used when training or tuning the model.
InternVL-Chat-V1-5 has a different vision model, and the base LLM is internlm2, which is different from v1-2 or v1-2-plus. Therefore, some adaptation code had to be written. This PR supports InternVL-Chat-V1-5.
To use this PR, you should build lmdeploy yourself according to this doc, or install lmdeploy 0.4.0 and replace the installed Python code with the latest main branch.
I don't think I can run Docker inside Modal, as build_all_wheel.sh runs some Docker commands. Will you release a new version of the package soon?
We will release 0.4.1 on May 8
OK. Can you tell me whether InternVL-Chat-V1-5 requires flash attention? Because on Modal I can't use Flash Attention or Docker at the same time.
Or maybe you can tell me how to set export NCCL_ROOT_DIR=/path/to/nccl/build and export NCCL_LIBRARIES=/path/to/nccl/build/lib.
@irexyc Just a note of clarification: Modal doesn't use Docker and I can't run Docker inside Modal. They run containers that largely comply with the OCI image format (aka "Docker image"), but the runtime Modal uses isn't Docker (they run either runc [the runtime Docker uses] or gVisor). So I will not be able to run anything that requires Docker inside of Modal. That is, I can't run anything that calls the docker program.
So what are some other ways to use the latest lmdeploy build? I tried the "Build in localhost" option but then I got a lot of path issues.
flash_attn does nothing for inference, as we actually run the LLM part with lmdeploy (turbomind backend) instead of transformers. But we use transformers to load the vision model, and in the config.json file the line "attn_implementation": "flash_attention_2" will ask you to install flash_attn when you load the model with transformers. You can change this line to "attn_implementation": "eager" to remove the need for flash_attn.
When I said "build lmdeploy yourself", I did not mean building lmdeploy inside Modal, but building the lmdeploy whl package on your local machine and installing the built whl on Modal. We suggest building the whl in Docker so you won't hit the path issues.
Or you can fork our repo and use the GitHub Action to compile the latest code.
The action will fail before the publish stage (because you haven't set the PyPI token), and when it fails, you can download the whl package (artifact) like in this link.
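For example, a small helper (my own sketch, not part of lmdeploy; the model directory is a placeholder) that flips that setting in the downloaded snapshot's config.json:
import json, os

model_dir = '/path/to/InternVL-Chat-V1-2-Plus'  # placeholder: local snapshot directory
cfg_path = os.path.join(model_dir, 'config.json')
with open(cfg_path) as f:
    cfg = json.load(f)

def set_eager(node):
    # flip any "attn_implementation" entry to "eager"; the key may sit at the
    # top level or inside a sub-config depending on the snapshot
    if isinstance(node, dict):
        if 'attn_implementation' in node:
            node['attn_implementation'] = 'eager'  # was "flash_attention_2"
        for value in node.values():
            set_eager(value)

set_eager(cfg)
with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=2)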
I'm confused. I forked the repository, used GitHub Actions to compile the latest code, and downloaded the built wheel package (artifact) from the failed action. What do I do after that? I will still have to build the package, right?
Is there a Dockerfile that will build lmdeploy? I can run a single Dockerfile in Modal: https://modal.com/docs/guide/custom-container#bring-your-own-image-definition-with-from_dockerfile:~:text=Registry%20images.-,Bring%20your%20own%20image%20definition%20with%20.from_dockerfile,-Sometimes%2C%20you%20might
@irexyc BTW, if you want to play with Modal, they are offering $30 in free credits too. As I'm on a Mac, what's the NCCL_ROOT_DIR path? I can see three cuda entries in my /usr/local folder (cuda, cuda-11, cuda-11.8), but when I check nvcc --version it says Build cuda_11.8.r11.8/compiler.31833905_0ff, so what CUDA version am I running exactly?
Hi, I made a whl file, but when I download it and install it using pip it gives an error: ERROR: lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64-XlFsI2YVNbXef3NoivrwF6jnEJNJUo.whl is not a valid wheel filename.
Here is that file: https://eiv2wqohq4a67zye.public.blob.vercel-storage.com/lmdeploy-0.4.0-cp310-cp310-manylinux2014_x86_64-XlFsI2YVNbXef3NoivrwF6jnEJNJUo.whl
I'm on Python 3.10.
Hi @irexyc and @lvhan028, I managed to install the build, but can you tell me the proper way to do few-shot prompting (also called in-context learning)? How do I give the previous context?
Here is my code, which is currently not working:
from lmdeploy import client

handle = client(api_server_url='http://0.0.0.0:23333')
model_name = handle.available_models[0]
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Where was this image taken?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "London"
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Where was this image taken?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "New York City"
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Do you think all the previous conversation we had, your answer were correct or not? tell me all the correct answer from our revious conversation"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                }
            }
        ]
    }
]
outputs = handle.chat_completions_v1(model=model_name, messages=messages)
for out in outputs:
    print(out)
### Motivation
OpenGVLab/InternVL-Chat-V1-2-Plus is an open-source alternative to GPT-4V. Can we please have support for it?
Related resources
https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus