InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Sending image from gradio interface to 4bit model #2075

Open · zhuraromdev opened this issue 1 month ago

zhuraromdev commented 1 month ago


Describe the bug

Hello, I have created a Space on HF and am trying to send an image from the Gradio input to the quantized model, but I am getting an error. Is it possible to read an image from the local machine, or is passing a URL the only way to send an image to the VLM offline pipeline?
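
For reference, the VLM offline pipeline should not require a URL: a local file can be opened with PIL (or fetched with lmdeploy.vl.load_image, which also accepts local paths) and passed to the pipeline directly. Below is a minimal sketch, assuming lmdeploy 0.5.x and a vision-language model; the model path, URL, and image path are hypothetical placeholders.

from PIL import Image
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model_id = "PATH_TO_MODEL"  # hypothetical placeholder
pipe = pipeline(model_id, backend_config=TurbomindEngineConfig(cache_max_entry_count=0.2, tp=1))

# Option 1: load via the helper shipped with lmdeploy (URL or local path)
image = load_image("https://example.com/image.jpg")  # placeholder URL

# Option 2: any PIL.Image works, e.g. one opened from disk or produced by
# a gradio gr.Image(type="pil") component
image = Image.open("./local_image.png")  # hypothetical local path

# The VLM pipeline accepts a (text, image) tuple as a prompt
response = pipe(("Describe this image.", image))
print(response.text)

Since gr.Image(type="pil") already hands the callback a PIL.Image, no base64 or URL round-trip should be needed.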

Reproduction

(Screenshot attached: 2024-07-18 13:34:41)

Environment

sys.platform: linux
Python: 3.10.14 (main, Jul 12 2024, 13:17:12) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.107
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.1+7263e03
transformers: 4.42.4
gradio: 4.36.1
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-3     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

=== Application stopped (exit code: 1) at 2024-07-18 11:20:18.299319328 UTC ===
Set max length to 16384
Dummy Resized

Convert to turbomind format:   0%|          | 0/32 [00:00<?, ?it/s]
Convert to turbomind format:   3%|▎         | 1/32 [00:01<00:39,  1.26s/it]
Convert to turbomind format:  28%|██▊       | 9/32 [00:04<00:11,  1.93it/s]
Convert to turbomind format:  72%|███████▏  | 23/32 [00:06<00:02,  4.37it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
INFO:httpx:HTTP Request: GET https://checkip.amazonaws.com/ "HTTP/1.1 200 "
Running on local URL:  http://0.0.0.0:7860
INFO:httpx:HTTP Request: GET http://localhost:7860/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://localhost:7860/ "HTTP/1.1 200 OK"

To create a public link, set `share=True` in `launch()`.
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
INFO:matplotlib.font_manager:generated new fontManager
INFO:__main__:Received a new message.
WARNING:__main__:Expected history to be a list of tuples, but got string: You are a friendly Chatbot.
ERROR:__main__:Error during chat completion: '>' not supported between instances of 'Image' and 'int'
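
Two things stand out in this log. The warning shows that `history` received the system-message string, which suggests the callback arguments are shifted (see the note after the code listing below). The final TypeError is simply what Python raises when a PIL Image object is compared with an int, i.e. when the image ends up in a slot where a numeric generation parameter is expected. A minimal illustration (not lmdeploy code):

from PIL import Image

img = Image.new("RGB", (4, 4))
img > 0  # raises: TypeError: '>' not supported between instances of 'Image' and 'int'
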
zhuraromdev commented 1 month ago

Code:

import gradio as gr
from PIL import Image
import io
import logging
import base64
from lmdeploy import pipeline, TurbomindEngineConfig

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Model setup
model_id = "PATH_TO_PRIVATE_QUANTIZED_HF_MODEL"
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2, tp=1)
pipe = pipeline(model_id, backend_config=backend_config)

def respond(
    message,
    image,
    history,
    system_message,
    max_tokens,
    temperature,
    top_p,
):
    logger.info("Received a new message.")
    messages = [{"role": "system", "content": str(system_message)}]  # Ensure system_message is a string

    # Ensure history is a list of tuples
    if isinstance(history, str):
        logger.warning(f"Expected history to be a list of tuples, but got string: {history}")
        history = []
    elif isinstance(history, list):
        for val in history:
            if isinstance(val, list) and len(val) == 2:
                messages.append({"role": "user", "content": val[0]})
                messages.append({"role": "assistant", "content": val[1]})
            else:
                logger.warning(f"Unexpected format in history: {val}")
    else:
        logger.warning(f"Unexpected type for history: {type(history)}")

    messages.append({"role": "user", "content": message})

    image_prompt = None
    if isinstance(image, Image.Image):
        try:
            buffered = io.BytesIO()
            image.save(buffered, format="PNG")
            img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
            image_prompt = {'type': 'image_url', 'image_url': {'url': 'data:image/png;base64,' + img_str}}
        except Exception as e:
            logger.error(f"Error processing image: {e}")

    prompts = [
        {
            'role': 'user',
            'content': [{'type': 'text', 'text': message}]
        }
    ]

    if image_prompt:
        prompts[0]['content'].append(image_prompt)

    response = ""

    try:
        responses = pipe(prompts, max_tokens=max_tokens, temperature=temperature, top_p=top_p)
        response = responses[0]['content']
        yield response
    except Exception as e:
        logger.error(f"Error during chat completion: {e}")
        yield "An error occurred during the chat completion process."

demo = gr.ChatInterface(
    fn=respond,
    additional_inputs=[
        gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p (nucleus sampling)"),
        gr.Image(type="pil", label="Input Image (optional)")
    ],
)

if __name__ == "__main__":
    demo.launch()
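
A note on the snippet above (a sketch under assumptions, not a confirmed fix): gr.ChatInterface calls fn(message, history, *additional_inputs), with the extra inputs passed in the order they appear in additional_inputs. With the signature used here, `history` receives the system-message string (matching the logged warning) and the PIL image lands in `top_p`, which would explain the Image-vs-int comparison error. In lmdeploy 0.5.x, sampling parameters are also normally passed via GenerationConfig rather than as keyword arguments to pipe(). A hedged sketch of a callback whose parameters match the component order:

from lmdeploy import GenerationConfig

# Sketch only: parameter order mirrors
# additional_inputs=[system_message, max_tokens, temperature, top_p, image]
def respond(message, history, system_message, max_tokens, temperature, top_p, image):
    # system_message/history handling omitted for brevity
    gen_config = GenerationConfig(
        max_new_tokens=int(max_tokens),
        temperature=float(temperature),
        top_p=float(top_p),
    )
    if image is not None:  # gr.Image(type="pil") already yields a PIL.Image
        response = pipe((message, image), gen_config=gen_config)  # pipe from the setup above
    else:
        response = pipe(message, gen_config=gen_config)
    yield response.text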