JiayiGuo821 opened this issue 3 months ago
I think the most viable way is to use sglang's batch_run interface.
https://github.com/EvolvingLMMs-Lab/sglang/tree/dev/onevision_main
After launching the backend service, run this file.
You can use the early version of this feature; see the PR:
https://github.com/sgl-project/sglang/pull/1123
We will update both our repo and the sglang side to make this more convenient for everyone to use.
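For reference, a rough sketch of that flow against a locally launched server; the model path, port, and flags below are assumptions based on the generic SGLang launcher, so check the linked branch for the exact command:

# In a separate shell, start the backend first (standard sglang.launch_server
# options; the OneVision branch may require additional flags):
#   python -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port 30000

import sglang as sgl

# Point the frontend at the running server before calling any @sgl.function.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))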
Hi @Luodian, have you integrated it with SGLang frontend inference? I'm guessing the image and video preprocessing isn't done correctly, since I can't get it to work:
import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

def single():
    state = image_qa.run(
        image_path="images/cat.jpeg", question="What is this?", max_new_tokens=128
    )
    print(state, "\n")
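For reference, batching the same decorated function would look roughly like this with the generic SGLang frontend's run_batch; this is a sketch, not verified against the OneVision branch, and the second image path is hypothetical:

states = image_qa.run_batch(
    [
        {"image_path": "images/cat.jpeg", "question": "What is this?"},
        {"image_path": "images/dog.jpeg", "question": "What breed is this?"},  # hypothetical second example
    ],
    progress_bar=True,
)
for s in states:
    print(s["answer"])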
This is an alternative solution without using SGLang. I started with the sample code from HF (https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) and made two key changes.
First, after creating the model, you need to change the configuration from right padding to left padding:
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left' # Use left padding for batch processing
Second, you need to create a list of strings giving the modality of each element in the batch. In my case, I only used images. Then pass the list to the model's generate method:
modalities = ["image" for _ in images] # Repeat modality for every image in the batch
cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
Here is the complete script:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import copy
import torch
import os
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left' # Use left padding for batch processing
model.eval()
images_dir = "test-images"
image_files = [file for file in os.listdir(images_dir) if file.endswith('.jpg')]
images = []
for file in image_files:
    image_path = os.path.join(images_dir, file)
    images.append(Image.open(image_path))
image_tensor = process_images(images, image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nProvided there is sufficient geographical and temporal information, provide a short description of the image based on the buildings, weather, objects and environment."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
input_ids_repeated = input_ids.repeat(len(images), 1)
image_sizes = [image.size for image in images]
modalities = ["image" for _ in images] # Repeat modality for every image in the batch
cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,  # Added: modality list, one entry per image in the batch
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
for file, output in zip(image_files, text_outputs):
    print(f"\n{file}: {output}")
@dshatwell23 Thanks for your template. Do we need to resize the images to the same dimensions within each batch?
@HenryJunW When I tested it all my images had different sizes, but I had to pass a list of tuples with the individual dimensions (image_sizes) to the generate method and it handled it internally.
I am wondering how you perform the padding for the input tokens?
Is there a feasible way to conduct batch inference with LLaVA One Vision?