QwenLM / Qwen-VL

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

What are the parameters that the model accepts? #31

Open Ashwath-Shetty opened 11 months ago

Ashwath-Shetty commented 11 months ago

Where can we find the list of parameters the model accepts, e.g. max_length, min_length, temperature, and any others? It would be great if you could add the parameter list and descriptions to the README or somewhere similar.

ShuaiBai623 commented 11 months ago

In the path where you downloaded the model, there is a file called "generation_config.json" where you can adjust the parameters for generation. Alternatively, you can directly modify the parameters on "model.generation_config". For the full list of parameters, you can refer to the definition of GenerationConfig in transformers/generation/configuration_utils.py.
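For example, a minimal sketch of overriding a few common fields on the loaded generation config (the values below are purely illustrative, not recommended defaults):

from transformers import AutoModelForCausalLM
from transformers.generation import GenerationConfig

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

# Override individual fields; any attribute defined on transformers' GenerationConfig can be set this way.
model.generation_config.max_new_tokens = 256   # cap on newly generated tokens (illustrative value)
model.generation_config.do_sample = True       # sample instead of greedy decoding
model.generation_config.temperature = 0.7      # sampling temperature (illustrative value)
model.generation_config.top_p = 0.9            # nucleus sampling threshold (illustrative value)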

Ashwath-Shetty commented 11 months ago

Thanks for answering, @ShuaiBai623.

I have another question, I hope you don't mind.

Why is Qwen-VL not giving any useful output?

code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

query = tokenizer.from_list_format([
    {'image': im2}, # Either a local path or a URL
    {'text':'Compose a detailed account of the image, encompassing its visual characteristics, like colours, shapes, textures, objects, and the presence of any human subjects by paying careful attention to the specifics'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)

output:

Picture 1:<img>images/images_sample/f01-16-9780323479912.jpg</img>
Compose a detailed account of the image, encompassing its visual characteristics, like colours, shapes, textures, objects, and the presence of any human subjects by paying careful attention to the specifics of the image.<|endoftext|>

For the same prompt using Qwen-VL-Chat:

Code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

query = tokenizer.from_list_format([
    {'image': im2}, # Either a local path or a URL
    {'text': 'Compose a detailed account of the image, encompassing its visual characteristics, like colours, shapes, textures, objects, and the presence of any human subjects by paying careful attention to the specifics'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

output:

The image depicts a medical setting with two surgeons performing a procedure on a patient. Both surgeons are wearing blue gowns and are focused on the task at hand. One surgeon is standing to the left of the patient, while the other is positioned more centrally. 

The patient is lying on a bed, which occupies a significant portion of the image. A monitor is visible in the background, likely displaying important information related to the procedure. There are also a few other people in the room, likely observing the procedure or assisting in some capacity. 

Overall, the image conveys a sense of professionalism and focus as the surgeons work together to complete the medical procedure.

As you can see, Qwen-VL is not giving any useful prediction; I have tried multiple prompts, but the result is the same. Am I doing anything wrong here?

Also, I'm consistently getting an out-of-memory error for longer prompts; my GPU has 24 GB of memory.

ShuaiBai623 commented 11 months ago

"Qwen-VL-Chat, which is fine-tuned using Qwen-VL for instruction tuning, exhibits better compliance to instructions. On the other hand, Qwen-VL is primarily trained on tasks like image captioning and text generation, resulting in poorer performance in following instructions. However, you can still obtain desired outputs using approaches like few-shot learning. For example, you can format the input [{'image': img1}, {'text': example_answer}, {'image': img2}]

Ashwath-Shetty commented 11 months ago

Thanks again for the reply, @ShuaiBai623.

Does Qwen-VL support few-shot learning, or is it only for Qwen-VL-Chat?

Is there any example code or something to refer to?

I'm not sure if I understood your example correctly. Is this the right way to do few-shot learning?

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

query = tokenizer.from_list_format([
    {'image': im1}, 
    {'text': 'in the image we can see a woman petting a dog on a beach'}, # text which describes the image
    {'image': im2}, 
    {'text': 'in the image we can see a man playing with a dog in the park'},
    {'image': im3}, 
    {'text': 'explain the image in detail from what you have learned from the other images'}
])

inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)

P.S.: I'm not really able to find anything on the internet about few-shot learning in LVLMs.