QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Apache License 2.0

Some SFT Questions #129

Open wubangcai opened 2 months ago

wubangcai commented 2 months ago

In the README file, I only found instructions on how to set the image size during inference, but how do I set the image resolution during SFT with LLaMA-Factory?

huynhbaobk commented 2 months ago

I found that in the file LLaMA-Factory/src/llamafactory/data/mm_plugin.py there is a function _regularize_images that reads image_resolution: int = getattr(processor, "image_resolution", 512). However, the Qwen2-VL processor has no image_resolution attribute, so this always falls back to 512. I'm not sure whether the author intended this setting to be used during training; we need confirmation on this.
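For illustration, a minimal sketch of the lookup described above (the model ID is just an example; the getattr line is the one quoted from mm_plugin.py):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The Qwen2-VL processor defines no image_resolution attribute,
# so this lookup always falls back to the default of 512.
image_resolution: int = getattr(processor, "image_resolution", 512)
print(image_resolution)  # 512
```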

HenryHe0123 commented 2 months ago

Same question. I tried to SFT directly on 1080p images with LLaMA-Factory's example settings, but encountered RuntimeError: shape mismatch. I suspect it's because I didn't set the image resolution.

hiyouga commented 2 months ago

We have an image_resolution argument to control the maximum width or height of input images. Use --image_resolution 1024 to specify it.

https://github.com/hiyouga/LLaMA-Factory/blob/90d6df622252c6fad985f68b97771c979357e2fc/src/llamafactory/hparams/model_args.py#L59-L62
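The linked lines declare it as a dataclass field, roughly like the sketch below (the surrounding class name and the help text here are illustrative, not a copy of model_args.py):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessorArguments:  # hypothetical container; see the linked model_args.py for the real one
    image_resolution: int = field(
        default=512,
        metadata={"help": "Keeps the maximum width or height of input images below this value."},
    )
```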

wubangcai commented 2 months ago

@hiyouga Thank you for your reply. I have checked the relevant resolution settings. However, a more confusing point is that Qwen2-VL is supposed to support dynamic resolution: I would expect training to scale the original image down to fit within the maximum supported resolution rather than to a fixed resolution. At present I don't see that operation in the code; of course, it is possible I simply missed it, so I hope the author can clarify.

hiyouga commented 2 months ago

@wubangcai We also support dynamic resolution. We only resize the image if its width or height exceeds the image_resolution parameter.

https://github.com/hiyouga/LLaMA-Factory/blob/bdde35fd2e4a919c1d63ebfc9a0ea8ba0c97e14c/src/llamafactory/data/mm_plugin.py#L77-L80
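The linked lines do roughly the following (a paraphrased sketch, not a verbatim copy of mm_plugin.py; the function name is illustrative):

```python
from PIL import Image

def preprocess_image(image: Image.Image, image_resolution: int = 512) -> Image.Image:
    # Only downscale when width or height exceeds image_resolution;
    # smaller images keep their native (dynamic) resolution.
    if max(image.width, image.height) > image_resolution:
        resize_factor = image_resolution / max(image.width, image.height)
        new_size = (int(image.width * resize_factor), int(image.height * resize_factor))
        image = image.resize(new_size, resample=Image.Resampling.NEAREST)
    return image
```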

wubangcai commented 2 months ago

@hiyouga Thank you, I get it. @huynhbaobk Thanks again for your reply. We can set a larger image_resolution to avoid the resize, and then the smart_resize function in transformers/models/qwen2_vl/image_processing_qwen2_vl.py will do the dynamic scaling.
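For anyone who lands here later, this is roughly what smart_resize does, paraphrased from transformers/models/qwen2_vl/image_processing_qwen2_vl.py (the default factor and pixel bounds may differ between versions, so treat them as assumptions):

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280) -> tuple[int, int]:
    # Round both sides to multiples of `factor` (patch size * merge size)
    # while keeping the total pixel count within [min_pixels, max_pixels]
    # and roughly preserving the aspect ratio.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

So as long as the image handed to the processor has not already been shrunk by image_resolution, each image keeps its own dynamically chosen size and token count.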