InternLM / InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Regarding "<ImageHere>" #225

Closed: thiner closed this issue 2 months ago

thiner commented 3 months ago
  1. Is <ImageHere> a fixed placeholder in text prompt?
  2. What kind of value does the VL model expect: a path, a URL, or a base64-encoded image?

yuhangzang commented 3 months ago

Hi thiner, you may refer to this line and this line for your questions.

thiner commented 3 months ago

@yhcao6 Thanks for your answer. I'd like to summarize what I learned from the code; please correct me if I misunderstood the logic.

  1. <ImageHere> is a fixed placeholder that separates the image from the text prompt.
  2. XComposer-VL expects the image input to be either a path that PIL.Image.open can open or a torch.Tensor instance.

Based on the summary above, I have a further question: does XComposer-VL support multiple images as input? I think it is not supported currently, is it?

yuhangzang commented 3 months ago

XComposer-VL supports multiple images as input, e.g., query = '<ImageHere> <ImageHere> balabala', img_path = ['a.jpg', 'b.jpg']
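
To make the pairing concrete, here is a minimal sketch of how a query with several <ImageHere> placeholders could be matched against a list of image paths. It is only an illustration under assumptions, not the repository's actual code: the helper name split_query and the (kind, value) tuple format are hypothetical, and the real model code may handle the placeholders differently.

```python
# Hypothetical helper for illustration; not part of InternLM-XComposer.
IMAGE_PLACEHOLDER = "<ImageHere>"


def split_query(query, image_paths):
    """Interleave text segments and image paths in placeholder order."""
    parts = query.split(IMAGE_PLACEHOLDER)
    n_slots = len(parts) - 1  # one slot per placeholder occurrence
    if n_slots != len(image_paths):
        raise ValueError(
            f"query has {n_slots} placeholder(s) but "
            f"{len(image_paths)} image(s) were given"
        )
    segments = []
    for i, text in enumerate(parts):
        if text.strip():
            segments.append(("text", text.strip()))
        if i < n_slots:
            # The i-th placeholder is filled by the i-th image path.
            segments.append(("image", image_paths[i]))
    return segments


query = "<ImageHere> <ImageHere> balabala"
img_path = ["a.jpg", "b.jpg"]
print(split_query(query, img_path))
```

The point of the sketch is that the number of placeholders should equal the number of images, and that the images fill the placeholders in order.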

yuhangzang commented 2 months ago

Kindly reopen this issue if you have any further questions.