InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Grounding ability of xcomposer2-4khd #261

Closed. laserwave closed this issue 2 months ago

laserwave commented 2 months ago

Does xcomposer2-4khd support the REC (referring expression comprehension) and REG (referring expression generation) tasks? Does dynamic resolution make these two tasks hard to learn?

LightDXY commented 2 months ago

Hi, this is an interesting question. Our 4khd model does support grounding (by the way, xcomposer2 also has strong grounding capability). We provide the image width and height to the model, and it predicts pixel coordinates directly. Here are a few examples; 4khd works well on images of different sizes.

[Five example images attached]
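
For readers who want to try this, here is a minimal Python sketch of the flow described above: state the image width and height in the prompt, then read a pixel-coordinate box from the reply. The prompt wording, the function names `build_grounding_prompt` and `parse_pixel_box`, and the `[x1, y1, x2, y2]` reply format are all assumptions made for illustration; the thread does not specify the actual template or output format used by the 4khd model.

```python
import re
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def build_grounding_prompt(expression: str, width: int, height: int) -> str:
    """Build a grounding prompt that states the image size.

    Hypothetical wording: the real prompt template is not given in this thread.
    """
    return (
        f"The image is {width} pixels wide and {height} pixels high. "
        f"Give the bounding box of: {expression}"
    )


def parse_pixel_box(reply: str, width: int, height: int) -> Optional[Box]:
    """Extract the first four numbers from a reply as a pixel box.

    Assumes the reply lists x1, y1, x2, y2 in pixel units. Returns None
    when fewer than four numbers are found.
    """
    nums = re.findall(r"-?\d+(?:\.\d+)?", reply)
    if len(nums) < 4:
        return None
    x1, y1, x2, y2 = (int(round(float(n))) for n in nums[:4])
    # Order the corners so that (x1, y1) is top-left and (x2, y2) is bottom-right.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    # Clamp to the image bounds, since predictions may fall slightly outside.
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    return (x1, y1, x2, y2)
```

Because the coordinates are in pixels rather than normalized to a fixed range, the same parsing applies to images of any size; only the `width` and `height` arguments change.
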
laserwave commented 2 months ago

thank you

laserwave commented 2 months ago

@LightDXY Hi, does it also support REG, i.e. image captioning or VQA for a given image region/bounding box?

LightDXY commented 2 months ago

Hi, REG is also supported, since it is the symmetric counterpart of grounding. As for VQA on a given image region/bounding box, we did not use such data in training.