Yangyi-Chen / SOLO

Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
Apache License 2.0

[question] Any plan to add multi-image reasoning, or even multi-frame reasoning for video understanding? #3

Closed eisneim closed 3 months ago

eisneim commented 3 months ago

Hi authors of this great project!

Fuyu-8B is great for its flexibility: it accepts any aspect ratio or resolution, which is great for UI understanding. But we don't know how it was trained. Now we have SOLO!

Just a quick question: the paper doesn't mention anything about multi-image reasoning. Is there a plan to do instruction tuning on a multi-image dataset like Mantis?

One image at ≈1024 tokens of context might be expensive, but we can scale the image down to 224x224 for high-level semantics, so one video frame would be just ≈49 tokens. In combination with recent 1M-context-length tricks, we could do long video understanding using SOLO. By adding something like TokenPacker between the projection layer and the embedding layer, we might use 32 tokens to represent a single video frame.
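For reference, here is a rough sketch of the arithmetic behind those numbers. It assumes 32x32-pixel patches (which is what the ≈1024 and ≈49 figures imply: 1024/32 = 32 and 224/32 = 7 patches per side), a context of exactly 1,000,000 tokens, and it ignores any special or separator tokens the model may add per image. The function names are made up for illustration and are not part of the SOLO code.

```python
import math


def patch_tokens(width, height, patch=32):
    """Number of patch tokens for an image, assuming square patches of
    `patch` pixels and padding up to a whole number of patches.
    The patch size of 32 is an assumption inferred from the numbers above."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def frames_in_context(context_len, tokens_per_frame):
    """How many whole frames fit in a context window, ignoring text tokens
    and any per-image special tokens."""
    return context_len // tokens_per_frame


full_res = patch_tokens(1024, 1024)   # 32 * 32 patches
low_res = patch_tokens(224, 224)      # 7 * 7 patches
context = 1_000_000                   # assumed size of a "1M" context

print(full_res)                               # 1024
print(low_res)                                # 49
print(frames_in_context(context, low_res))    # 20408
print(frames_in_context(context, 32))         # 31250 (hypothetical 32 tokens per frame)
```

So under these assumptions a 1M context would hold roughly 20k frames at 224x224, or about 31k frames if each frame were compressed to 32 tokens.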

Yangyi-Chen commented 3 months ago

Hi eisneim, many thanks for your encouragement and for sharing these great papers! Yes, multi-image understanding and even video understanding are among our future plans. For now, though, we are focusing on further improving the fundamental capabilities of SOLO. We will list multi-image and video understanding as our next step. Thanks!