[question] Any plan to add multi-image reasoning, or even multi-frame reasoning for video understanding?

Yangyi-Chen / SOLO

Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"

Apache License 2.0

111 stars, 3 forks

Hi authors of this great project!

Fuyu-8B is great for its flexibility to accept any aspect ratio or resolution, which makes it well suited to UI understanding, but we don't know how it was trained. Now we have SOLO!

Just a quick question: the paper doesn't mention anything about multi-image reasoning. Is there a plan to do instruction tuning on a multi-image dataset such as Mantis?

One image at ≈1024 tokens of context might be expensive, but we could scale the image down to 224x224 for high-level semantics, so one video frame would take only ≈49 tokens. In combination with recent 1M-context-length tricks, we could do long video understanding using SOLO. By adding something like TokenPacker between the projection layer and the embedding layer, we might be able to use 32 tokens to represent a video frame.
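
For a rough sense of the numbers, here is a back-of-the-envelope sketch of the token arithmetic. It assumes a 32x32 pixel patch size, which I inferred from 224x224 mapping to ≈49 tokens and is not stated here, and it ignores any special or separator tokens the model may add per image or per row. `patch_tokens` and `frames_in_context` are hypothetical helper names, not part of the SOLO codebase.

```python
import math

def patch_tokens(width, height, patch=32):
    """Number of patch tokens for one image.

    patch=32 is an assumption inferred from 224x224 -> ~49 tokens;
    special/separator tokens are not counted.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)

def frames_in_context(context_len, tokens_per_frame, reserved=0):
    """How many frames fit in a context window, leaving `reserved`
    tokens for the text prompt and the answer."""
    return (context_len - reserved) // tokens_per_frame

print(patch_tokens(1024, 1024))          # 1024 tokens for a 1024x1024 image
print(patch_tokens(224, 224))            # 49 tokens for a 224x224 frame
print(frames_in_context(1_000_000, 49))  # 20408 frames at 49 tokens each
print(frames_in_context(1_000_000, 32))  # 31250 frames at 32 tokens each
```

So under these assumptions a 1M-token context would hold roughly 20k frames at 224x224, or about 31k frames if each frame could be compressed to 32 tokens, before reserving any room for text.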

[question] Any plan to add multi-image reasoning, or even multi-frame reasoning for video understanding? #3