csuhan / OneLLM

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

Some confusion about the modalities of depth/normal maps. #9

Closed: laion101 closed this issue 7 months ago

laion101 commented 8 months ago

Thank you for your outstanding work.

I noticed that when running the demo you provided, QA inference in the depth/normal-map modalities seems to require both the RGB image and the depth/normal map together to obtain accurate answers. If only the depth/normal information is provided, the system appears unable to answer questions.

Could you clarify whether this behavior in the depth/normal mode aligns with the paper, which suggests that QA inference can be performed from depth/normal information alone?

[Screenshot 2024-01-05 21:16:08]


laion101 commented 8 months ago
[Screenshot 2024-01-05 22:01:59]
csuhan commented 8 months ago

Hi @laion101, currently we need a paired depth map and RGB image as input. In general, depth/normal maps serve as auxiliary information for image perception, e.g., in RGB-D classification/detection. We therefore use both the RGB image and the depth map as input, where the depth map helps the model understand the RGB image.
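
In practice, "paired input" just means the RGB image and its depth map are preprocessed the same way and fed to the model together with one question. Below is a minimal sketch; the file names, the preprocessing pipeline, and the commented-out `model.generate(...)` call are assumptions for illustration, not OneLLM's actual interface (see the repo's demo scripts for the real entry point).

```python
# A minimal sketch of "paired input": the RGB image and the depth map are
# preprocessed identically and passed together with a single question.
# NOTE: the paths, preprocessing, and the commented-out generate call are
# illustrative assumptions, not OneLLM's actual API.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Primary content: the RGB image.
rgb = preprocess(Image.open("scene.jpg").convert("RGB"))          # (3, 224, 224)
# Auxiliary cue: the depth map, rendered as a 3-channel image.
depth = preprocess(Image.open("scene_depth.png").convert("RGB"))  # (3, 224, 224)

# Stack the pair so both views reach the model for one question.
pair = torch.stack([rgb, depth]).unsqueeze(0)  # (1, 2, 3, 224, 224)

# Hypothetical call -- replace with the demo's real inference function:
# answer = model.generate(pair, modality="depth", prompt="What is in the scene?")
```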

laion101 commented 7 months ago

Got it, thanks for your reply!