csuhan / OneLLM

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

Some confusion about the modalities of depth/normal maps. #9

Closed: laion101 closed this issue 10 months ago

laion101 commented 10 months ago

Thank you for your outstanding work.

I noticed that when running the demo you provided, QA inference in the depth/normal-map modalities seems to require both the RGB image and the depth/normal map together to obtain accurate answers. If only the depth/normal map is provided, the system appears unable to answer questions.

Could you clarify whether this is the intended behavior? The paper suggests that QA inference can be performed on depth/normal information alone.

(Screenshot: 2024-01-05 21:16:08)


laion101 commented 10 months ago
(Screenshot: 2024-01-05 22:01:59)
csuhan commented 10 months ago

Hi @laion101, currently we need a paired depth map and RGB image as input. In general, depth/normal maps serve as auxiliary information for image perception, such as RGB-D classification/detection. Therefore we use both RGB images and depth maps as input, where the depth maps help the model understand the RGB images.
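
For readers who hit the same issue, here is a minimal sketch of what "paired input" looks like on the caller's side. The `load_onellm` loader and the `generate` signature below are hypothetical placeholders for illustration, not the actual demo API; the only point taken from this thread is that the depth/normal map accompanies, rather than replaces, the RGB image.

```python
from PIL import Image
from torchvision import transforms

# CLIP-style preprocessing; the demo's exact transform may differ.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

# Both inputs are needed: the RGB image carries the primary visual content,
# and the depth map is auxiliary.
rgb = preprocess(Image.open("scene_rgb.png").convert("RGB")).unsqueeze(0)      # [1, 3, 224, 224]
depth = preprocess(Image.open("scene_depth.png").convert("RGB")).unsqueeze(0)  # [1, 3, 224, 224]

model = load_onellm("csuhan/OneLLM-7B")  # hypothetical loader, for illustration only
answer = model.generate(
    prompt="What objects are in this scene?",
    image=rgb,    # required: RGB image
    depth=depth,  # auxiliary: paired depth map
)
print(answer)
```

Passing only `depth` in a setup like this would reproduce the behavior reported above: without the RGB image, the model has no primary visual signal to ground its answer.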

laion101 commented 10 months ago

Got it, thanks for your reply!