Hi, thanks for sharing your amazing work. After going through your paper and some related work, I have some questions that I hope you could shed some light on. They are mainly about the downstream utilization of Point-Bind.
Question 1: This question is about the Any-to-3D Generation part. In my understanding, Point-Bind does not train the ImageBind encoder, so does this mean that for text/image/audio-to-3D tasks, your approach is no different from directly applying ImageBind features to CLIP-Forge's decoder?
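To make sure I am reading that pipeline correctly, here is a rough sketch of what I think happens; `imagebind_encode` and `clip_forge_decoder` are placeholder names I made up for the frozen ImageBind encoder and a CLIP-Forge-style shape decoder, not your actual code:

```python
import torch
import torch.nn.functional as F

def any_to_3d(prompt, modality, imagebind_encode, clip_forge_decoder):
    """Map a text/image/audio prompt to a 3D shape via the joint embedding space."""
    with torch.no_grad():                          # ImageBind stays frozen in Point-Bind
        emb = imagebind_encode(modality, prompt)   # (1, d) feature in the shared space
    emb = F.normalize(emb, dim=-1)                 # CLIP-style unit-norm embedding
    return clip_forge_decoder(emb)                 # decode the embedding into a 3D shape
```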
Question 2: This question is about the Point-LLM part. It seems to me that during training you finetuned LLaMA using the same strategy as ImageBind-LLM, and then at inference you add in features extracted by the Point-Bind encoder. I am a little confused about the difference between Point-LLM and ImageBind-LLM. Also, the ImageBind-LLM paper seems to mention that for 3D-domain instructions they utilize Point-Bind to encode the inputs.
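My mental model of that inference-time injection is something like the zero-initialized gating below; `GatedInjection` and its shapes are my own illustration, not the actual Point-LLM / ImageBind-LLM implementation:

```python
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    """Add a projected 3D feature to LLaMA word tokens through a learnable gate."""
    def __init__(self, feat_dim, llama_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llama_dim)  # map the joint space to LLaMA's hidden size
        self.gate = nn.Parameter(torch.zeros(1))    # zero-init: no effect at the start of tuning

    def forward(self, tokens, feat):
        # tokens: (B, L, D) LLaMA token embeddings; feat: (B, feat_dim) Point-Bind feature
        visual = self.proj(feat).unsqueeze(1)       # (B, 1, D)
        return tokens + self.gate * visual          # broadcast the gated feature to every token
```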
Question 3: This question is also about the Point-LLM part. During inference, you feed the Point-Bind-extracted features to the visual cache model to retrieve the top-k most similar ImageBind-encoded features. While this does address the semantic gap between the 2D and 3D encoders, doesn't it reduce 3D question answering back to some sort of 2D scene question answering? Take the example in your paper: when Point-LLM is given a point cloud of a plane and asked to describe the details of the object, it provides details about the color, which seems unlikely to be learned from the point cloud itself (in my understanding, this might be because the top-k image features have encoded color information).
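For reference, this is how I picture the cache lookup; `cache` here is a hypothetical (N, d) bank of ImageBind-encoded image features, and the weighting scheme is my guess rather than your exact formulation:

```python
import torch
import torch.nn.functional as F

def cache_lookup(point_feat, cache, k=3):
    """point_feat: (1, d) Point-Bind feature; cache: (N, d) ImageBind image features."""
    q = F.normalize(point_feat, dim=-1)
    keys = F.normalize(cache, dim=-1)
    sims = q @ keys.t()                         # (1, N) cosine similarities
    topv, topi = sims.topk(k, dim=-1)           # the k most similar 2D features
    weights = topv.softmax(dim=-1)              # similarity-weighted combination
    return weights @ cache[topi.squeeze(0)]     # (1, d) "2D-ified" feature passed to the LLM
```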
Thanks in advance, and thanks again for sharing your work.
@SARIHUST Thanks for your interest and in-depth comments! Hope our response can help.
Our final goal is to construct a general joint embedding space (ImageBind & Point-Bind) that incorporates the 3D modality into the existing any-to-any framework. Any-to-3D generation is just an initial attempt; we also utilize Point-Bind's features for point-to-mesh generation, and, for example, we have achieved 3D-to-2D generation by pairing Point-Bind's features with a 2D diffusion decoder. We will add further 'any-to-any with 3D' experiments using Point-Bind's features in a future version.
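For concreteness, a minimal sketch of that 3D-to-2D direction, assuming the diffusion decoder is conditioned on a joint-space embedding (`point_bind_encode` and `diffusion_decode` are illustrative placeholders, not our released interfaces):

```python
import torch
import torch.nn.functional as F

def point_to_image(point_cloud, point_bind_encode, diffusion_decode, steps=50):
    """Render a 2D image from a point cloud by conditioning a 2D diffusion decoder."""
    with torch.no_grad():                                           # Point-Bind encoder stays frozen
        cond = F.normalize(point_bind_encode(point_cloud), dim=-1)  # (1, d) joint-space feature
    return diffusion_decode(cond, steps)                            # sample an image conditioned on it
```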
We are from the same research group as ImageBind-LLM's authors. ImageBind-LLM can be viewed as a summary paper for many of our multi-modality instruction-tuning works, including Point-LLM.
Indeed, our advantage is being free from any 3D instruction data, which saves considerable data-collection and tuning resources. The visual cache model can effectively reduce the 2D-3D gap of the encoded features for the subsequent LLM, and the color information from 2D images is a trade-off for our data/tuning-free efficiency. As follow-up work, we have also experimented with some 3D instruction tuning, which can effectively alleviate such cases.
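Conceptually, that trade-off can be pictured as a blend between the original 3D feature and the cache-retrieved 2D feature; the ratio `beta` below is purely an illustrative knob, not a parameter from our paper:

```python
import torch.nn.functional as F

def blend_features(point_feat, retrieved_feat, beta=0.5):
    """point_feat, retrieved_feat: (1, d); larger beta lets more 2D information (e.g., color priors) in."""
    mixed = (1.0 - beta) * point_feat + beta * retrieved_feat
    return F.normalize(mixed, dim=-1)   # keep the result on the joint-space unit sphere
```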
Thanks!