baaivision / Uni3D

[ICLR'24 Spotlight] Uni3D: 3D Visual Representation from BAAI

Question about the downstream tasks #2

Closed. yuanze1024 closed this issue 1 year ago.

yuanze1024 commented 1 year ago

This is interesting work! I'm new to this type of representation learning, and I have a few questions about the downstream tasks. I apologize for any inconvenience my questions may cause.

  1. How are the downstream tasks conducted? I understand that the 3D model is encoded into embeddings that lie in the same representation space as text or images. Is this prior knowledge that you expect all readers to have?
  2. Do you have a quantitative baseline for the text-3D retrieval task?
junshengzhou commented 1 year ago

Hi, thanks for your interest in our work.

  1. We train Uni3D to align its 3D representations with the 2D/text representations of CLIP models, producing a shared 3D-2D-text feature space. These aligned 3D embeddings are the key to the downstream tasks: retrieval and zero-shot recognition reduce to similarity comparisons in this shared space (see the sketch after this list).
  2. No. There is no commonly used benchmark for 3D-2D-text retrieval tasks.
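
Concretely, once 3D and text embeddings live in the same space, text-3D retrieval and zero-shot classification are just cosine-similarity lookups. Below is a minimal sketch of that math only: random tensors stand in for the actual Uni3D point-cloud embeddings and CLIP text embeddings, and none of the names here are the repository's real API.

```python
import torch
import torch.nn.functional as F

# Placeholders: in practice these would come from the Uni3D point-cloud
# encoder and a frozen CLIP text encoder. Random tensors are used here
# so that the retrieval math itself is runnable end to end.
num_shapes, num_queries, dim = 8, 3, 1024
shape_emb = torch.randn(num_shapes, dim)   # stand-in for Uni3D 3D embeddings
text_emb = torch.randn(num_queries, dim)   # stand-in for CLIP text embeddings

# L2-normalize so that the dot product equals cosine similarity.
shape_emb = F.normalize(shape_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)

# Similarity matrix: one row per text query, one column per shape.
sim = text_emb @ shape_emb.t()             # shape: (num_queries, num_shapes)

# Text-to-3D retrieval: for each query, rank shapes by similarity.
ranked = sim.argsort(dim=-1, descending=True)
print(ranked[:, :3])                       # top-3 retrieved shapes per query

# Zero-shot classification is the same operation with class-name prompts
# on the text side, followed by a softmax over classes.
probs = (100.0 * sim).softmax(dim=-1)
```

Image-3D retrieval works the same way, with CLIP image embeddings in place of the text embeddings.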
junshengzhou commented 1 year ago

I am closing this issue. If you have any more questions, please feel free to reopen it or create a new issue. :)