This is an interesting work!
I'm new to this type of representation learning, so I have a few questions about the downstream tasks; apologies if they are basic.
How are the downstream tasks conducted? My understanding is that the 3D model is encoded into embeddings that live in the same representation space as text or images. Is this prior knowledge that readers are expected to have?
Do you have any quantitative baselines for the text-3D retrieval task?
We train Uni3D to align its 3D representations with the 2D/text representations of CLIP models, producing an aligned 3D-2D-text feature space. These aligned 3D embeddings are the key to the downstream tasks (see the sketch below): once a shape, an image, and a caption all map into the same space, retrieval and zero-shot recognition reduce to similarity search in that space.
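As a rough illustration of how a downstream task like text-3D retrieval can be run in this aligned space, here is a minimal sketch. The encoder callables (`uni3d_encoder`, `clip_text_encoder`) are hypothetical stand-ins for the Uni3D point-cloud encoder and the frozen CLIP text encoder, not the actual API of the released code.

```python
# Minimal sketch: text-to-3D retrieval via cosine similarity in the shared space.
# `uni3d_encoder` and `clip_text_encoder` are hypothetical stand-ins for the
# real Uni3D point-cloud encoder and the frozen CLIP text encoder.
import torch
import torch.nn.functional as F

def text_to_3d_retrieval(point_clouds, captions, uni3d_encoder, clip_text_encoder):
    # Encode each 3D shape into the aligned 3D-2D-text feature space and L2-normalize.
    shape_emb = F.normalize(uni3d_encoder(point_clouds), dim=-1)   # (N, D)
    # Encode the text queries with the frozen CLIP text tower and L2-normalize.
    text_emb = F.normalize(clip_text_encoder(captions), dim=-1)    # (M, D)
    # Cosine similarity between every text query and every shape embedding;
    # the highest-scoring shape per query is the retrieved result.
    similarity = text_emb @ shape_emb.t()                           # (M, N)
    return similarity.argmax(dim=-1)                                # index of best shape per query
```

Image-3D retrieval and zero-shot classification work the same way, swapping the text queries for image embeddings or category-name prompts.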
No. There is no commonly used benchmark for 3D-2D-text retrieval tasks.