This is an interesting work!
I'm new to this type of representation learning, so I have a few questions about the downstream tasks; apologies if they are basic.
How are the downstream tasks conducted? My understanding is that the 3D model is encoded into embeddings that live in the same representation space as text or images. Is this prior knowledge that readers are expected to have?
Do you have any quantitative baselines for the text-3D retrieval task?
We train Uni3D to align its 3D representations with the 2D/text representations of CLIP models, producing an aligned 3D-2D-text feature space. These aligned 3D embeddings are the key to the downstream tasks (see the sketch below): once a shape, an image, and a caption all map into the same space, retrieval and zero-shot recognition reduce to similarity search in that space.
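As a rough illustration of how a downstream task like text-3D retrieval can be run in this aligned space, here is a minimal sketch. The encoder callables (`uni3d_encoder`, `clip_text_encoder`) are hypothetical stand-ins for the Uni3D point-cloud encoder and the frozen CLIP text encoder, not the actual API of the released code.

```python
# Minimal sketch: text-to-3D retrieval via cosine similarity in the shared space.
# `uni3d_encoder` and `clip_text_encoder` are hypothetical stand-ins for the
# real Uni3D point-cloud encoder and the frozen CLIP text encoder.
import torch
import torch.nn.functional as F

def text_to_3d_retrieval(point_clouds, captions, uni3d_encoder, clip_text_encoder):
    # Encode each 3D shape into the aligned 3D-2D-text feature space and L2-normalize.
    shape_emb = F.normalize(uni3d_encoder(point_clouds), dim=-1)   # (N, D)
    # Encode the text queries with the frozen CLIP text tower and L2-normalize.
    text_emb = F.normalize(clip_text_encoder(captions), dim=-1)    # (M, D)
    # Cosine similarity between every text query and every shape embedding;
    # the highest-scoring shape per query is the retrieved result.
    similarity = text_emb @ shape_emb.t()                           # (M, N)
    return similarity.argmax(dim=-1)                                # index of best shape per query
```

Image-3D retrieval and zero-shot classification work the same way, swapping the text queries for image embeddings or category-name prompts.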
No. There is no commonly used benchmark for 3D-2D-text retrieval tasks.