Thanks for sharing the paper and code. It's great work!
After reading the paper, I find CG3D and ULIP very similar in terms of methodology: both pre-train on (point cloud, image, text) triplets with a contrastive loss. The evaluated tasks are also similar, namely zero-shot point cloud recognition, retrieval, and fine-tuning the pre-trained 3D backbones on downstream tasks. Even the hardware is the same, e.g., 8 A100 GPUs.
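To make the point about the shared recipe concrete, here is a minimal sketch (not the exact CG3D or ULIP code, whose encoders and loss weightings differ) of the common objective both papers describe: aligning point cloud features with CLIP image/text features via a symmetric InfoNCE loss over each modality pair. Function names and the optional image-text term are my own illustration.

```python
# Minimal sketch of CLIP-style tri-modal contrastive pre-training
# on (point cloud, image, text) triplets; not the authors' implementation.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def triplet_pretrain_loss(pc_feat, img_feat, txt_feat):
    # Pull point cloud features toward both CLIP image and text features;
    # the image-text pair is typically left out since CLIP already aligns them.
    return info_nce(pc_feat, img_feat) + info_nce(pc_feat, txt_feat)


if __name__ == "__main__":
    # Random features stand in for the outputs of the 3D, image, and text encoders.
    B, D = 8, 512
    loss = triplet_pretrain_loss(torch.randn(B, D),
                                 torch.randn(B, D),
                                 torch.randn(B, D))
    print(loss.item())
```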
A minor difference is that CG3D uses learnable prompts in the visual encoder of CLIP, while ULIP does not use prompt tuning. In addition, CG3D evaluates a scene querying task that ULIP does not. Finally, ULIP released its curated triplet dataset, whereas CG3D has not.
Since ULIP (CVPR'23) was published before CG3D (ICCV'23), I think the authors should discuss the differences between the current work and ULIP in the paper.