-
Aiming to link natural language descriptions to specific regions of a 3D scene represented as a point cloud, 3D visual grounding is a fundamental task for human-robot interaction. The recogniti…
-
Using crowdsourcing services, we collected 63,602 descriptions for approximately 249 unique objects across 1,380 scans, forming the RIORefer dataset.
[Paper](https://arxiv.org/pdf/2305.13876) [Code](https:…
-
Thanks for sharing the work. I notice that the model can output the coordinates of 3D bounding boxes as numerical values. How can I access this data for the 3D grounding tasks?
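For reference, 3D grounding outputs in this line of work are commonly given as six numbers, a box center plus its size along each axis. Below is a minimal sketch of turning that layout into explicit corner points; the `box_center_size_to_corners` helper is illustrative and assumes an axis-aligned box, it is not taken from the repo:

```python
import numpy as np

def box_center_size_to_corners(box):
    """Convert a 3D box given as (cx, cy, cz, dx, dy, dz) into its 8 corners.

    Assumes an axis-aligned box in the (center, size) convention common to
    ScanRefer-style grounding outputs.
    """
    center, size = np.asarray(box[:3], float), np.asarray(box[3:6], float)
    # sign pattern of the 8 corners of a unit cube, scaled by half the extents
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return center + 0.5 * size * signs

corners = box_center_size_to_corners([1.0, 2.0, 0.5, 2.0, 2.0, 1.0])
print(corners.min(axis=0), corners.max(axis=0))
```

The min/max over the corners recover the box's axis-aligned extent, which is usually what IoU-based grounding metrics consume.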
-
The multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. The ov…
-
Dear authors,
I am wondering why the paper says that Vote2Cap is evaluated on the ScanRefer benchmark rather than Scan2Cap.
As far as I understand, ScanRefer takes point clouds with a text query as inputs and …
-
# Interesting papers
## The battle over camera pose estimation?
- [Pan 2024 - Global Structure-from-Motion Revisited](https://lpanaf.github.io/eccv24_glomap/)
- One of the COLMAP authors is involved. Improves the global mapping stage of COLMAP. What used to take a week …
-
[Issue format]
Paper name/title:
Project link:
Paper link:
Code link:
-
### Model description
Kosmos-2 is a grounded multimodal large language model which, compared with Kosmos-1, adds grounding and referring capabilities. The model can accept image regions select…
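Kosmos-2 expresses grounding by emitting location tokens: each `<phrase>…</phrase>` span is followed by an `<object>` span holding `<patch_index_XXXX>` tokens that index cells of a 32×32 grid over the image. The sketch below decodes that text format into normalized boxes by hand; it assumes the 32×32 grid from the Kosmos-2 report and a single top-left/bottom-right index pair per object (in practice the HuggingFace processor's post-processing handles this for you):

```python
import re

GRID = 32  # Kosmos-2 quantizes image coordinates into a 32x32 grid of bins

def patch_index_to_xy(idx, corner):
    # Map a flat patch index back to a normalized (x, y) point; the
    # bottom-right corner uses the far edge of its grid cell.
    row, col = divmod(idx, GRID)
    if corner == "br":
        row, col = row + 1, col + 1
    return col / GRID, row / GRID

def parse_grounded_boxes(text):
    """Extract (phrase, normalized xyxy box) pairs from Kosmos-2 style text."""
    pattern = re.compile(
        r"<phrase>(.*?)</phrase><object>"
        r"<patch_index_(\d+)><patch_index_(\d+)></object>"
    )
    results = []
    for phrase, tl, br in pattern.findall(text):
        x0, y0 = patch_index_to_xy(int(tl), "tl")
        x1, y1 = patch_index_to_xy(int(br), "br")
        results.append((phrase, (x0, y0, x1, y1)))
    return results

sample = ("<phrase>a snowman</phrase><object>"
          "<patch_index_0044><patch_index_0863></object>")
print(parse_grounded_boxes(sample))
```

Multiplying the normalized coordinates by the original image width and height yields pixel-space boxes.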
-
Congratulations to DeepSeek for the wonderful work. I wonder if there is a script for fine-tuning DeepSeek-VL? Thanks!
-
**Proceedings**
https://papers.nips.cc/book/advances-in-neural-information-processing-systems-30-2017
https://github.com/catpanda/NIPS_2017
**PaperLists (#Papers 679)**
https://www.dropbox.com/s…