gaohan-cmd opened this issue 2 months ago
- We always evaluate our method with the `checkpoint_best.pth` checkpoint.
- The evaluations are similar to 1. However, there is a slight difference between our released codebase and the main paper: the reported results are trained with all the 3D-LLM data, regardless of duplications, while we drop the duplicates in our released codebase.
- Table 8 shows the effectiveness of "test-time" visual prompts, while the other tables evaluate the model with text-only interactions.
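For context, the duplicate-dropping mentioned above could look something like the sketch below. This is illustrative only: the field names (`scene_id`, `question`, `answer`) and the helper are assumptions, not the released codebase.

```python
def drop_duplicates(samples):
    """Keep the first occurrence of each (scene_id, question, answer) triple.

    `samples` is a list of dicts; the key fields are illustrative
    assumptions, not the actual 3D-LLM data schema.
    """
    seen = set()
    deduped = []
    for s in samples:
        key = (s["scene_id"], s["question"], s["answer"])
        if key not in seen:
            seen.add(key)
            deduped.append(s)
    return deduped


samples = [
    {"scene_id": "scene0000_00", "question": "What is on the table?", "answer": "a lamp"},
    {"scene_id": "scene0000_00", "question": "What is on the table?", "answer": "a lamp"},  # duplicate
    {"scene_id": "scene0001_00", "question": "How many chairs?", "answer": "three"},
]
print(len(drop_duplicates(samples)))  # → 2
```

Training on the pre-cleansing copy (with duplicates) effectively upweights the repeated samples, which is one plausible reason the reported numbers differ slightly from what the released codebase reproduces.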
Thank you very much for your response! But I still have some questions:

For answer two, does "The reported results are trained with all the 3D-LLM data" mean that when I run `bash scripts/opt-1.3b/train.generalist.sh`, I only need to use the datasets `unified_3dllm_scene_description`, `unified_3dllm_embodied_dialogue`, and `unified_3dllm_embodied_planning`, and the rest of the datasets are only used during fine-tuning?

Regarding answer three, how are "test-time" visual prompts specifically implemented in the code? Are visual prompts operations like `click` and `_encode_box_coords` in the `unified_scanqa.py` file? How can I easily control whether to use visual prompts or text prompts during testing?
More comments:

Q2 - No, "all the 3D-LLM data" refers to using the entire ScanNet part of 3D-LLM before data cleansing, which might contain duplicated training samples. We have not released this copy of the data.

Q3 - For the quantitative results in row 2 of Table 8, we naively use all the object-id annotations for both training and evaluation, since the original annotations select more objects than what is related to the question. We have not released that code either. Indeed, the text instructions are required while the visual prompts are optional; they are only adopted in tasks like ScanQA, 3D dense captioning, and 3D open-vocabulary detection.
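Since text instructions are required and visual prompts optional, the prompt assembly can be sketched as below. This is a minimal illustration, not the project's actual code: the `<obj>...</obj>` token format, the fixed scene extent, and both function names are assumptions.

```python
def encode_box_coords(box, scene_extent=10.0):
    """Format a 3D box (cx, cy, cz, w, h, d) as a compact text token.

    Coordinates are normalized by an assumed scene extent and rounded to
    two decimals; the <obj>...</obj> format is illustrative only.
    """
    normed = [v / scene_extent for v in box]
    return "<obj>" + ",".join(f"{v:.2f}" for v in normed) + "</obj>"


def build_prompt(question, boxes=None):
    """Prepend optional visual prompts (encoded boxes) to the required text instruction."""
    prompt = question
    if boxes:  # visual prompts are optional; omit them for text-only evaluation
        prompt = " ".join(encode_box_coords(b) for b in boxes) + " " + prompt
    return prompt


# Text-only interaction (as in most tables):
print(build_prompt("What is on the table?"))
# With a test-time visual prompt (as in Table 8):
print(build_prompt("What is on the table?", boxes=[(1.0, 2.0, 0.5, 0.8, 0.8, 1.2)]))
```

Gating the box encoding behind a single optional argument like this is one simple way to switch between visual-prompt and text-only evaluation from a script flag.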
OK, thank you for your answer 😊
Hello there! I'm interested in your work, but I'm getting somewhat different numbers when reproducing the results in the paper, so I'd like to ask you about it.