Thanks for your great work! I have the following questions:
I see you have aggregated multi-view DINOv2 features into one object. For each object, how many tokens do you finally use to represent the 3D proposal and the multi-view 2D features, respectively? I assume the 3D tokens and 2D tokens are then concatenated and sent as inputs to the LLM. Is that right? Can the number of 3D and 2D tokens be changed in the current code?
Could you provide the inference time of the Chat-Scene model, or do you know how to compare the inference time of Chat-Scene against other 3D-LLMs?
Looking forward to your reply~ Thank you very much.
For each object, we use one token for the 3D feature and one token for the 2D feature. The prompt for the LLM looks like this: `<OBJ000> <3d_feat_0> <2d_feat_0> <OBJ001> <3d_feat_1> <2d_feat_1> ... <OBJ099> <3d_feat_99> <2d_feat_99>`. Currently we fix the number of detected objects to 100 per scene, so the token usage is 3 * 100 = 300 tokens to represent the whole scene (see the sketch below).
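To make the layout concrete, here is a minimal sketch (not the exact Chat-Scene code) of how such a prompt could be assembled; the helper name and placeholder strings are illustrative:

```python
NUM_OBJECTS = 100  # fixed detected-object count per scene

def build_scene_prompt(num_objects: int = NUM_OBJECTS) -> str:
    """Interleave one identifier token with one 3D and one 2D feature
    placeholder per object; the placeholders are later replaced by the
    projected 3D / 2D embeddings before being fed to the LLM."""
    parts = []
    for i in range(num_objects):
        parts.append(f"<OBJ{i:03d}> <3d_feat_{i}> <2d_feat_{i}>")
    return " ".join(parts)

prompt = build_scene_prompt()
# Token usage: 3 tokens per object * 100 objects = 300 tokens per scene.
```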
I haven't tested the inference time before. How do you define inference time? The model forward time for one batch? Does it include the preprocessing time (instance segmentation and feature extraction)? I can run a test for you if you give me a clear definition.
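For reference, one common definition is forward-only time on pre-extracted inputs, excluding preprocessing. A minimal sketch of how that could be measured with PyTorch (the `model` and `inputs` here are placeholders, not from the repo):

```python
import time
import torch

@torch.no_grad()
def time_forward(model, inputs, warmup: int = 3, runs: int = 10) -> float:
    """Average forward time per batch in seconds, excluding preprocessing."""
    for _ in range(warmup):      # warm-up passes so CUDA kernels are compiled
        model(**inputs)
    torch.cuda.synchronize()     # ensure all queued GPU work has finished
    start = time.perf_counter()
    for _ in range(runs):
        model(**inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```

Comparisons against other 3D-LLMs would only be fair if they all use the same definition (same batch size, same inclusion or exclusion of segmentation and feature extraction).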