Hi, sorry for the late reply.
Yes, you can directly follow Step 4 to fine-tune after_scene_align.pth on your data. I think simply extending the "caption" to "Obj00. Obj01. Obj02." would work, but a better way is to make the caption a natural sentence like "They are obj00, obj01, and obj02." (for multiple objects) or "It's obj00." (for a single object). Then you can post-process these raw outputs and extract the predicted object ids from the sentence. Also, remember to change the prompt to something like: "Please find all possible objects that match this description. List their object IDs." A good prompt may significantly improve performance. You can extend related_ids to something like [0, 1, 2]. Just make sure it is a list of valid ids; actually, it currently doesn't affect the model's forward process.
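For illustration, here is a rough sketch of what such an extended annotation entry and the post-processing could look like (the field names mirror the ones discussed here, but the exact keys in the released JSON files may differ, and the regex is just one possible way to parse the answer):

```python
import re

# Hypothetical extended annotation entry in the style of
# scanrefer_train_stage2_grounding.json (exact keys in the released file may differ).
entry = {
    "scene_id": "scene0000_00",
    "obj_id": 0,                                   # one of the targets, kept for compatibility
    "related_ids": [0, 1, 2],                      # all target ids
    "caption": "They are obj00, obj01, and obj02.",
    "prompt": "Please find all possible objects that match this description. "
              "List their object IDs.",
}

def extract_object_ids(answer: str) -> list[int]:
    """Parse a raw model output like 'They are obj00, obj01, and obj02.' into ids."""
    return [int(m) for m in re.findall(r"obj\s*(\d+)", answer.lower())]

print(extract_object_ids("They are obj00, obj01, and obj02."))  # [0, 1, 2]
```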
This is an interesting question. I don't think the language model can directly output a confidence score, but here is a way to bypass that. Suppose there are 50 objects in the scene. Enumerate the sentences from "It's obj00." to "It's obj49." and calculate the likelihood that the language model generates each of them. Then you get a ranking of all the objects and can compute the AP score.
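A generic sketch of that likelihood ranking with a Hugging Face causal LM is below. Chat-3D v2 wraps its own LLM together with scene/object tokens, so you would adapt the scoring to its forward pass; the model name and prompt here are just stand-ins.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
model.eval()

prompt = ("Please find all possible objects that match this description. "
          "List their object IDs. ")

@torch.no_grad()
def candidate_score(candidate: str) -> float:
    """Average log-likelihood of the candidate tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100                # ignore the prompt tokens in the loss
    loss = model(input_ids=full_ids, labels=labels).loss
    return -loss.item()                          # higher = more likely

num_objects = 50
scores = {i: candidate_score(f"It's obj{i:02d}.") for i in range(num_objects)}
ranking = sorted(scores, key=scores.get, reverse=True)   # feed this ranking into AP computation
```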
Thank you for your comprehensive response! It does help me a lot.
Hi, I have some follow-up questions about fine-tuning your model.

In 3.4 and 5.1, you state that you train the projectors, the relation module, and the language model when fine-tuning on downstream tasks. However, I find that here you statically freeze the LLaMA model. So what do you mean by training the language model? Am I missing something?

Also, I find that in Stage 4 of your GitHub's Training and Inference part, you have included scanrefer_pointgroup_train_stage2_grounding.json. Would you mind explaining a little about this file? Since my private dataset doesn't have ground truth labels for instance segmentation, I wonder if I need to generate a PointGroup version (scanrefer_pointgroup_train_stage2_grounding.json) of my own dataset.
Our original implementation of fine-tuning the language model simply tuned the last several transformer layers (here). However, we found that the model can achieve similar performance without fine-tuning the language model, so the currently released code removes this part. We are working on using a newer version of Vicuna / LLaMA and trying a LoRA-based tuning method, which will be released in the future.
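In case it helps while the LoRA version is pending, a rough sketch of the "tune only the last several transformer layers" idea on a generic Hugging Face LLaMA/Vicuna checkpoint might look like this (the repo organizes its model differently, so this is illustrative only):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Freeze everything, then unfreeze only the last few decoder layers.
for param in model.parameters():
    param.requires_grad = False

num_trainable_layers = 2        # "last several" layers; choose based on GPU memory
for layer in model.model.layers[-num_trainable_layers:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")
```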
The model cannot directly use ground truth instance labels during inference, so you need to use an instance segmentor to extract objects from a scene. You can use PointGroup as we did, or use stronger segmentors (SoftGroup, OneFormer3D) for better performance. (The extracted instances are indexed from obj00 to obj n.) Then you need to extract each object's feature using a pretrained encoder (Uni3D / ULIP-2 / ...); these features are stored in the file scannet_pointgroup_uni3d_feats.pt.
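For a custom dataset, building an analogous feature file could look roughly like this (the key format and feature dimension are assumptions, and encode_object() is a placeholder for the actual Uni3D / ULIP-2 encoder):

```python
import torch

def encode_object(points: torch.Tensor) -> torch.Tensor:
    # Placeholder: replace with the real pretrained encoder's forward pass.
    return torch.randn(1024)

# Dummy example: one scene with three PointGroup instances (xyz + rgb per point).
segmented_scenes = {"scene0000_00": [torch.randn(n, 6) for n in (500, 800, 300)]}

feats = {}
for scene_id, instances in segmented_scenes.items():
    for obj_id, points in enumerate(instances):
        feats[f"{scene_id}_{obj_id:02d}"] = encode_object(points)

torch.save(feats, "my_dataset_pointgroup_uni3d_feats.pt")
```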
In scanrefer_pointgroup_val_stage2_grounding.json, the scene_id, obj_id, and prompt fields are directly derived from the ScanRefer val annotations (note that obj_id corresponds to ground truth instances). You may notice that the ref_captions are not consistent with obj_id here, but you can just ignore them because they are not used in the calculation of the final score.
scanrefer_pointgroup_train_stage2_grounding.json is used for fine-tuning. Different from the inference file, the obj_id and ref_captions here correspond to the instances extracted by PointGroup. This is decided by calculating IoUs between each extracted instance and the ground truth instance; the one with max IoU is considered the right answer (the instance matching the caption). To enhance the quality of training, we filter out annotations whose max IoU is less than 0.75 (or 0.5, I can't remember clearly).
But both of these train/val files need the original ground truth instance-caption pairs (like ScanRefer). You say you don't have ground truth labels of instance segmentation, which is a little confusing... What's the format of your original annotations for grounding?

Thanks a lot! I think I get your point now. Let me clarify a little bit.
My annotations are exactly like ScanRefer's, except that each description has multiple target ids. So when I said I don't have ground truth labels of instance segmentation, I meant that I cannot input ground truth segmentation labels to the model (different from Sr3D/Nr3D, but the same as ScanRefer). Sorry for the confusion. My bad.

Anyway, I think I get your point for constructing scanrefer_pointgroup_train_stage2_grounding.json. What I need to do is calculate the IoUs between the bbox of each instance (segmented by PointGroup) and the ground truth one in my annotation, then re-assign the obj id in my annotation file (which is the id from ScanNet) to the id of the matched instance (i.e., the label / order assigned to the PointGroup predictions). Right?
Yeah, that's right~ And for training, it's kind of necessary to remove those annotations with low IoUs. Otherwise it may harm the performance.
Thank you for your patient reply! Looking forward to your LoRA fine-tuning release!
Thanks for your great work in the 3D-LLM field!

I am now trying to fine-tune your model on my own 3D visual grounding dataset. It's a private dataset in which each language description has more than one target. I have two questions on how to fine-tune your model on my dataset.

1) For fine-tuning, should I directly follow "Step 4: Fine-tuning on Grounding Task", using the after_scene_align.pth checkpoint? Since my dataset has more than one target per text, is it right to still follow the format of "scanrefer_train_stage2_grounding.json" but simply extend its "caption", like "Obj00. Obj01. Obj02." for the case of three targets, and also extend its "related_ids", like [0, 1, 2]?

2) Is it possible for Chat-3D v2 to output multiple objects' predictions with confidence scores? I might need to calculate Average Precision on those results.