I haven't met this problem before. In my experience, it could be related to the LLM (vicuna) or the version of the transformers package. Could you check your transformers version (we used `transformers==4.28.1`)?
I see `llama_model_path: model/vicuna-7b-delta-v0` in your config. Have you used `apply_delta.py` to process `vicuna-7b-delta-v0` into `vicuna-7b-v0`?
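For reference, the conversion conceptually just adds the released delta weights to the LLaMA v1 base weights. Below is a minimal sketch of the idea (illustrative paths, not the exact `apply_delta.py` implementation, and skipping the tokenizer/vocab-size handling the real script does):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the LLaMA v1 base weights and the vicuna delta weights (paths are illustrative).
base = AutoModelForCausalLM.from_pretrained("model/llama-7b", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained("model/vicuna-7b-delta-v0", torch_dtype=torch.float16)

# vicuna = llama + delta, tensor by tensor.
base_sd = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_sd and param.shape == base_sd[name].shape:
        param.data += base_sd[name]
    # Tensors whose shapes differ (e.g. embeddings extended for new tokens) are
    # left as-is here; the real apply_delta.py handles them properly.

delta.save_pretrained("model/vicuna-7b-v0")
```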
Oh I see, I mistakenly assumed that running `apply_delta.py` was not necessary, so pointing `llama_model_path` at the delta weights would not work.
Could you please share the LLaMA v1 weights? The v1 weights don't seem to be available from the hugging face page you shared; I could only get v2. Also, below is the result I got by executing run.sh. Something seems to be running, but it's very slow and I don't know what's going on. Could you help me understand how the evaluation script works?
I don't have LLaMA v1's weights right now because I deleted them after transforming them into vicuna v0. Could you try this unofficial huggingface link? I found it in LLaMA-Adapter's repo. Hope it works. If it still doesn't work, maybe I need to find a way to directly share the vicuna-7b-v0 weights.
It seems that you directly loaded the LLaMA v2 weights for evaluation and it runs. But since the LLaMA v2 weights are not consistent with our provided pretrained weights, the predicted results are random/meaningless words:
The ScanRefer validation set contains 9508 samples. With batch size 1, you need 9508 iterations to evaluate all of them.
It shows that it takes 26:48 to evaluate 1311 iterations, which means it will take over 3 hours to evaluate all the samples. That's really slow...
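As a quick sanity check on that estimate (just arithmetic from the numbers above):

```python
# 1311 iterations finished in 26:48 (mm:ss).
secs_per_iter = (26 * 60 + 48) / 1311          # ~1.23 s per iteration
total_hours = 9508 * secs_per_iter / 3600      # ~3.2 hours for all 9508 samples
print(f"{secs_per_iter:.2f} s/iter -> ~{total_hours:.1f} h total")
```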
Here is what you can do to accelerate it:

- Set `max_txt_len` to a lower value here, for example `max_txt_len=16` or `max_txt_len=8` (see the sketch below).

Anyway, you need to load the correct LLM weights (vicuna-7b-v0) first. Otherwise, it's meaningless to run this code.
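A rough sketch of that change, assuming `max_txt_len` is exposed in the evaluation config (the exact file and variable location may differ in your setup):

```python
# Hypothetical config override: wherever max_txt_len is defined,
# lower it so the LLM decodes fewer tokens per sample.
max_txt_len = 16   # e.g. 16 or 8; for grounding the answer is only "ObjXX."
```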
With the weights from the link you shared, I failed to convert them to HF format, as shown below.
I've tried several other repos, but I keep failing at the conversion. Could you suggest another method?
Also, how can you tell that the current predicted results are meaningless? What should the expected results look like? I'd like to know how we can check the results qualitatively using the output json file. Is there any visualization tool, or could you guide us on how to use the output json file?
Thanks for your considerate help!
I've uploaded the vicuna-7b-v0 to huggingface. You can download and directly load it (no need to use `apply_delta.py`).
For the grounding task, the expected result is something like "Obj17." (a natural sentence that only contains the object id). This output id refers to the instance id of the pointgroup-segmented instances. Then we calculate the IoU between the predicted instance and the GT instance (in calc_scanrefer_grounding_acc.py).
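Roughly, the scoring works like the sketch below (illustrative names; the actual logic lives in calc_scanrefer_grounding_acc.py):

```python
import re

def parse_pred_id(sentence: str):
    """Pull the object id out of an answer like 'Obj17.'."""
    m = re.search(r"[Oo]bj\s*(\d+)", sentence)
    return int(m.group(1)) if m else None

pred_id = parse_pred_id("Obj17.")  # -> 17, an id among the pointgroup-segmented instances
# The IoU between segmented instance `pred_id` and the GT instance then decides
# whether the sample counts as correct, e.g. under Acc@0.25 / Acc@0.5.
```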
Oh, thanks for your support. I'll try it right now. I also found this repo and am trying to adapt it to this project. I'll let you know once it's done.
How can we tell which object in the 3D scene "Obj17" refers to? Are the ids pre-defined in the dataset?
Regarding acceleration, my GPU memory is 48GB and there are 3 different batch sizes. To maximize inference speed, which batch size is dominant?
Thanks for your help!
The predicted object id corresponds to the object attributes saved in `annotations/scannet_pointgroup_val_attributes.pt`. For example, if the predicted object id is 17 and the scene id is scene0011_00, then you can get the object's location and class label by:
```python
import torch

# Load the per-scene attributes of the pointgroup-segmented instances.
attrs = torch.load('annotations/scannet_pointgroup_val_attributes.pt')
locations = attrs['scene0011_00']['locs'][17]       # (center_x, center_y, center_z, size_x, size_y, size_z)
class_label = attrs['scene0011_00']['objects'][17]  # class label of the same instance (check the exact key name in the .pt file)
```
These attribute annotations are extracted from pointgroup's predicted results (instance masks and labels). The pointgroup results are quite large (over 30GB), so we didn't release them. You can follow pointgroup's repo to run inference with their pretrained weights.
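For intuition, each entry in 'locs' can be thought of as an axis-aligned box computed from the points of one predicted instance mask, roughly like this (an illustrative sketch, not our exact preprocessing code):

```python
import numpy as np

def instance_loc(points: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """points: (N, 3) xyz of the scene; mask: (N,) bool mask of one predicted instance."""
    pts = points[mask]
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    # (center_x, center_y, center_z, size_x, size_y, size_z), matching the 'locs' format above
    return np.concatenate([(mins + maxs) / 2, maxs - mins])
```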
You can change `s2_batch_size`.
With the weights you shared, I got the following collision warning when cloning the repo with git.
I'm not sure whether they are corrupted or not. While checking, I also see a warning in red, as below.
With the weights from this repo, I got the result below. Can you confirm whether it matches the expected result in your paper?
One of the sample results is shown here. As far as I understand, the test scene id is scene0011_00 and the prompt is the description from ScanRefer. 'pred' might be the id predicted by the model and 'ref_captions' might be the gt id (please correct me if I'm wrong). So if these ids match, the prediction is correct; otherwise it's wrong. Then what is 'obj_id' for?
Thanks for sharing the code snippet below for checking the ids.
```python
import torch

attrs = torch.load('annotations/scannet_pointgroup_val_attributes.pt')
locations = attrs['scene0011_00']['locs'][17]       # (center_x, center_y, center_z, size_x, size_y, size_z)
class_label = attrs['scene0011_00']['objects'][17]  # class label of the same instance
```
However, it would be much better to visualize the predicted/gt results with their ids in 3D, like in your paper, to analyze and debug the method in depth.
Is there any sample debugging code available? Or could you share how you analyze your method?
It seems that it is working properly now with the repo you found.
I'm not sure why there is a collision warning for the weights I shared... Is this run using my shared weights? If so, I think they are also working now (just ignore the warning).
`pred` is the generated/predicted result from the language model.
For clarity, we use "segmented instances" to denote the predicted instances from pointgroup, and "GT instances" for the ground-truth instances from the scannet annotations.
`obj_id` is the id of the GT instance. `ref_captions` is the id of the segmented instance which has the maximum IoU with the GT instance (i.e. the segmented instance most similar to the GT instance). So the `Acc` metric here can roughly represent the grounding accuracy. For the grounding task, we usually use `Acc@m` to assess the model's performance: if the predicted instance and the GT instance have IoU >= m, they are considered matched. You can use calc_scanrefer_grounding_acc.py to calculate `Acc@0.25` and `Acc@0.5`.
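In code, the metric is essentially the following (a simplified sketch; the real computation is in calc_scanrefer_grounding_acc.py):

```python
def acc_at_m(ious, m):
    """ious: one IoU per sample, between the predicted instance's box and the GT box."""
    return sum(iou >= m for iou in ious) / len(ious)

# acc_at_m(ious, 0.25) -> Acc@0.25,  acc_at_m(ious, 0.5) -> Acc@0.5
```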
For visualization, I've uploaded example code here. (You need to download the scannet data following their repo to visualize the scene mesh.)
By running this code, you will get some ply files under the `vis/<scene_id>` folder. Use meshlab to visualize these ply files. You will get something like this:
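If you don't have MeshLab at hand, something like open3d can also render the exported ply files; a minimal sketch, with the folder name taken from the example above:

```python
import glob
import open3d as o3d

# Load every exported ply under vis/<scene_id> and show them together.
# (Use o3d.io.read_point_cloud instead for any file that is a point cloud rather than a mesh.)
geoms = [o3d.io.read_triangle_mesh(p) for p in glob.glob("vis/scene0011_00/*.ply")]
o3d.visualization.draw_geometries(geoms)
```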
It appears that the results from both weights are exactly the same, although the warning still shows up.
I am still confused about the IDs in `obj_id` and `ref_captions`. As you explained, `obj_id` refers to the id of the GT instance, and `ref_captions` refers to the id of the segmented instance with the maximum IoU with the GT. From this, I would expect that when the LLM predicts the ID correctly and the segmented instance has the maximum IoU with the GT instance, the predicted ID should be the same as the GT ID.
In the example below, obj_id=1 (GT ID), pred=Obj19 (ID predicted by the LLM), ref_captions=Obj19 (segmented instance ID with the maximum IoU with the GT). The predicted and reference ids are both 19 (so, correctly predicted?), but the original GT ID is 1. Why are these IDs different?
I think you are applying a segmentation algorithm to assign a unique ID to each segmented object, and these IDs are then processed together with 3D geometric features and object attributes through the 3D encoder, 3D-language projection, relation module and language model. Are the initially assigned IDs unique throughout this whole process? And are they assigned uniquely in the other downstream tasks as well?
GT instances and segmented instances are assigned IDs under two separate groups. For example, in scene0011_00 there are 33 GT instances with IDs from 0 to 32:
While there are 27 segmented instances with IDs from 0 to 26:
We cannot directly compare IDs between these two groups. To compare `pred` (a segmented ID) with `obj_id` (a GT ID), you need to calculate their IoU like this.
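For example, with boxes in the (center, size) format from the attributes file, an axis-aligned 3D IoU can be computed roughly like this (a sketch, not the repo's exact code):

```python
import numpy as np

def box_iou(a, b):
    """a, b: (cx, cy, cz, sx, sy, sz) axis-aligned boxes, e.g. the 'locs' entries above."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = a[3:].prod() + b[3:].prod() - inter
    return inter / union
```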
Oh I see, now I get it. So, once obj_id and pred are the same, it would mean chat-3d-v2 predicted correctly.
I'd like to try the whole pipeline, from the initial processing of 3D scans in PointGroup to the final estimation from the LLM. Are you planning to share the TODO preparation part for extracting instances with PointGroup within a few weeks? Since it is a two-stage grounder, the overall performance presumably relies on the initial feature extractor. Do you think it would help to substitute PointGroup with another SOTA 3D feature extractor?
> So, once obj_id and pred are the same, it would mean chat-3d-v2 predicted correctly.
`obj_id` is for GT instances, while `pred` is for segmented instances, so they are not comparable. You can say it predicts correctly when `ref_captions` and `pred` are the same. But the exact accuracy of the predicted instance depends on the quality of the pretrained segmentor, so it's better to calculate the IoU and use metrics like `Acc@0.5` to evaluate the accuracy.
Actually we have recently replaced PointGroup with Mask3D (a stronger instance segmentor). We will update a refined version in this repo soon, as well as the preparation part.
> You can say it predicts correctly when `ref_captions` and `pred` are the same.
I made a mistake; this is exactly what I meant. Thanks a lot for correcting me!
> Actually we have recently replaced PointGroup with Mask3D (a stronger instance segmentor). We will update a refined version in this repo soon, as well as the preparation part.
Can't wait to see the new results! Really looking forward to it :) I've been doing research in robotics navigation, and I'd like to refer to your method.
Thanks for sharing your great work!
Thank you for your interest in our work~
Hello,
I followed your guidance step by step, modifying config.py and run.sh. When I run `./scripts/run.sh`, I get the following multiprocessing error in llama_tokenizer_decode(). Could you help me handle this issue?