Closed 2a3b4c closed 7 months ago
Hi, projection_matrix (768*768) is the CLIP projection matrix, which should be the weight.data of the Linear layer defined in CLIP, with shape (out_dim, in_dim). We didn't actually use the image embedding, so you can comment out this line. Let me know if you have further questions.
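For anyone reading later, here is a rough sketch of what "weight.data of a Linear layer" looks like in PyTorch. It does not load CLIP: `stand_in_projection` is a hypothetical 768-to-768 layer standing in for CLIP's projection, and `extract_projection_matrix` is an illustrative helper, not a function from this repo.

```python
import torch
from torch import nn


def extract_projection_matrix(layer: nn.Linear) -> torch.Tensor:
    """Return a copy of a Linear layer's weight, shaped (out_dim, in_dim)."""
    return layer.weight.data.clone()


# Assumption: a bias-free 768 -> 768 layer standing in for CLIP's projection.
stand_in_projection = nn.Linear(768, 768, bias=False)
projection_matrix = extract_projection_matrix(stand_in_projection)

# Applying the layer is equivalent to multiplying by the transposed weight,
# which is why the stored matrix is laid out as (out_dim, in_dim).
x = torch.randn(2, 768)
projected = x @ projection_matrix.T
```

With a real CLIP checkpoint you would pass the model's own projection layer instead of the stand-in. How the resulting tensor should be saved for this repo is not stated in the thread, so that part is left out.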
Update: I have pushed the code with this line commented out. Please let me know if you run into other issues.
I reviewed the decoded, processed data and found that the caption for every segmentation is empty. Is that normal?
I don't think so, except in cases where all instances are quite small (smaller than 32x32). You can modify this line to if area >= 0*0: to include instance captions for all instances. However, be aware that captions for very small instances might be less accurate. Alternatively, using the category name as the instance caption is a simpler option that might also work.
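A minimal sketch of the area check described above, assuming the default threshold is 32*32 and that each instance is a dict. The field names `area`, `caption`, and `category_name`, and the function `select_instance_captions`, are hypothetical and not taken from the repo.

```python
def select_instance_captions(instances, min_side=32, fallback_to_category=False):
    """Keep captions only for instances whose area is at least min_side * min_side.

    Smaller instances get an empty caption, or the category name when
    fallback_to_category is True.
    """
    min_area = min_side * min_side
    captions = []
    for inst in instances:
        if inst["area"] >= min_area:
            captions.append(inst["caption"])
        elif fallback_to_category:
            captions.append(inst["category_name"])
        else:
            captions.append("")
    return captions


# Made-up example data: one instance below 32x32 and one above it.
instances = [
    {"area": 20 * 20, "caption": "a tiny bird", "category_name": "bird"},
    {"area": 64 * 64, "caption": "a red car parked on the street", "category_name": "car"},
]

default_captions = select_instance_captions(instances)
all_captions = select_instance_captions(instances, min_side=0)
fallback_captions = select_instance_captions(instances, fallback_to_category=True)
```

Setting `min_side=0` mirrors the `area >= 0*0` change, and `fallback_to_category=True` mirrors the simpler option of using the category name for small instances.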
Thanks for your answer. I processed the COCO dataset, and the key "is_det" in the data is 0, so the caption information cannot be obtained from the decoded data. I just wonder what the key "is_det" means, and what o365 means in the corresponding comment "# if it is from detection (such as o365), then we will make a pseudo caption"?
If is_det is 0, the model will use the generated instance captions. Otherwise, it will use the category name (from ground-truth or an object detection model) as the pseudo instance caption.
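The is_det rule can be sketched roughly like this. It only illustrates the branching described above; the record layout and the names `generated_caption`, `category_name`, and `choose_instance_caption` are assumptions, not the repo's actual code.

```python
def choose_instance_caption(record):
    """Pick the caption source for one instance based on its is_det flag."""
    if record["is_det"] == 0:
        # Not from a detection dataset: use the generated instance caption.
        return record["generated_caption"]
    # From detection data: use the category name as a pseudo caption.
    return record["category_name"]


# Made-up records for illustration.
captioned_record = {
    "is_det": 0,
    "generated_caption": "a brown dog lying on a sofa",
    "category_name": "dog",
}
detection_record = {"is_det": 1, "generated_caption": "", "category_name": "dog"}

caption_from_generated = choose_instance_caption(captioned_record)
caption_from_category = choose_instance_caption(detection_record)
```

Under this reading, a record with is_det equal to 0 and no generated caption would come out empty, which matches the empty captions reported earlier in the thread.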
How do I get the "projection_matrix" file?