MaverickRen / PixelLM

PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. PixelLM was accepted to CVPR 2024.
Apache License 2.0

'answers' is not complete in MUSE_train.json #14

Open wusize opened 3 months ago

wusize commented 3 months ago

Hi! Thanks for making the MUSE dataset public.

However, I find that the annotations in MUSE_train.json are incomplete: 'answers' is only a list of segmentation masks. To my understanding, there should also be an answer sentence whose object nouns are grounded to the masks. (Screenshot attached.)

MaverickRen commented 3 months ago

Thank you for your question. As mentioned in the paper's appendix, some of the answers in the MUSE data only include detailed descriptions of the target objects, which are generated by GPT4V. When using this part of the data, the text of the answer will be directly composed of these descriptions. The rest of the data is composed in the way you mentioned.

wusize commented 3 months ago

Thanks for the quick reply! I found such answers in the json. The grounded phrases are followed by a '{seg}' token or a bounding box list. However, is there any way to associate a whole phrase with its corresponding mask?

For example, given the answer text

"The individual can settle into the cozy armchair [314.06, 252.31, 130.87, 81.69] in the room. For attire, there is a choice between the polo shirt with broader dimensions [259.36, 165.33, 59.45, 106.72] and the smaller polo shirt [129.59, 99.22, 147.5, 149.47]."

how can we automatically extract "the cozy armchair", "the polo shirt with broader dimensions", and "the smaller polo shirt" from the text while knowing which mask annotations they are associated with?
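As a starting point for the question above, here is a minimal sketch that locates the four-number lists in an answer string and returns each one together with the text that precedes it. It assumes the bracketed lists always contain exactly four numbers, as in the quoted example; the function name `split_answer_by_boxes` is hypothetical and not part of the PixelLM code. It does not recover exact phrase boundaries, since the text does not mark where a phrase starts, and it does not cover answers that use the '{seg}' token.

```python
import re

# Matches a bracketed list of exactly four numbers,
# e.g. [314.06, 252.31, 130.87, 81.69] (format assumed from the example).
BOX_PATTERN = re.compile(
    r"\[\s*(-?\d+(?:\.\d+)?(?:\s*,\s*-?\d+(?:\.\d+)?){3})\s*\]"
)


def split_answer_by_boxes(answer):
    """Return (preceding_text, box) pairs in order of appearance.

    preceding_text is everything between the previous box (or the start
    of the string) and the current box, so the grounded phrase is only
    its tail, not the whole string.
    """
    pairs = []
    cursor = 0
    for match in BOX_PATTERN.finditer(answer):
        preceding = answer[cursor:match.start()].strip()
        box = [float(value) for value in match.group(1).split(",")]
        pairs.append((preceding, box))
        cursor = match.end()
    return pairs


answer = (
    "The individual can settle into the cozy armchair "
    "[314.06, 252.31, 130.87, 81.69] in the room. For attire, there is a "
    "choice between the polo shirt with broader dimensions "
    "[259.36, 165.33, 59.45, 106.72] and the smaller polo shirt "
    "[129.59, 99.22, 147.5, 149.47]."
)

pairs = split_answer_by_boxes(answer)
for preceding, box in pairs:
    print(repr(preceding), box)
```

Whether the extracted boxes can be matched to entries in the 'answers' mask list (for example by order of appearance) is not confirmed anywhere in this thread, so that step would need to be checked against the data.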

Divyanshsingh1910 commented 1 month ago

@wusize Could you please share which indexes of MUSE_train.json have these complete answers? I am unable to find any of them. Also, did you find a way to recover the text and segmentation mask pairs from the answer types shown above?

Thanks!