OpenRobotLab / Grounded_3D-LLM

Code&Data for Grounded 3D-LLM with Referent Tokens
https://groundedscenellm.github.io/grounded_3d-llm.github.io/

Explanation for data format and issues about data generation #5

Open Germany321 opened 3 weeks ago

Germany321 commented 3 weeks ago

Thanks for your interesting work. I visualized the grounded scene caption data and noticed a key called 'all_phrases_positions'. What does it mean? I guess the numerical values represent indices after tokenizing the text prompt, and that you replace the text embeddings at the corresponding indices with object tokens. Another question: how do you define the range of the placeholder, given that a phrase may contain adjective words, such as 'a table' or 'a chair with four legs'?

ZzZZCHS commented 3 weeks ago

Hi, thank you for your interest!

all_phrases_positions contains a list of intervals giving the start and end character indices of each annotated phrase within the description string. object_ids records the object IDs associated with each phrase. Below is an example of how to use them:

description = "In the room, a dark-colored cabinet with glass doors stands elegantly with its curved top, near a smooth rectangular wooden table with four legs, and surrounded by five chairs of distinctive dark upholstery or sleek designs."
all_phrases_positions = [[13, 52], [96, 144], [164, 175]]
object_ids = [[2], [5], [8, 29, 30, 31, 32]]

# Print each annotated phrase alongside its associated object IDs.
for phrase_pos, obj_ids in zip(all_phrases_positions, object_ids):
    print(description[phrase_pos[0]:phrase_pos[1]], obj_ids)

"""
OUTPUT:
a dark-colored cabinet with glass doors [2]
a smooth rectangular wooden table with four legs [5]
five chairs [8, 29, 30, 31, 32]
"""

We prompted GPT-4 to ensure that each placeholder covers the whole phrase, including adjective words.
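If you need token-level spans (as hypothesized in the question), the character intervals above can be mapped onto tokenizer offsets. Below is a minimal sketch using a plain whitespace tokenizer for illustration, not the project's actual tokenizer; with a real subword tokenizer you would use its character-offset mapping instead:

```python
import re

description = "In the room, a dark-colored cabinet with glass doors stands elegantly with its curved top, near a smooth rectangular wooden table with four legs, and surrounded by five chairs of distinctive dark upholstery or sleek designs."
all_phrases_positions = [[13, 52], [96, 144], [164, 175]]

# Whitespace tokens with their (start, end) character offsets.
tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", description)]

token_spans = []
for start, end in all_phrases_positions:
    # A token belongs to the phrase if its character span overlaps [start, end).
    idx = [i for i, (s, e) in enumerate(tokens) if s < end and e > start]
    token_spans.append((idx[0], idx[-1] + 1))

print(token_spans)  # half-open token index ranges, one per phrase
```

Note that whitespace tokens keep trailing punctuation (e.g. "legs,"), so a token may extend slightly past the character interval; overlap-based matching handles this.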

Germany321 commented 3 weeks ago

Thanks for your reply. Now I understand the format clearly.