Open Germany321 opened 3 weeks ago
Hi, thank you for your interest!
all_phrases_positions contains a list of intervals that give the start and end character indices (end exclusive) of each annotated phrase within the description string. object_ids records the object IDs associated with each phrase, in the same order. Below is an example of how to use them:
description = "In the room, a dark-colored cabinet with glass doors stands elegantly with its curved top, near a smooth rectangular wooden table with four legs, and surrounded by five chairs of distinctive dark upholstery or sleek designs."
all_phrases_positions = [[13, 52], [96, 144], [164, 175]]
object_ids = [[2], [5], [8, 29, 30, 31, 32]]
# Each interval is a [start, end) slice into the description string.
for phrase_pos, obj_ids in zip(all_phrases_positions, object_ids):
    print(description[phrase_pos[0]:phrase_pos[1]], obj_ids)
"""
OUTPUT:
a dark-colored cabinet with glass doors [2]
a smooth rectangular wooden table with four legs [5]
five chairs [8, 29, 30, 31, 32]
"""
We prompted GPT-4 to make sure that each placeholder covers the whole phrase, including adjectives.
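If it helps when loading the data, here is a minimal sketch of a helper that pairs each phrase with its object IDs and checks the intervals first. The function name pair_phrases_with_objects and the toy sample are made up for illustration and are not part of the released code; the only assumption carried over from the example above is that the intervals are character offsets with an exclusive end.

```python
from typing import List, Tuple


def pair_phrases_with_objects(
    description: str,
    all_phrases_positions: List[List[int]],
    object_ids: List[List[int]],
) -> List[Tuple[str, List[int]]]:
    """Return (phrase, object IDs) pairs, validating each interval first.

    Hypothetical helper, not taken from the dataset's own tooling.
    """
    if len(all_phrases_positions) != len(object_ids):
        raise ValueError("expected one object-ID list per phrase interval")

    pairs = []
    for (start, end), ids in zip(all_phrases_positions, object_ids):
        # Intervals are treated as character offsets, half-open: [start, end).
        if not 0 <= start < end <= len(description):
            raise ValueError(f"interval [{start}, {end}] is out of range")
        pairs.append((description[start:end], ids))
    return pairs


# Made-up toy data, only to show the call shape.
sample = "a red chair near a table"
pairs = pair_phrases_with_objects(sample, [[0, 11], [17, 24]], [[3], [7]])
for phrase, ids in pairs:
    print(phrase, ids)
```

The length and range checks are there only to catch a description and its intervals getting out of sync, for example after the text has been stripped or re-encoded.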
Thanks for your reply. Now I understand the format clearly.
Thanks for your interesting work. I visualized the grounded scene caption data and noticed a key called 'all_phrases_positions'. What does it mean? My guess is that the numerical values are indices into the tokenized text prompt, and that you replace the text embeddings at those indices with the object tokens. Another question: how do you define the range of the placeholder, given that there will be some adjective words such as ' table' or 'a chair <with four legs'?