BAAI-DCAI / SpatialBot

The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models."
MIT License

hi~ how do you achieve DepthAPI? #10

Closed yun189 closed 2 months ago

yun189 commented 2 months ago

I cannot find any code for the DepthAPI in the project.

RussRobin commented 2 months ago

Hi @yun189 , thank you for your interest in SpatialBot.

  1. In training, if the model is asked about a depth value, we design two QA types: (a) Answer directly, i.e. the model needs to read the depth information straight from the depth map input; (b) Call the Depth API, e.g. "The depth value of point (0.51,0.40) is Depth(0.51,0.40)." In the following conversation turn, the API answers "Depth(0.51,0.40)=..." and the model uses that depth value to do the subsequent tasks.

  2. In evaluation, if you want the model to use the depth map directly, you can prompt it with 'Answer directly with depth map'. Otherwise, the model will choose by itself whether or not to use the API. If it does use the DepthAPI, you only need to tell it the corresponding depth value. Sample code:

    
    import re

    def extract_depth_api(response):
        # Find every Depth(x, y) call in the model response, where x and y
        # are normalized coordinates in [0, 1].
        pattern = r"Depth\(([\d.]+),\s*([\d.]+)\)"
        matches = re.findall(pattern, response)

        if matches:
            return [(float(m[0]), float(m[1])) for m in matches]
        return None
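
For a quick sanity check, here is how that helper behaves on a made-up response string (the example text below is only an illustration, not actual model output):

    # Hypothetical model output containing one Depth API call.
    example_response = "The point lies on the table, so I need Depth(0.51,0.40)."
    print(extract_depth_api(example_response))    # -> [(0.51, 0.4)]
    print(extract_depth_api("No API call here"))  # -> None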

    from PIL import Image
    import numpy as np

    # First pass: query the model and check whether it called the Depth API.
    response = call_model_engine(args, sample, model, tokenizer, processor)
    response = str(response)
    depth_api_values = extract_depth_api(response)

    if depth_api_values is not None:
        sample['question_2'] = ''
        sample['response_1'] = response
        for depth_api_value in depth_api_values:
            x, y = depth_api_value[0], depth_api_value[1]
            depth_map = Image.open(...)
            width, height = depth_map.size
            depth_map = np.array(depth_map)
            # Convert normalized (x, y) to pixel indices, clamped to the image bounds.
            pixel_x = max(int(x * width) - 1, 0)
            pixel_y = max(int(y * height) - 1, 0)
            depth_value = depth_map[pixel_y, pixel_x]
            sample['question_2'] = sample['question_2'] + 'Depth(' + str(x) + ',' + str(y) + ')=' + str(depth_value) + ', '
        sample['question_2'] = sample['question_2'].strip(', ')
        # Second pass: feed the Depth API results back to the model.
        response_after_api = call_model_engine(args, sample, model, tokenizer, processor)


So the conversation may look like:

User: What is the depth value of object: elephant?
SpatialBot: The elephant corresponds to a bounding box of [0.15,0.70,0.25,0.80], so it corresponds to Depth(0.2,0.75).
User (API): Depth(0.2,0.75) = 100
SpatialBot: The depth value of the elephant is 100.
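
As a small illustration of the arithmetic behind that exchange: the Depth query point is simply the center of the normalized bounding box the model predicts. The helper below is mine, not part of the repo; the model produces this point implicitly in its answer.

    def bbox_center(bbox):
        # bbox is [x1, y1, x2, y2] in normalized coordinates.
        x1, y1, x2, y2 = bbox
        return round((x1 + x2) / 2, 2), round((y1 + y2) / 2, 2)

    # Elephant box from the conversation above.
    print(bbox_center([0.15, 0.70, 0.25, 0.80]))  # -> (0.2, 0.75)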

RussRobin commented 2 months ago

I'll close this issue since no further questions have been raised for a week. Feel free to reopen it if you still have concerns about our work.