facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Missing segmentation mask when tracking multiple objects #249

Open ABHISHEKVALSAN opened 3 weeks ago

ABHISHEKVALSAN commented 3 weeks ago

I'm trying to track multiple objects in a video sequence (I'm using the magician video from the SAM demo on the metademolab site for illustration).

My first click was on the 1st frame, on the cloth, for obj_id=1 (via add_new_points_or_box). My second click was on the 30th frame, on the hat, for obj_id=2.

After this, I clicked "track objects" (propagate_in_video).

Once I run the tracking, I observe that at the 30th frame of the video the segmentation mask for obj_id=1 (the cloth) is missing. The same mask reappears in the 31st frame.

(Screenshots: Frame 29, Frame 30, Frame 31)

Is this a bug, or am I missing a configuration setting in my code?

I have similar code that mimics this whole setup on my local server.

I'm using build_sam2_video_predictor to load the model, add_new_points_or_box to add points, and propagate_in_video to track the objects.

The model I'm using is sam2_hiera_large.pt, with config file sam2_hiera_l.yaml.

My desired functionality is to have the segmentation mask for each object throughout the video, as if each object were tracked individually.

PS: When there is only one object to be tracked, it is tracked without any issue. It also works when the track points for both objects are provided on the same frame.

PPS: I looked at video_predictor_example.ipynb and build_sam.py (and am probably still looking) for an answer to this. Any help is much appreciated.
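
For reference, a minimal sketch of the workflow described above, using the standard SAM 2 video predictor API (the frame directory and click coordinates are placeholders, not the exact values from the report):

import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="./magician_frames")  # directory of JPEG frames

    # First click: frame 0, on the cloth, obj_id=1
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=[[400, 300]], labels=[1],
    )
    # Second click: frame 30, on the hat, obj_id=2
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=30, obj_id=2,
        points=[[250, 150]], labels=[1],
    )

    # Propagate: frames 29 and 31 contain a mask for obj_id=1,
    # but on frame 30 the mask for obj_id=1 comes back empty.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()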

PierreId commented 3 weeks ago

I have the exact same problem (on multiple videos) with both the online demo and local testing: it works very well with one object, but has problems when multiple objects are added on different frames.

catalys1 commented 3 weeks ago

I don't know for sure, but my guess is that it has to do with the functionality for adding new interactions. Since you don't also add an annotation for the cloth on that frame, it doesn't get tracked there. I think the code handles frames where you are adding annotations differently from those where you aren't.

daohu527 commented 3 weeks ago

Below is a note about the dataset setup. It seems that doing this will cause some problems: although the object appears in the first frame, it is not selected there.

I mean, this is undefined behavior for the model.

> Note: a limitation of the vos_inference.py script above is that currently it only supports VOS datasets where all objects to track already appear on frame 0 in each video (and therefore it doesn't apply to some datasets such as LVOS that have objects only appearing in the middle of a video).

PierreId commented 2 weeks ago

From what I have read in the code, the implementation expects each frame with a prompt to contain info for every object. Indeed, the memory bank stores data by frame id, and for a frame with partial prompts, the existing data (computed when the prompts were added) is reused without running any inference.

The only workaround that I found is to create one "inference_state" per object. To do that, I:

- created one inference_state per object (a deepcopy of a reference state, except the images, which are reused);
- stored the "Image Encoder" outputs while running propagate_in_video() on the first object, and reused them for the next objects.

This way, I can track many objects (70 objects on 200 frames in my tests) without OOM (15 GB of VRAM with the Large model) and with reasonable inference time (the Image Encoder only runs once).

ayushjain1144 commented 2 weeks ago

Hi @PierreId, I am facing similar issues. Could you provide more details of your approach?

Specifically, for creating a new inference_state per object, you do deepcopy(inference_state) -- but what do you mean by "except the images, which are reused"?

I also didn't understand this part: "stored the "Image Encoder" outputs while running propagate_in_video() on the first object, and reused them for the next objects". Doesn't the image encoder generate the features when we run init_state?

Thank you!

PierreId commented 2 weeks ago

Basically, I did something like this to create one inference state per object:

import copy

# Create the reference state once (the full video tensor lives in 'images')
ref_inference_state = predictor.init_state(...)
# Detach the images so they are not duplicated by deepcopy
ref_images = ref_inference_state['images']
ref_inference_state['images'] = None
# Create a new state for object N, sharing the same image tensor
new_object_state = copy.deepcopy(ref_inference_state)
new_object_state['images'] = ref_images

And for the "Image Encoder", since it only depends on the input image, and not on the tracker (memory bank or object prompts), you can reuse it :

class SAM2VideoPredictor(SAM2Base):
    def __init__(...):
        ...
        self.stored_image_encoder_data = {}

...

    def _run_single_frame_inference(...):
        """Run tracking on a single frame based on current inputs and previous memory."""
        # Retrieve correct image features
        if frame_idx in self.stored_image_encoder_data:
            current_vision_feats, current_vision_pos_embeds, feat_sizes = self.stored_image_encoder_data[frame_idx]
        else:
            (
                _,
                _,
                current_vision_feats,
                current_vision_pos_embeds,
                feat_sizes,
            ) = self._get_image_feature(inference_state, frame_idx, batch_size)

            # Save info
            self.stored_image_encoder_data[frame_idx] = current_vision_feats, current_vision_pos_embeds, feat_sizes

        ...

jeezrick commented 1 week ago

Regarding the original problem in this issue: it's because frame 1 and frame 30 are labeled as conditioning frames (cond_frame), and in the code the tracking result is returned directly for conditioning frames. So the segmentation results for frame 1 and frame 30 will be exactly the same as when you first added points on them; no extra inference against memory is run for those frames. Here is the code:

        for frame_idx in tqdm(processing_order, desc="propagate in video"):
            # We skip those frames already in consolidated outputs (these are frames
            # that received input clicks or mask). Note that we cannot directly run
            # batched forward on them via `_run_single_frame_inference` because the
            # number of clicks on each object might be different.
            if frame_idx in consolidated_frame_inds["cond_frame_outputs"]:
                storage_key = "cond_frame_outputs"
                current_out = output_dict[storage_key][frame_idx]
                pred_masks = current_out["pred_masks"]
                if clear_non_cond_mem:
                    # clear non-conditioning memory of the surrounding frames
                    self._clear_non_cond_mem_around_input(inference_state, frame_idx)
            elif frame_idx in consolidated_frame_inds["non_cond_frame_outputs"]:
                storage_key = "non_cond_frame_outputs"
                current_out = output_dict[storage_key][frame_idx]
                pred_masks = current_out["pred_masks"]
            else:
                storage_key = "non_cond_frame_outputs"
                current_out, pred_masks = self._run_single_frame_inference(
                    inference_state=inference_state,
                    output_dict=output_dict,
                    frame_idx=frame_idx,
                    batch_size=batch_size,
                    is_init_cond_frame=False,
                    point_inputs=None,
                    mask_inputs=None,
                    reverse=reverse,
                    run_mem_encoder=True,
                )
                output_dict[storage_key][frame_idx] = current_out

ayushjain1144 commented 1 week ago

Hi @PierreId, thanks for your response. Caching the image encoder features helped quite a bit.

About your first suggestion of creating a new inference state per object via deepcopy: wouldn't just calling reset_state on the inference state work just as well? Concretely, I am thinking that we track one object with an inference state, then reset the state and track the next object, and so on. I assume your approach is sequential too, i.e. it tracks only one object at a time and then moves on to the next. Am I missing something?
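
For context, a rough sketch of the reset_state approach described here (the prompts dict, point coordinates, and mask bookkeeping are my own placeholders; as far as I can tell, reset_state clears prompts and tracking results but keeps the loaded video frames in the state):

inference_state = predictor.init_state(video_path="./frames")
masks_per_frame = {}  # frame_idx -> {obj_id: binary mask}

# prompts: hypothetical dict of obj_id -> (frame_idx, points, labels)
for obj_id, (frame_idx, points, labels) in prompts.items():
    predictor.reset_state(inference_state)  # drop previous prompts and tracking results
    predictor.add_new_points_or_box(
        inference_state=inference_state, frame_idx=frame_idx, obj_id=obj_id,
        points=points, labels=labels,
    )
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
        masks_per_frame.setdefault(out_frame_idx, {})[obj_id] = (out_mask_logits[0] > 0.0).cpu()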

jeezrick commented 1 week ago

> Hi @PierreId, thanks for your response. Caching the image encoder features helped quite a bit.
>
> About your first suggestion of creating a new inference state per object via deepcopy: wouldn't just calling reset_state on the inference state work just as well? Concretely, I am thinking that we track one object with an inference state, then reset the state and track the next object, and so on. I assume your approach is sequential too, i.e. it tracks only one object at a time and then moves on to the next. Am I missing something?

I think what he is trying to do is track multiple things simultaneously.

ayushjain1144 commented 1 week ago

I see, and that would be amazing, but do you know how he is doing that? I am thinking that propagate_in_video needs to be called with a single inference state, so unless we do some multithreading, how can we run multiple propagate_in_video calls with different inference states?
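
For what it's worth, one way to read PierreId's description is that the per-object propagations still run one after another rather than in parallel, and the per-frame masks are merged afterwards; only the image-encoder work is shared via the cache from his patch. A hedged sketch, where per_object_states is a hypothetical dict of per-object inference states built with the deepcopy trick above:

combined_masks = {}  # frame_idx -> {obj_id: binary mask}
for obj_id, state in per_object_states.items():
    # Sequential per-object runs; frames are only encoded once thanks to the
    # stored_image_encoder_data cache, so later objects skip the image encoder.
    for frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(state):
        combined_masks.setdefault(frame_idx, {})[obj_id] = (out_mask_logits[0] > 0.0).cpu()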