facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Missing segmentation mask when tracking multiple objects #249

Open ABHISHEKVALSAN opened 1 month ago

ABHISHEKVALSAN commented 1 month ago

I'm trying to track multiple objects in a video sequence (I'm using the magician video from the SAM metademolab site for illustration).

My first click was on the 1st frame, on the cloth, for obj_id=1 (via add_new_points_or_box). My second click was on the 30th frame, on the hat, for obj_id=2.

After this, I clicked on "track objects" (propagate_in_video).

Once I ran the tracking, I observed that at the 30th frame of the video, the segmentation mask for obj_id=1 (the cloth) is missing. The same segmentation mask reappears in the 31st frame.

[Screenshots: Frame 29, Frame 30, Frame 31]

Is this a bug, or am I missing a configuration setting in my code?

I have similar code that mimics this whole setup on my local server.

I'm using the function build_sam2_video_predictor to load the model, add_new_points_or_box to add points, and propagate_in_video to track the objects.

The model I'm using is sam2_hiera_large.pt, with config file sam2_hiera_l.yaml.

My desired functionality is to have segmentation masks for both objects throughout the video, as if each object were tracked individually.
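
For reference, a minimal sketch of this setup, assuming the video predictor API named above (the checkpoint paths, frame directory, and click coordinates are placeholders):

import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Load the video predictor with the checkpoint/config mentioned above
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

# Initialize the inference state on a directory of video frames (placeholder path)
state = predictor.init_state(video_path="videos/magician_frames")

# Click on the cloth in the 1st frame (index 0) for obj_id=1
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=0, obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),  # placeholder click
    labels=np.array([1], dtype=np.int32),              # 1 = positive click
)

# Click on the hat in the 30th frame for obj_id=2
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=30, obj_id=2,
    points=np.array([[400, 120]], dtype=np.float32),  # placeholder click
    labels=np.array([1], dtype=np.int32),
)

# Propagate through the video; on frame 30, the mask for obj_id=1 comes back empty
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masks = (mask_logits > 0.0).cpu().numpy()  # one mask per obj_id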

PS: When there is only one object to track, it is tracked without any issue. It also works when the clicks for both objects are provided on the same frame.

PPS: I have looked at video_predictor_example.ipynb and build_sam.py (and am probably still looking) for an answer to this. Any help is much appreciated.

PierreId commented 1 month ago

I have the exact same problem (on multiple videos) with the online demo and when I test locally: it works very well with one object, but there are problems when multiple objects are added at different frames.

catalys1 commented 1 month ago

I don't know for sure, but my guess is that it has to do with the functionality for adding new interactions. Since you don't also add an annotation for the cloth on that frame, it doesn't get tracked there. I think the code handles frames where you are adding annotations differently from those where you aren't.

daohu527 commented 1 month ago

Below is a note from the documentation about dataset support. It seems that doing this will cause some problems: although the object appears in the first frame, it is not selected there.

What I mean is that this is undefined behavior for the model.

Note: a limitation of the vos_inference.py script above is that currently it only supports VOS datasets where all objects to track already appear on frame 0 in each video (and therefore it doesn't apply to some datasets such as LVOS that have objects only appearing in the middle of a video).

PierreId commented 1 month ago

From what I have read in the code, the implementation expects that each frame with a prompt must contain the info for every object. Indeed, the memory bank stores data by frame id, and for a frame with partial prompts, the existing data (computed when the prompts were added) will be reused without running any inference.

The only workaround that I found is to create one "inference_state" per object. To do that, I:

- created a copy of the inference state for each object, except the images, which are reused;
- stored the "Image Encoder" outputs while running propagate_in_video() on the first object, and reused them for the next objects.

This way, I can track many objects (70 objects on 200 frames in my tests) without OOM (15 GB of VRAM with the Large model) and with reasonable inference time (the Image Encoder only runs once).
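
In rough pseudocode, the resulting per-object loop looks like this (a sketch of the approach described above; prompts and make_state_for_object are hypothetical placeholders, and the actual state duplication is shown in a later comment):

# prompts: hypothetical dict mapping obj_id -> (frame_idx, points, labels)
all_masks = {}  # frame_idx -> {obj_id: binary mask}

for obj_id, (frame_idx, points, labels) in prompts.items():
    # One inference state per object (hypothetical helper; see the deepcopy
    # snippet further down for how the state is actually duplicated)
    state = make_state_for_object(ref_inference_state)

    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=frame_idx, obj_id=obj_id,
        points=points, labels=labels,
    )

    # Objects are tracked one after another over the whole video
    for out_frame_idx, out_obj_ids, mask_logits in predictor.propagate_in_video(state):
        all_masks.setdefault(out_frame_idx, {})[obj_id] = (mask_logits[0] > 0.0).cpu().numpy()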

ayushjain1144 commented 1 month ago

Hi @PierreId, I am facing similar issues. Could you provide more details of your approach?

Specifically, for creating a new inference_state per object, you do deepcopy(inference_state) -- but what do you mean by "except the images, which are reused"?

I also didn't understand this part: "stored the "Image Encoder" outputs while running propagate_in_video() on the first object, and reused them for the next objects". Doesn't the image encoder generate the features when we run init_state?

Thank you!

PierreId commented 1 month ago

Basically, I did something like this to create one inference state per object:

# Create reference state
ref_inference_state = predictor.init_state(...)
ref_images = ref_inference_state['images']
ref_inference_state['images'] = None
# Create new state for object N
new_object_state = copy.deepcopy(ref_inference_state)
new_object_state['images'] = ref_images

And for the "Image Encoder", since it only depends on the input image, and not on the tracker (memory bank or object prompts), you can reuse it :

class SAM2VideoPredictor(SAM2Base):
    def __init__(...):
        ...
        self.stored_image_encoder_data = {}

    ...

    def _run_single_frame_inference(...):
        """Run tracking on a single frame based on current inputs and previous memory."""
        # Retrieve correct image features
        if frame_idx in self.stored_image_encoder_data:
            current_vision_feats, current_vision_pos_embeds, feat_sizes = self.stored_image_encoder_data[frame_idx]
        else:
            (
                _,
                _,
                current_vision_feats,
                current_vision_pos_embeds,
                feat_sizes,
            ) = self._get_image_feature(inference_state, frame_idx, batch_size)

            # Save info
            self.stored_image_encoder_data[frame_idx] = current_vision_feats, current_vision_pos_embeds, feat_sizes

        ...

jeezrick commented 1 month ago

Regarding the original problem of this issue: it's because frame 1 and frame 30 are labeled as cond_frames, and in the code the tracking result for a cond_frame is returned directly. So the frame 1 and frame 30 segmentation results will be exactly the same as when you first added points on them; no extra inference against memory is run for those frames. Here is the code:

        for frame_idx in tqdm(processing_order, desc="propagate in video"):
            # We skip those frames already in consolidated outputs (these are frames
            # that received input clicks or mask). Note that we cannot directly run
            # batched forward on them via `_run_single_frame_inference` because the
            # number of clicks on each object might be different.
            if frame_idx in consolidated_frame_inds["cond_frame_outputs"]:
                storage_key = "cond_frame_outputs"
                current_out = output_dict[storage_key][frame_idx]
                pred_masks = current_out["pred_masks"]
                if clear_non_cond_mem:
                    # clear non-conditioning memory of the surrounding frames
                    self._clear_non_cond_mem_around_input(inference_state, frame_idx)
            elif frame_idx in consolidated_frame_inds["non_cond_frame_outputs"]:
                storage_key = "non_cond_frame_outputs"
                current_out = output_dict[storage_key][frame_idx]
                pred_masks = current_out["pred_masks"]
            else:
                storage_key = "non_cond_frame_outputs"
                current_out, pred_masks = self._run_single_frame_inference(
                    inference_state=inference_state,
                    output_dict=output_dict,
                    frame_idx=frame_idx,
                    batch_size=batch_size,
                    is_init_cond_frame=False,
                    point_inputs=None,
                    mask_inputs=None,
                    reverse=reverse,
                    run_mem_encoder=True,
                )
                output_dict[storage_key][frame_idx] = current_out
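
A lighter workaround that follows from this behavior (and from the original report that prompting both objects on the same frame works) would be to give every prompted frame a prompt for each object, e.g. also adding a click for the cloth on frame 30. A sketch with a placeholder coordinate:

import numpy as np

# Frame 30 currently has a prompt only for obj_id=2 (hat), so its stored
# conditioning output lacks obj_id=1. Adding a click for the cloth on frame 30
# as well makes that conditioning frame cover both objects.
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=30, obj_id=1,
    points=np.array([[215, 360]], dtype=np.float32),  # placeholder cloth click
    labels=np.array([1], dtype=np.int32),
)
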
ayushjain1144 commented 1 month ago

Hi @PierreId, thanks for your response. Caching the image encoder features helped quite a bit.

About your first suggestion of creating a new inference state per object via deepcopy --- wouldn't just calling reset_state on the inference state work just as well? Concretely, I am thinking that we track an object with an inference state, then reset the state and track the next object, and so on. I think your approach is sequential too, i.e. it tracks only one object at a time and then moves on to the next. Am I missing something?

jeezrick commented 1 month ago

> Hi @PierreId, thanks for your response. Caching the image encoder features helped quite a bit.
>
> About your first suggestion of creating a new inference state per object via deepcopy --- wouldn't just calling reset_state on the inference state work just as well? Concretely, I am thinking that we track an object with an inference state, then reset the state and track the next object, and so on. I think your approach is sequential too, i.e. it tracks only one object at a time and then moves on to the next. Am I missing something?

I think what he is trying to do is track multiple things simultaneously.

ayushjain1144 commented 1 month ago

I see, and that would be amazing, but do you know how he is doing that? I am thinking that propagate_in_video needs to be called with a single inference state, so unless we do some multithreading, how can we run multiple propagate_in_video calls with different inference states?

PierreId commented 3 weeks ago

Indeed, I want to track multiple objects simultaneously. One use case could be to integrate it into an annotation tool (like CVAT) and track multiple objects.

Doing that, I can process a video of 900 frames with 100 tracked objects, using <9 GB of VRAM with the Large model and <7.5 GB with the Tiny model.

christian-5-28 commented 2 weeks ago

Hi @PierreId, thank you for sharing your findings! I have one question: did you notice any drop in speed correlated with the number of objects tracked in the video? I experienced an almost linear correlation between FPS degradation and the number of tracked objects. Can you share the FPS rate for your example of 100 objects in 900 frames?

Thank you so much in advance.

PierreId commented 2 weeks ago

Hi @christian-5-28, aside from the image encoder, everything else is object-dependent. So, if you have 10 objects, it will require 10x more computation (compared to 1 object). In terms of speed, with an RTX 3090, I roughly get 10 im/s for the image encoder and 40 im/s for each tracked object. With an RTX 5000, it goes about twice as fast.

christian-5-28 commented 2 weeks ago

Thanks @PierreId for the quick response. So, for your example of 100 tracked objects in 900 frames, it roughly takes 22.5 seconds (900 / 40) to process the 900 frames for a single object, and therefore about 37.5 minutes (100 x 22.5 s) in total to process 900 frames with 100 tracked objects, right?

Do you have any idea where this "sequentiality" comes from (model architecture design choices or code inefficiencies)? Sorry if it is an "obvious" question, but I have just started digging into the details of the model and the repository. I think being able to track multiple objects in parallel without increasing the processing time linearly would open up many more use cases for this technology.

Thanks again in advance.

PierreId commented 2 weeks ago

@christian-5-28: the official SAM 2 code is parallel, but the amount of memory required to track multiple objects that way is very high. This is (most probably) why it is only possible to track 3 objects on the demo website and why only 3 objects at a time were tracked during training. The modifications that I (and others) made to make it sequential are meant to allow tracking more objects on longer videos.

And, in my video, the objects are not in the scene the whole time, so this is much faster than 37.5 min :)