facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

New object ids during video #185

Open RotsteinNoam opened 2 months ago

RotsteinNoam commented 2 months ago

Hi, I noticed that new IDs cannot be added during inference on videos: https://github.com/facebookresearch/segment-anything-2/blob/6186d1529a9c26f7b6e658f3e704d4bee386d9ba/sam2/sam2_video_predictor.py#L153C19-L153C43

What is the reason for this limitation? How do you suggest overcoming this issue if the IDs are not known ahead of time?

ShuoShenDe commented 2 months ago

Hello RotsteinNoam,

I believe this is mainly for logistical simplicity. If you review the code, you'll notice that the tracking_has_started flag changes to True when you invoke propagate_in_video, and it only reverts to False upon executing reset_state. Throughout this process, the video_predictor attempts to identify all prompts and returns obj_ids and video_res_masks, ensuring that the length of video_res_masks matches the number of obj_ids.

Additionally, if the video_predictor were to allow users to add IDs or prompts during the prediction phase, it would significantly complicate the logistics and potentially lead to conflicts during data merging. This constraint helps maintain the integrity and consistency of the tracking process.
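
For illustration, a rough sketch of how that constraint plays out with the predictor API (the config, checkpoint, video path, and prompt values below are placeholders, and the comments paraphrase the behaviour rather than quoting the repo code):

import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
state = predictor.init_state(video_path="./videos/example")  # tracking_has_started is False

# Adding objects/prompts is allowed before propagation starts
pts = np.array([[210.0, 350.0]], dtype=np.float32)
lbls = np.array([1], dtype=np.int32)
predictor.add_new_points(state, frame_idx=0, obj_id=1, points=pts, labels=lbls)

for f_idx, obj_ids, masks in predictor.propagate_in_video(state):
    pass  # tracking_has_started is now True; adding a new obj_id here raises an error

predictor.reset_state(state)  # clears prompts/results and allows new object ids again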

Susan

zly7 commented 2 months ago

Hello @ShuoShenDe,

Thank you for the clarification on why adding new IDs during inference is limited. I understand the challenges related to maintaining the integrity and consistency of the tracking process.

However, I am dealing with scenarios where objects in the video may come and go, and it's not feasible to know all object IDs ahead of time. Could you suggest an approach or modification to the existing codebase that could handle such dynamic changes in object IDs during video processing? Thanks!!!

ShuoShenDe commented 2 months ago

Hi @zly7,

I recommend you take a look at my other repo, which includes support for continuously adding new objects. See grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py or grounded_sam2_tracking_demo_with_continuous_id.

Susan

heyoeyo commented 2 months ago

What is the reason for this limitation? ... overcoming this issue if the IDs are not known ahead of time

The video segmentation example seems like it's built for use in an interactive environment, not for automation.

If you wanted to add new object IDs as they appear, I think the best way would be to start a new inference_state (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data).

However, with this approach you still need to run the propagate_in_video function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the sam2_video_predictor, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.
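
A rough sketch of that per-object bookkeeping (not the official usage; the helper name, paths, and the way the images tensor is shared are all just illustrative assumptions):

from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

# One inference_state per real-world object; obj_id is always 0 inside each state
base_state = predictor.init_state(video_path="./videos/example")  # loads the frames once
states_by_obj = {}

def start_tracking_new_object(real_obj_id, frame_idx, points, labels):
    # Hypothetical helper: dedicated inference_state for a newly appearing object
    state = predictor.init_state(video_path="./videos/example")
    state["images"] = base_state["images"]  # point at the shared frame data instead of a duplicate copy
    predictor.add_new_points(state, frame_idx=frame_idx, obj_id=0, points=points, labels=labels)
    states_by_obj[real_obj_id] = state
    return state

# Each state is then propagated separately as its object appears, e.g.
# for f_idx, obj_ids, masks in predictor.propagate_in_video(states_by_obj[some_id]): ...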

I have a re-worked video example that keeps the loop outside the model, and might be easier to work with (each object is represented by 4 lists separate from the model state), though it's far from finalized, so only recommended for experimenting atm.

bhoo-git commented 1 month ago

If you wanted to add new object IDs as they appear, I think the best way would be to start a new inference_state (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data).

However, with this approach you still need to run the propagate_in_video function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the sam2_video_predictor, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.

I'm actually looking to implement a similar approach to what @heyoeyo mentioned above for a video stream instead of a file, based on the segment-anything-model-2-real-time repository by @Gy920.

My desired outcome for the implementation would have the following workflow:

  1. Dynamically initialize a new inference_state upon new object detection/prompt in the stream.
  2. Run add_new_point_or_box (or add_new_prompts) for the new inference state, so that for each new object in the video there is an inference state and corresponding output logit(s).
  3. Use the track function, which invokes the propagate_in_video_preflight and track_step.

Here, my understanding is that the downstream code in the track function needs to run for each currently existing inference_state, and this leads to some uncertainties - I would greatly appreciate a second pair of eyes for insight.

a. It would not make sense to run _get_feature for each inference_state since they all share the same frame, but to get features for all of the inference_states we would need to set the batch_size argument passed into _get_feature to match the total number of objects. How should this mismatch be approached? I'm not sure whether treating batch_size as the total number of objects, when each inference state is handled independently, will lead to downstream issues.

b. Once track_step is invoked on the current features, we get a current_out output with corresponding mask(s). But since my implementation runs this for each inference_state, I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the pred_masks together, or is there a specific operation for this?

I know the question is long-winded. Thank you for the time and help!

heyoeyo commented 1 month ago

Hi @bhoo-git

As a follow-up, I think it would be simplest to take the image encoding step outside of the run_single_frame_inference function and then add a loop over a 'list/dictionary of inference states' inside of the existing frame loop. Ideally the loop would be fully outside the video predictor class even (but that might require some additional modifications). So having something like:


# Assuming this is available/managed somewhere else
inference_state_per_objid = {0: ..., 1: ...}

# Original frame loop
for frame_idx in tqdm(processing_order, desc="propagate in video"):

  # Get shared image features, used by all objects for this frame
  # -> might want to take 'inference_state["images"]' data outside
  #    of inference states, since it should also be shared and
  #    would make this step easier to implement
  shared_image_features = self._get_image_feature(...)

  # On each frame, perform update for every 'object' (i.e. inference state) independently
  results_by_objid = {}
  for objid, inference_state in inference_state_per_objid.items():

    output_dict = inference_state["output_dict"]
    # do the same thing as original frame loop for each inference state
    # except use the shared image features instead of re-computing
    results_by_objid[objid] = video_res_masks

  # Similar to original output if needed
  obj_ids = results_by_objid.keys()
  yield frame_idx, obj_ids, results_by_objid

Getting this working with dynamically created inference states will be the hard part I think, since you may have to 'fake' old data in newly created states for the existing code to work properly (or otherwise update the parts that try to access old data so they don't crash if it's missing).

we would need to set the batch_size value going into the _get_feature as an arg to match the total number of objects

It's probably simpler to start by just running each inference state separately in a loop like above (so each inference state uses a batch size of 1). It might be possible to use batching across all the objects, but it would be extremely complex to also batch the prompts/memory steps.

I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the pred_masks together or is there a specific operation for this?

If you want to combine everything into a single mask that's like a logit, taking the max across all masks is probably best (i.e. the per-pixel max across all object masks). Otherwise just adding the raw prediction values could cause positive/negative regions from different objects to cancel in weird ways. Alternatively you could take the 'bitwise OR' of all thresholded masks if you don't need it to be logit-like, and just want the final (binary) result.
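
For reference, a minimal sketch of both options, assuming each per-object result is a logits tensor of shape (1, 1, H, W) as returned by the predictor:

import torch

def combine_masks(results_by_objid, threshold=0.0):
    stacked = torch.cat(list(results_by_objid.values()), dim=0)  # (num_objs, 1, H, W)

    # Logit-like combination: per-pixel max across all objects
    combined_logits = stacked.max(dim=0).values  # (1, H, W)

    # Binary combination: bitwise OR of the thresholded masks
    combined_binary = (stacked > threshold).any(dim=0)  # (1, H, W), bool

    return combined_logits, combined_binary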

anja-sheppard commented 1 week ago

Has anyone found a solution to this? @heyoeyo @bhoo-git @zly7

heyoeyo commented 1 week ago

Has anyone found a solution to this?

I have a multi-object version of the script I mentioned in an earlier post that keeps track of objects separately (as well as an interactive version. Edit: Here's an explanation & video example for these scripts), but it's based on code that is more heavily modified than what was discussed above (e.g. it's not using copies of the inference state).

anja-sheppard commented 5 days ago

@ShuoShenDe is there any way to run your GroundedSAM2 code on videos in real time (i.e. not having all of the video available at run time)?

anja-sheppard commented 4 days ago

@heyoeyo Thank you so much for the link! It works well. I'm curious if you have found a solution for including a semantic object detection network. If you include such a network and incorporate predicted objects from each frame as prompts to SAM2, then you will get many re-prompts of the same objects. Have you faced this and come up with a solution for it?

heyoeyo commented 4 days ago

I'm curious if you have found a solution for including a semantic object detection network

I haven't tried it, but in theory it could work as an automated prompt to track newly appearing objects. The idea would be to run the detector on each frame and whenever a high confidence detection is found that doesn't have significant overlap (e.g. IoU) with an existing SAM prediction, it could be assumed to be a new object. Then that detection would be used to generate a 'prompt encoding' to begin tracking with SAM.
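
A minimal sketch of that overlap check (the helper name and threshold are arbitrary; it assumes the detection and the current SAM predictions are available as same-shape binary mask tensors):

import torch

def is_new_object(detection_mask, existing_masks, iou_threshold=0.5):
    # Returns True if the detection doesn't significantly overlap any tracked object
    det = detection_mask.bool()
    for mask in existing_masks:
        m = mask.bool()
        intersection = (det & m).sum().item()
        union = (det | m).sum().item()
        iou = (intersection / union) if union > 0 else 0.0
        if iou >= iou_threshold:
            return False  # this detection is already being tracked
    return True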

If you include such a network and incorporate predicted objects from each frame as prompts to SAM2, then you will get many re-prompts of the same objects

From what I've seen, the SAM model works quite well with only a single prompt encoding and then relying on the 'recent frame encodings' to keep track of objects that change appearance over time. Maybe every few seconds it's worth updating the prompt encoding based on a newer detection (assuming a high IoU with the SAM prediction), but I think it's probably best to use very few (or just 1) prompt encoding and replace them rather than accumulating lots of them.

anja-sheppard commented 4 days ago

@heyoeyo Thanks! Yes, I'd like to replace prompts rather than accumulating them, but transferring instance/object tracking IDs over from frame to frame if you have newer detections is a challenge I am facing with that approach. I am currently taking the first approach you mentioned, of trying to determine if the prompt is a repeat of an existing object that is being tracked.

One challenge with this is having to run the object propagation twice. Imagine this:

for frame in video:
    propagate object_ids through frame
    run semantic segmentation network and retrieve a prompt for each object detected
    compare new prompts with existing objects in current frame by pixel location
    add new objects to list of prompts
    propagate new object_ids through frame

Otherwise, you are comparing semantic detections on frame i against objects tracked through frame i-1.

heyoeyo commented 4 days ago

Yes, I'd like to replace prompts rather than accumulating them, but transferring instance/object tracking IDs over from frame to frame if you have newer detections is a challenge

If you're using the original code base, there's a config parameter (max_cond_frames_in_attn) that can be set to 1 (it has to be added to the .yaml config file), which will have this sort of 'replace the prompt encoding' effect. It doesn't avoid accumulating the encodings internally, but it may be simpler than manually replacing the encodings. (Actually, I think the setting has to be 2, because of an odd assert statement, unless you comment it out!)
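
For illustration, a rough sketch of the same idea done in Python after building the predictor (untested assumption: since max_cond_frames_in_attn ends up as a plain attribute on the model, overriding it directly should behave like the .yaml route):

from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

# Limit how many conditioning (prompt) frames get attended to.
# A value of 1 trips an assert in select_closest_cond_frames, hence 2 as noted above.
predictor.max_cond_frames_in_attn = 2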

One challenge with this is having to run the object propagation twice... Otherwise, you are comparing semantic detections on frame i against objects tracked through frame i-1.

That makes sense. I think updating the known objects is needed regardless, since you'd want the mask predictions anyway (and as you say, they're needed to compare everything on the same frame). So the main concern would be not duplicating work for the new objects. If you're using the original code, then the update for the known objects (roughly) corresponds to the track_step function, whereas the function needed for new objects is add_new_points (or similar). If it's possible to call these separately (and with track_step called before add_new_points), I think you can at least avoid duplicating any of the computation.

Depending on your use case, maybe it also makes sense to only run the semantic/new object stuff every other frame (or less) while running SAM on every frame? That could reduce the processing a lot in exchange for detecting objects 1 frame late sometimes (which could be acceptable?).
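
To make the ordering concrete, a rough per-frame sketch (pseudocode-level: update_known_objects, run_semantic_detector, is_new_object and start_tracking_new_object are hypothetical stand-ins for the track_step-style and add_new_points-style paths discussed above):

DETECTOR_EVERY_N_FRAMES = 2  # run the (heavier) detector less often than SAM

for frame_idx, frame in enumerate(video_frames):
    # 1) update all known objects first (the track_step-style path)
    masks_by_obj = update_known_objects(frame_idx, frame)

    # 2) only run the semantic detector on some frames
    if frame_idx % DETECTOR_EVERY_N_FRAMES == 0:
        for det in run_semantic_detector(frame):
            if is_new_object(det.mask, masks_by_obj.values()):
                # 3) prompt SAM 2 for the new object (the add_new_points-style path)
                start_tracking_new_object(det, frame_idx)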

anja-sheppard commented 2 days ago

For those interested in seeing the results of this experiment--here is my fork of muggled_sam that works with Mask2Former to retrieve semantics. Clone Mask2Former in the same directory as you've cloned muggled_sam and it should work! Thanks for the helpful comments @heyoeyo

https://github.com/umfieldrobotics/muggled_sam

https://github.com/user-attachments/assets/354297f5-1394-47b5-ada0-4f03a1c65392