RotsteinNoam opened this issue 1 month ago
Hello RotsteinNoam,

I believe this occurs primarily for logistical simplicity. If you review the code, you'll notice that the `tracking_has_started` flag changes to `True` when you invoke `propagate_in_video`, and reverts to `False` upon executing `reset_state`. Throughout this process, the `video_predictor` attempts to identify all prompts and returns `obj_ids` and `video_res_masks`, ensuring that the length of `video_res_masks` matches the number of `obj_ids`.

Additionally, if the `video_predictor` were to allow users to add IDs or prompts during the prediction phase, it would significantly complicate the logistics and potentially lead to conflicts during data merging. This constraint helps maintain the integrity and consistency of the tracking process.
Susan
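The guard described above can be sketched as a tiny state machine. This is an illustrative toy with names mirroring the thread (`tracking_has_started`, `propagate_in_video`, `reset_state`), not the actual SAM 2 implementation:

```python
class VideoTrackerSketch:
    """Toy sketch of the add-objects-only-before-tracking guard.
    Hypothetical class; not the real SAM2VideoPredictor."""

    def __init__(self):
        self.tracking_has_started = False
        self.obj_ids = []

    def add_object(self, obj_id):
        # New IDs are rejected once propagation has begun, so that the
        # number of returned masks always matches len(self.obj_ids).
        if self.tracking_has_started:
            raise RuntimeError(
                "Cannot add new objects after tracking has started; "
                "call reset_state() first."
            )
        self.obj_ids.append(obj_id)

    def propagate_in_video(self):
        # Flips the guard on; from here on the object set is frozen.
        self.tracking_has_started = True

    def reset_state(self):
        # Flips the guard off and clears all tracked objects.
        self.tracking_has_started = False
        self.obj_ids.clear()
```

So the only supported way to change the object set is to `reset_state` and re-prompt from scratch, which is exactly the limitation the question is about.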
Hello @ShuoShenDe,
Thank you for the clarification on why adding new IDs during inference is limited. I understand the challenges related to maintaining the integrity and consistency of the tracking process.
However, I am dealing with scenarios where objects in the video may come and go, and it's not feasible to know all object IDs ahead of time. Could you suggest an approach or modification to the existing codebase that could handle such dynamic changes in object IDs during video processing? Thanks!!!
Hi zly,

I recommend you read another repo of mine, which includes support for continuously adding new objects. See `grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py` or `grounded_sam2_tracking_demo_with_continuous_id`.
Susan
> What is the reason for this limitation? ... overcoming this issue if the IDs are not known ahead of time
The video segmentation example seems like it's built for use in an interactive environment, not for automation.
If you wanted to add new object IDs as they appear, I think the best way would be to start a new `inference_state` (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data).

However, with doing this, you still need to run the `propagate_in_video` function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the `sam2_video_predictor`, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.
I have a re-worked video example that keeps the loop outside the model, and might be easier to work with (each object is represented by 4 lists separate from the model state), though it's far from finalized, so only recommended for experimenting atm.
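A minimal sketch of the one-state-per-object idea, with the frame data shared by reference so it is never duplicated. The dict fields below (`images`, `obj_id`, `output_dict`) are illustrative stand-ins, not the real inference-state layout:

```python
import numpy as np

# Load the video frames once; every per-object state references this array.
frames = np.zeros((10, 3, 64, 64), dtype=np.float32)  # (T, C, H, W), tiny for demo

def make_state(shared_frames):
    # Store a *reference* to the shared frames, never a copy, and always
    # use object id 0 internally, as suggested above.
    return {"images": shared_frames, "obj_id": 0, "output_dict": {}}

states_by_objid = {}

def on_new_object(objid):
    """Create a fresh per-object inference state when an object appears."""
    states_by_objid[objid] = make_state(frames)

on_new_object(7)
on_new_object(12)

# Both states reference the exact same array (no duplicated frame data):
assert states_by_objid[7]["images"] is states_by_objid[12]["images"]
```

The real external object IDs then live only in the keys of `states_by_objid`, while each state internally believes it is tracking object 0.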
> If you wanted to add new object IDs as they appear, I think the best way would be to start a new `inference_state` (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data). However, with doing this, you still need to run the `propagate_in_video` function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the `sam2_video_predictor`, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.
I'm actually looking to implement a similar approach to what @heyoeyo mentioned above for a video stream instead of a file, based on the segment-anything-model-2-real-time repository by @Gy920.
My desired outcome for the implementation would have the following workflow:

1. Create a new `inference_state` upon new object detection/prompt in the stream.
2. Call `add_new_point_or_box` (or `add_new_prompts`) for the new inference state. (So for each new object in the video, there is an inference state and corresponding output logit(s).)
3. Run the `track` function, which invokes `propagate_in_video_preflight` and `track_step`.

Here, my understanding is that the downstream code in the `track` function needs to run for each of the `inference_state`s currently existing, and this leads to some uncertainties - I would greatly appreciate it if I could get a second pair of eyes for insight.
a. It would not make sense to run `_get_feature` for each `inference_state` since they all share the frame, but to get features for all of the `inference_state`s we would need to set the `batch_size` value going into `_get_feature` as an arg to match the total number of objects - how should this mismatch be approached? I'm not sure if treating the `batch_size` as the total number of objects when each inference state is treated independently will lead to downstream issues.
b. Once `track_step` is invoked on the current features, we get a `current_out` output with corresponding mask(s). But since my implementation is running this for each `inference_state`, I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the `pred_masks` together, or is there a specific operation for this?
I know the question is long-winded. Thank you for the time and help!
Hi @bhoo-git
As a follow-up, I think it would be simplest to take the image encoding step outside of the run_single_frame_inference function and then add a loop over a 'list/dictionary of inference states' inside of the existing frame loop. Ideally the loop would be fully outside the video predictor class even (but that might require some additional modifications). So having something like:
```python
# Assuming this is available/managed somewhere else
inference_state_per_objid = {0: ..., 1: ...}

# Original frame loop
for frame_idx in tqdm(processing_order, desc="propagate in video"):

    # Get shared image features, used by all objects for this frame
    # -> might want to take 'inference_state["images"]' data outside
    #    of inference states, since it should also be shared and
    #    would make this step easier to implement
    shared_image_features = self._get_image_feature(...)

    # On each frame, perform update for every 'object' (i.e. inference state) independently
    results_by_objid = {}
    for objid, inference_state in inference_state_per_objid.items():
        output_dict = inference_state["output_dict"]
        # do the same thing as original frame loop for each inference state,
        # except use the shared image features instead of re-computing
        results_by_objid[objid] = video_res_masks

    # Similar to original output if needed
    obj_ids = results_by_objid.keys()
    yield frame_idx, obj_ids, results_by_objid
```
Getting this working with dynamically created inference states will be the hard part I think, since you may have to 'fake' old data in newly created states for the existing code to work properly (or otherwise update the parts that try to access old data so they don't crash if it's missing)
> we would need to set the batch_size value going into the _get_feature as an arg to match the total number of objects
It's probably simpler to start by just running each inference state separately in a loop like above (so each inference state uses a batch size of 1). It might be possible to use batching across all the objects, but it would be extremely complex to also batch the prompts/memory steps.
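The "encode once, then loop over states with batch size 1" idea can be sketched like this. The encoder below is a counting stand-in, not SAM 2's real `_get_image_feature`:

```python
import numpy as np

call_count = 0

def get_image_feature(frame):
    """Stand-in for the (expensive) image encoder; counts invocations."""
    global call_count
    call_count += 1
    return frame.mean(axis=0)  # fake (H, W) feature map

frame = np.random.rand(3, 32, 32)

# Encode the frame once, then reuse the cached features for every
# per-object pass; each "inference state" effectively runs with batch size 1.
shared_features = get_image_feature(frame)
results_by_objid = {objid: (shared_features > 0.5) for objid in (7, 12, 99)}

assert call_count == 1  # one encoder pass despite three objects
```

This sidesteps the `batch_size` mismatch entirely: the batch dimension never needs to know how many objects exist, only the outer loop does.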
> I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the pred_masks together or is there a specific operation for this?
If you want to combine everything into a single mask that's like a logit, taking the max across all masks is probably best (i.e. the per-pixel max across all object masks). Otherwise just adding the raw prediction values could cause positive/negative regions from different objects to cancel in weird ways. Alternatively you could take the 'bitwise OR' of all thresholded masks if you don't need it to be logit-like, and just want the final (binary) result.
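Both combination strategies above can be sketched with NumPy, using tiny 2x2 "masks" for two objects so the difference with naive addition is visible:

```python
import numpy as np

# Raw per-object prediction logits for one frame: (num_objects, H, W).
mask_per_obj = np.array([
    [[-5.0,  3.0], [ 2.0, -4.0]],   # object A
    [[ 4.0, -6.0], [-1.0, -2.0]],   # object B
])

# Logit-like combination: per-pixel max across objects. A strongly
# negative value from one object cannot cancel a positive from another
# (whereas summing would give -5 + 4 = -1 at the top-left pixel).
combined_logits = mask_per_obj.max(axis=0)

# Binary combination: threshold each object at 0, then OR across objects.
combined_binary = (mask_per_obj > 0).any(axis=0)

print(combined_logits)   # [[ 4.  3.] [ 2. -2.]]
print(combined_binary)   # [[ True  True] [ True False]]
```

Here `np.any` over the object axis plays the role of the bitwise OR of thresholded masks.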
Hi, I noticed that new IDs cannot be added during inference on videos: https://github.com/facebookresearch/segment-anything-2/blob/6186d1529a9c26f7b6e658f3e704d4bee386d9ba/sam2/sam2_video_predictor.py#L153C19-L153C43
What is the reason for this limitation? How do you suggest overcoming this issue if the IDs are not known ahead of time?