RotsteinNoam opened this issue 1 month ago
Hello RotsteinNoam,

I believe this occurs primarily for logistical simplicity. If you review the code, you'll notice that the `tracking_has_started` flag changes to `True` when you invoke `propagate_in_video`, and reverts to `False` upon executing `reset_state`. Throughout this process, the `video_predictor` attempts to identify all prompts and returns `obj_ids` and `video_res_masks`, ensuring that the length of `video_res_masks` matches the number of `obj_ids`.

Additionally, if the `video_predictor` were to allow users to add IDs or prompts during the prediction phase, it would significantly complicate the logistics and potentially lead to conflicts during data merging. This constraint helps maintain the integrity and consistency of the tracking process.
Susan
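The guard described above can be sketched as a tiny state machine. This is an illustrative toy with names mirroring the thread (`tracking_has_started`, `propagate_in_video`, `reset_state`), not the actual SAM 2 implementation:

```python
class VideoTrackerSketch:
    """Toy sketch of the add-objects-only-before-tracking guard.
    Hypothetical class; not the real SAM2VideoPredictor."""

    def __init__(self):
        self.tracking_has_started = False
        self.obj_ids = []

    def add_object(self, obj_id):
        # New IDs are rejected once propagation has begun, so that the
        # number of returned masks always matches len(self.obj_ids).
        if self.tracking_has_started:
            raise RuntimeError(
                "Cannot add new objects after tracking has started; "
                "call reset_state() first."
            )
        self.obj_ids.append(obj_id)

    def propagate_in_video(self):
        # Flips the guard on; from here on the object set is frozen.
        self.tracking_has_started = True

    def reset_state(self):
        # Flips the guard off and clears all tracked objects.
        self.tracking_has_started = False
        self.obj_ids.clear()
```

So the only supported way to change the object set is to `reset_state` and re-prompt from scratch, which is exactly the limitation the question is about.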
Hello @ShuoShenDe,
Thank you for the clarification on why adding new IDs during inference is limited. I understand the challenges related to maintaining the integrity and consistency of the tracking process.
However, I am dealing with scenarios where objects in the video may come and go, and it's not feasible to know all object IDs ahead of time. Could you suggest an approach or modification to the existing codebase that could handle such dynamic changes in object IDs during video processing? Thanks!!!
Hi zly,

I recommend you read another repo of mine, which includes support for continuously adding new objects. See `grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py` or `grounded_sam2_tracking_demo_with_continuous_id`.
Susan
> What is the reason for this limitation? ... overcoming this issue if the IDs are not known ahead of time
The video segmentation example seems like it's built for use in an interactive environment, not for automation.
If you wanted to add new object IDs as they appear, I think the best way would be to start a new `inference_state` (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data).

However, with doing this, you still need to run the `propagate_in_video` function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the `sam2_video_predictor`, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.
I have a re-worked video example that keeps the loop outside the model, and might be easier to work with (each object is represented by 4 lists separate from the model state), though it's far from finalized, so only recommended for experimenting atm.
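A minimal sketch of the one-state-per-object idea, with the frame data shared by reference so it is never duplicated. The dict fields below (`images`, `obj_id`, `output_dict`) are illustrative stand-ins, not the real inference-state layout:

```python
import numpy as np

# Load the video frames once; every per-object state references this array.
frames = np.zeros((10, 3, 64, 64), dtype=np.float32)  # (T, C, H, W), tiny for demo

def make_state(shared_frames):
    # Store a *reference* to the shared frames, never a copy, and always
    # use object id 0 internally, as suggested above.
    return {"images": shared_frames, "obj_id": 0, "output_dict": {}}

states_by_objid = {}

def on_new_object(objid):
    """Create a fresh per-object inference state when an object appears."""
    states_by_objid[objid] = make_state(frames)

on_new_object(7)
on_new_object(12)

# Both states reference the exact same array (no duplicated frame data):
assert states_by_objid[7]["images"] is states_by_objid[12]["images"]
```

The real external object IDs then live only in the keys of `states_by_objid`, while each state internally believes it is tracking object 0.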
> If you wanted to add new object IDs as they appear, I think the best way would be to start a new `inference_state` (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data). However, with doing this, you still need to run the `propagate_in_video` function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the `sam2_video_predictor`, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.
I'm actually looking to implement a similar approach to what @heyoeyo mentioned above for a video stream instead of a file, based on the segment-anything-model-2-real-time repository by @Gy920.
My desired outcome for the implementation would have the following workflow:

1. Create a new `inference_state` upon new object detection/prompt in the stream.
2. Call `add_new_point_or_box` (or `add_new_prompts`) for the new inference state. (So for each new object in the video, there is an inference state and corresponding output logit(s).)
3. Run the `track` function, which invokes `propagate_in_video_preflight` and `track_step`.

Here, my understanding is that the downstream code in the `track` function needs to run for each of the `inference_state`s currently existing, and this leads to some uncertainties - I would greatly appreciate it if I could get a second pair of eyes for insight.
a. It would not make sense to run `_get_feature` for each `inference_state` since they all share the frame, but to get features for all of the `inference_state`s we would need to set the `batch_size` value going into `_get_feature` as an arg to match the total number of objects - how should this mismatch be approached? I'm not sure if treating the `batch_size` as the total number of objects when each inference state is treated independently will lead to downstream issues.
b. Once `track_step` is invoked on the current features, we get a `current_out` output with corresponding mask(s). But since my implementation is running this for each `inference_state`, I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the `pred_masks` together, or is there a specific operation for this?
I know the question is long-winded. Thank you for the time and help!
Hi @bhoo-git
As a follow-up, I think it would be simplest to take the image encoding step outside of the run_single_frame_inference function and then add a loop over a 'list/dictionary of inference states' inside of the existing frame loop. Ideally the loop would be fully outside the video predictor class even (but that might require some additional modifications). So having something like:
```python
# Assuming this is available/managed somewhere else
inference_state_per_objid = {0: ..., 1: ...}

# Original frame loop
for frame_idx in tqdm(processing_order, desc="propagate in video"):

    # Get shared image features, used by all objects for this frame
    # -> might want to take 'inference_state["images"]' data outside
    #    of inference states, since it should also be shared and
    #    would make this step easier to implement
    shared_image_features = self._get_image_feature(...)

    # On each frame, perform update for every 'object' (i.e. inference state) independently
    results_by_objid = {}
    for objid, inference_state in inference_state_per_objid.items():
        output_dict = inference_state["output_dict"]
        # do the same thing as original frame loop for each inference state,
        # except use the shared image features instead of re-computing
        results_by_objid[objid] = video_res_masks

    # Similar to original output if needed
    obj_ids = results_by_objid.keys()
    yield frame_idx, obj_ids, results_by_objid
```
Getting this working with dynamically created inference states will be the hard part I think, since you may have to 'fake' old data in newly created states for the existing code to work properly (or otherwise update the parts that try to access old data so they don't crash if it's missing)
> we would need to set the batch_size value going into the _get_feature as an arg to match the total number of objects
It's probably simpler to start by just running each inference state separately in a loop like above (so each inference state uses a batch size of 1). It might be possible to use batching across all the objects, but it would be extremely complex to also batch the prompts/memory steps.
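The "encode once, then loop over states with batch size 1" idea can be sketched like this. The encoder below is a counting stand-in, not SAM 2's real `_get_image_feature`:

```python
import numpy as np

call_count = 0

def get_image_feature(frame):
    """Stand-in for the (expensive) image encoder; counts invocations."""
    global call_count
    call_count += 1
    return frame.mean(axis=0)  # fake (H, W) feature map

frame = np.random.rand(3, 32, 32)

# Encode the frame once, then reuse the cached features for every
# per-object pass; each "inference state" effectively runs with batch size 1.
shared_features = get_image_feature(frame)
results_by_objid = {objid: (shared_features > 0.5) for objid in (7, 12, 99)}

assert call_count == 1  # one encoder pass despite three objects
```

This sidesteps the `batch_size` mismatch entirely: the batch dimension never needs to know how many objects exist, only the outer loop does.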
> I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the pred_masks together or is there a specific operation for this?
If you want to combine everything into a single mask that's like a logit, taking the max across all masks is probably best (i.e. the per-pixel max across all object masks). Otherwise just adding the raw prediction values could cause positive/negative regions from different objects to cancel in weird ways. Alternatively you could take the 'bitwise OR' of all thresholded masks if you don't need it to be logit-like, and just want the final (binary) result.
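Both combination strategies above can be sketched with NumPy, using tiny 2x2 "masks" for two objects so the difference with naive addition is visible:

```python
import numpy as np

# Raw per-object prediction logits for one frame: (num_objects, H, W).
mask_per_obj = np.array([
    [[-5.0,  3.0], [ 2.0, -4.0]],   # object A
    [[ 4.0, -6.0], [-1.0, -2.0]],   # object B
])

# Logit-like combination: per-pixel max across objects. A strongly
# negative value from one object cannot cancel a positive from another
# (whereas summing would give -5 + 4 = -1 at the top-left pixel).
combined_logits = mask_per_obj.max(axis=0)

# Binary combination: threshold each object at 0, then OR across objects.
combined_binary = (mask_per_obj > 0).any(axis=0)

print(combined_logits)   # [[ 4.  3.] [ 2. -2.]]
print(combined_binary)   # [[ True  True] [ True False]]
```

Here `np.any` over the object axis plays the role of the bitwise OR of thresholded masks.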
Hi, I noticed that new IDs cannot be added during inference on videos: https://github.com/facebookresearch/segment-anything-2/blob/6186d1529a9c26f7b6e658f3e704d4bee386d9ba/sam2/sam2_video_predictor.py#L153C19-L153C43
What is the reason for this limitation? How do you suggest overcoming this issue if the IDs are not known ahead of time?