RotsteinNoam opened this issue 2 months ago
Hello RotsteinNoam,
I believe this occurs primarily for logistical simplicity. If you review the code, you'll notice that the `tracking_has_started` status changes to True when you invoke `propagate_in_video`. It then reverts to False upon executing `reset_state`. Throughout this process, the `video_predictor` attempts to identify all prompts and returns `obj_ids` and `video_res_masks`, ensuring that the length of `video_res_masks` matches the number of `obj_ids`.
Additionally, if the `video_predictor` were to allow users to add IDs or prompts during the prediction phase, it would significantly complicate the logistics and potentially lead to conflicts during data merging. This constraint helps maintain the integrity and consistency of the tracking process.
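The flag behaviour described above can be sketched roughly like this. This is a simplified stand-in, not the actual SAM2 code: the class name and method bodies are illustrative, though the method and flag names follow the discussion.

```python
# Minimal sketch (NOT the real SAM2 implementation) of the
# tracking_has_started behaviour described above.

class VideoPredictorSketch:
    def __init__(self):
        self.tracking_has_started = False
        self.obj_ids = []

    def add_new_points(self, obj_id):
        # New object IDs are rejected once propagation has begun
        if self.tracking_has_started:
            raise RuntimeError(
                "Cannot add new object id after tracking starts; call reset_state() first"
            )
        self.obj_ids.append(obj_id)

    def propagate_in_video(self):
        # Flag flips to True here, locking the set of object IDs.
        # The real function yields (frame_idx, obj_ids, video_res_masks)
        # per frame, with len(video_res_masks) == len(obj_ids).
        self.tracking_has_started = True

    def reset_state(self):
        # Flag reverts to False, allowing new prompts again
        self.tracking_has_started = False
        self.obj_ids.clear()
```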
Susan
Hello @ShuoShenDe,
Thank you for the clarification on why adding new IDs during inference is limited. I understand the challenges related to maintaining the integrity and consistency of the tracking process.
However, I am dealing with scenarios where objects in the video may come and go, and it's not feasible to know all object IDs ahead of time. Could you suggest an approach or modification to the existing codebase that could handle such dynamic changes in object IDs during video processing? Thanks!!!
Hi zly,
I recommend you read my other repo, which includes a function for continuously adding new objects. See grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py or grounded_sam2_tracking_demo_with_continuous_id
Susan
> What is the reason for this limitation? ... overcoming this issue if the IDs are not known ahead of time
The video segmentation example seems like it's built for use in an interactive environment, not for automation.
If you wanted to add new object IDs as they appear, I think the best way would be to start a new `inference_state` (see the video notebook example) for each unique object, and just store each ID as 0 inside the inference state. Then you can manually manage the separate inference states as if they're separate objects. From a performance/optimization standpoint, you'd also want to be careful not to duplicate the images data inside the inference state (i.e. instead have each separate state share a single copy of this data).
However, with doing this, you still need to run the `propagate_in_video` function for each separate object (inference state) as they appear, which could be quite heavy. There doesn't seem to be a way around this as the frame loop is inside the sam2_video_predictor, so you'd have to extract it to be able to process dynamically appearing objects on a shared video loop.
I have a re-worked video example that keeps the loop outside the model, and might be easier to work with (each object is represented by 4 lists separate from the model state), though it's far from finalized, so only recommended for experimenting atm.
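The separate-states idea above can be sketched as follows. The dict layout and helper names here are hypothetical stand-ins, not the real SAM2 inference state; the point is that each per-object state references (rather than copies) the shared frame data.

```python
# Hypothetical sketch of 'one inference state per object' with shared frames.
# The dict keys below are illustrative, not the actual SAM2 state layout.

shared_images = ["frame_0", "frame_1", "frame_2"]  # stand-in for loaded video frames

states_by_objid = {}

def add_object_state(objid, images):
    # Each state keeps its own tracking data, but references (does not copy)
    # the shared frame data, avoiding per-object duplication of the video
    states_by_objid[objid] = {
        "images": images,    # shared reference, not a copy
        "obj_ids": [0],      # each state tracks its single object as ID 0
        "output_dict": {},   # per-object tracking results accumulate here
    }

# Objects can be registered as they appear in the video
add_object_state(7, shared_images)
add_object_state(9, shared_images)
```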
I'm actually looking to implement a similar approach to what @heyoeyo mentioned above for a video stream instead of a file, based on the segment-anything-model-2-real-time repository by @Gy920.
My desired outcome for the implementation would have the following workflow:
1. Create a new `inference_state` upon new object detection/prompt in the stream.
2. Call `add_new_point_or_box` (or `add_new_prompts`) for the new inference state. (So for each new object in the video, there is an inference state and corresponding output logit(s)).
3. Call the `track` function, which invokes `propagate_in_video_preflight` and `track_step`.

Here, my understanding is that the downstream code in the `track` function needs to run for each of the currently existing `inference_state`s, and this leads to some uncertainties - I would greatly appreciate it if I could get a second pair of eyes for insight.
a. It would not make sense to run `_get_feature` for each `inference_state` since they all share the frame, but to get features for all of the `inference_state`s we would need to set the `batch_size` value going into `_get_feature` as an arg to match the total number of objects - how should this mismatch be approached? I'm not sure if treating the `batch_size` as the total number of objects when each inference state is treated independently will lead to downstream issues.
b. Once `track_step` is invoked on the current features, we get a `current_out` output with corresponding mask(s). But since my implementation is running this for each `inference_state`, I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the `pred_masks` together, or is there a specific operation for this?
I know the question is long-winded. Thank you for the time and help!
Hi @bhoo-git
As a follow-up, I think it would be simplest to take the image encoding step outside of the run_single_frame_inference function and then add a loop over a 'list/dictionary of inference states' inside of the existing frame loop. Ideally the loop would be fully outside the video predictor class even (but that might require some additional modifications). So having something like:
```python
# Assuming this is available/managed somewhere else
inference_state_per_objid = {0: ..., 1: ...}

# Original frame loop
for frame_idx in tqdm(processing_order, desc="propagate in video"):

    # Get shared image features, used by all objects for this frame
    # -> might want to take 'inference_state["images"]' data outside
    #    of inference states, since it should also be shared and
    #    would make this step easier to implement
    shared_image_features = self._get_image_feature(...)

    # On each frame, perform update for every 'object' (i.e. inference state) independently
    results_by_objid = {}
    for objid, inference_state in inference_state_per_objid.items():
        output_dict = inference_state["output_dict"]
        # do the same thing as original frame loop for each inference state
        # except use the shared image features instead of re-computing
        results_by_objid[objid] = video_res_masks

    # Similar to original output if needed
    obj_ids = results_by_objid.keys()
    yield frame_idx, obj_ids, results_by_objid
```
Getting this working with dynamically created inference states will be the hard part I think, since you may have to 'fake' old data in newly created states for the existing code to work properly (or otherwise update the parts that try to access old data so they don't crash if it's missing)
> we would need to set the batch_size value going into the _get_feature as an arg to match the total number of objects
It's probably simpler to start by just running each inference state separately in a loop like above (so each inference state uses a batch size of 1). It might be possible to use batching across all the objects, but it would be extremely complex to also batch the prompts/memory steps.
> I need a way to combine all the masks and return a single logit that encompasses all objects for the particular frame - can this be done by just adding the pred_masks together or is there a specific operation for this?
If you want to combine everything into a single mask that's like a logit, taking the max across all masks is probably best (i.e. the per-pixel max across all object masks). Otherwise just adding the raw prediction values could cause positive/negative regions from different objects to cancel in weird ways. Alternatively you could take the 'bitwise OR' of all thresholded masks if you don't need it to be logit-like, and just want the final (binary) result.
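The two combination strategies can be sketched in a self-contained way, using plain nested lists as a stand-in for the real mask tensors (with real outputs you'd do the equivalent elementwise ops on tensors):

```python
# Sketch of combining per-object mask logits into one frame-level mask.
# Masks here are 2D lists of raw logit values (positive = object present).

def combine_logits_max(masks_by_objid):
    """Per-pixel max across all object logit masks (result stays logit-like)."""
    masks = list(masks_by_objid.values())
    h, w = len(masks[0]), len(masks[0][0])
    return [[max(m[y][x] for m in masks) for x in range(w)] for y in range(h)]

def combine_binary_or(masks_by_objid, threshold=0.0):
    """Bitwise OR of thresholded masks (final binary result only)."""
    masks = list(masks_by_objid.values())
    h, w = len(masks[0]), len(masks[0][0])
    return [[any(m[y][x] > threshold for m in masks) for x in range(w)]
            for y in range(h)]
```

The max is preferred for a logit-like result because simply summing raw logits could let a strong negative from one object cancel a weak positive from another.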
Has anyone found a solution to this? @heyoeyo @bhoo-git @zly7
> Has anyone found a solution to this?
I have a multi-object version of the script I mentioned in an earlier post that keeps track of objects separately (as well as an interactive version. Edit: Here's an explanation & video example for these scripts), but it's based on code that is more heavily modified than what was discussed above (e.g. it's not using copies of the inference state).
@ShuoShenDe is there any way to run your GroundedSAM2 code on videos in real-time (aka not having all of the video available at run time?)
@heyoeyo Thank you so much for the link! It works well. I'm curious if you have found a solution for including a semantic object detection network. If you include such a network and incorporate predicted objects from each frame as prompts to SAM2, then you will get many re-prompts of the same objects. Have you faced this and come up with a solution for it?
> I'm curious if you have found a solution for including a semantic object detection network
I haven't tried it, but in theory it could work as an automated prompt to track newly appearing objects. The idea would be to run the detector on each frame and whenever a high confidence detection is found that doesn't have significant overlap (e.g. IoU) with an existing SAM prediction, it could be assumed to be a new object. Then that detection would be used to generate a 'prompt encoding' to begin tracking with SAM.
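The overlap rule described above can be sketched in a self-contained way. Masks here are flat boolean lists as a stand-in for real binary masks, and the function names and threshold are illustrative assumptions:

```python
# Sketch of the 'new object if low IoU with existing predictions' rule.
# Masks are flat lists of booleans, standing in for real binary masks.

def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union > 0 else 0.0

def is_new_object(detection_mask, tracked_masks, iou_threshold=0.5):
    """True if the detection doesn't significantly overlap any tracked mask."""
    return all(mask_iou(detection_mask, m) < iou_threshold for m in tracked_masks)
```

A detection that passes `is_new_object` would then be used to generate a prompt encoding and begin tracking with SAM.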
> If you include such a network and incorporate predicted objects from each frame as prompts to SAM2, then you will get many re-prompts of the same objects
From what I've seen, the SAM model works quite well with only a single prompt encoding and then relying on the 'recent frame encodings' to keep track of objects that change appearance over time. Maybe every few seconds it's worth updating the prompt encoding based on a newer detection (assuming a high IoU with the SAM prediction), but I think it's probably best to use very few (or just 1) prompt encoding and replace them rather than accumulating lots of them.
@heyoeyo Thanks! Yes, I'd like to replace prompts rather than accumulating them, but transferring instance/object tracking IDs over from frame to frame if you have newer detections is a challenge I am facing with that approach. I am currently taking the first approach you mentioned, of trying to determine if the prompt is a repeat of an existing object that is being tracked.
One challenge with this is having to run the object propagation twice. Imagine this:
```
for frame in video:
    propagate object_ids through frame
    run semantic segmentation network and retrieve a prompt for each object detected
    compare new prompts with existing objects in current frame by pixel location
    add new objects to list of prompts
    propagate new object_ids through frame
```
Otherwise, you are comparing semantic detections on frame i against objects tracked through frame i-1.
> Yes, I'd like to replace prompts rather than accumulating them, but transferring instance/object tracking IDs over from frame to frame if you have newer detections is a challenge
If you're using the original code base, then there's a config parameter (max_cond_frames_in_attn) that can be set to 1 (it has to be added to the .yaml config file) which will have this sort of 'replace the prompt encoding' effect. It doesn't avoid accumulating the encodings internally, but may be simpler to use compared to manually replacing the encodings. (actually I think the setting has to be 2, because of an odd assert statement, unless you comment it out!)
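As a rough sketch, the setting would go in the model's .yaml config alongside the other model parameters. The exact file and nesting depend on which config variant is being used, so treat this as an assumption rather than a verified snippet:

```yaml
# In the model config (e.g. one of the sam2_hiera_*.yaml files),
# under the model section; 2 rather than 1 because of the assert noted above
model:
  max_cond_frames_in_attn: 2
```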
> One challenge with this is having to run the object propagation twice... Otherwise, you are comparing semantic detections on frame i against objects tracked through frame i-1.
That makes sense, I think updating the known objects is needed regardless, since you'd want the mask predictions anyways (and as you say, they're needed to compare everything on the same frame). So the main concern would be not duplicating work for the new objects. If you're using the original code, then the update for the known objects (roughly) corresponds to the `track_step` function, whereas the function needed for new objects is `add_new_points` (or similar). If it's possible to call these separately (and with `track_step` called before `add_new_points`), I think you can at least avoid duplicating any of the computation.
Depending on your use case, maybe it also makes sense to only run the semantic/new object stuff every other frame (or less) while running SAM on every frame? That could reduce the processing a lot in exchange for detecting objects 1 frame late sometimes (which could be acceptable?).
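The every-other-frame scheduling idea can be sketched with stand-in functions (the `run_*` names and the interval are illustrative; they just record which frames each stage ran on):

```python
# Toy sketch of running detection less often than tracking.
# run_sam / run_detector are stand-ins that record which frames they ran on.

DETECT_EVERY_N = 2
sam_frames, detect_frames = [], []

def run_sam(frame_idx):
    sam_frames.append(frame_idx)      # tracking update on every frame

def run_detector(frame_idx):
    detect_frames.append(frame_idx)   # new-object search on every Nth frame only

for frame_idx in range(6):
    run_sam(frame_idx)
    if frame_idx % DETECT_EVERY_N == 0:
        run_detector(frame_idx)
```

With this scheme a newly appearing object is picked up at most `DETECT_EVERY_N - 1` frames late, in exchange for roughly halving (or better) the detector cost.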
For those interested in seeing the results of this experiment - here is my fork of muggled_sam that works with Mask2Former to retrieve semantics. Clone Mask2Former in the same directory as you've cloned muggled_sam and it should work! Thanks for the helpful comments @heyoeyo
https://github.com/umfieldrobotics/muggled_sam
https://github.com/user-attachments/assets/354297f5-1394-47b5-ada0-4f03a1c65392
Hi, I noticed that new IDs cannot be added during inference on videos: https://github.com/facebookresearch/segment-anything-2/blob/6186d1529a9c26f7b6e658f3e704d4bee386d9ba/sam2/sam2_video_predictor.py#L153C19-L153C43
What is the reason for this limitation? How do you suggest overcoming this issue if the IDs are not known ahead of time?