facebookresearch / segment-anything-2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Is it possible to use SAM2 for real-time video applications? #90

Open giulio333 opened 1 month ago

giulio333 commented 1 month ago

I am interested in using SAM2 for real-time video processing applications. Could you please let me know if this is possible with the current implementation?

thad75 commented 1 month ago

I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ FPS). Make sure flash-attention is installed, though.
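For anyone who wants to reproduce this, a minimal timing sketch with the image predictor looks roughly like the following (the checkpoint and config names are the tiny-model defaults from this repo; adjust the paths to your install, and note that SAM2's attention goes through PyTorch's scaled_dot_product_attention, which is where a flash-attention kernel gets picked up when available):

import time
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_t.yaml", "./checkpoints/sam2_hiera_tiny.pt", device="cuda")
)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a camera frame
point = np.array([[640, 360]])                    # one positive click in the center
label = np.array([1])

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    start = time.time()
    for _ in range(100):
        predictor.set_image(frame)                # image encoder runs once per frame
        masks, scores, _ = predictor.predict(point_coords=point, point_labels=label)
    print(f"~{100 / (time.time() - start):.1f} frames/s")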

giulio333 commented 1 month ago

Thanks, what is flash-attention?

rishabh297 commented 1 month ago

I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ FPS). Make sure flash-attention is installed, though.

Hey @thad75, can you please share how you were able to build SAM2 into a real-time application? Is there a GitHub repo you have for this project?

ipmLessing commented 1 month ago

I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ FPS). Make sure flash-attention is installed, though.

Are you using the video predictor or just the single-image predictor? I'm also looking for a way to run SAM2 in true real time with small input samples and track the objects I want in a live video (image) stream.

Gy920 commented 1 month ago

I have implemented a demo of building SAM2 on a real-time application, which can be found at https://github.com/Gy920/segment-anything-2-real-time. However, there are still some issues (such as only being able to set the object to be segmented in the first frame).

I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ FPS). Make sure flash-attention is installed, though.

Are you using the video predictor or just the single-image predictor? I'm also looking for a way to run SAM2 in true real time with small input samples and track the objects I want in a live video (image) stream.

pingchesu commented 1 month ago

@Gy920 Thanks for implementing Sam2CameraPredictor. I have a question: I want to dynamically add object points in each frame and have Sam2CameraPredictor track each object I add. Can Sam2CameraPredictor meet this requirement?

Gy920 commented 1 month ago

@Gy920 Thanks for implementing Sam2CameraPredictor. I have a question: I want to dynamically add object points in each frame and have Sam2CameraPredictor track each object I add. Can Sam2CameraPredictor meet this requirement?

You need to reset Sam2CameraPredictor when adding new objects, and then re-add your object points in that frame. This may cause some performance overhead.
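A rough sketch of that flow, assuming a webcam source (load_first_frame and track are the method names mentioned later in this thread; the reset and prompt-adding steps are left as comments because their exact names depend on the implementation):

import cv2

def run_camera_loop(predictor, add_initial_prompts):
    # predictor is assumed to expose load_first_frame(frame) and track(frame);
    # add_initial_prompts(predictor) is a caller-supplied hook that registers
    # the point prompts for the objects of interest.
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    predictor.load_first_frame(frame)   # initialize the tracking state on this frame
    add_initial_prompts(predictor)

    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        out = predictor.track(frame)    # propagate the tracked masks to this frame
        # To add a NEW object mid-stream with the current implementation:
        # reset the predictor's tracking state, call load_first_frame(frame)
        # on the current frame, then re-add the prompts for all objects.
    cap.release()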

sachinsshetty commented 1 month ago

https://github.com/IDEA-Research/Grounded-SAM-2 can be extended for real-time use.

https://github.com/IDEA-Research/Grounded-SAM-2/blob/main/grounded_sam2_tracking_demo.py

Really cool work by @IDEA-Research

I'm running some experiments to use it for real-time video at

https://github.com/sachinsshetty/segment-anything-2/tree/experiments

It's a WIP.

mano3-1 commented 1 month ago

You need to reset Sam2CameraPredictor when adding new objects, and then re-add your object points in that frame. This may cause some performance overhead.

Thanks for the implementation @Gy920. I've been working on tracking objects using your code, but I noticed a drop in performance: with SAM's video predictor notebooks the segmentations are accurate, but when I compare them with your camera predictor the quality drops. I would like to know whether there is a logic change between the original SAM implementation and yours. I'm attaching the videos I obtained from the video predictor notebook and from the camera predictor.

Video Predictor: here

Camera predictor: here

As you can see, the results from the video predictor are more consistent than those from the camera predictor. Note: the prompt points are the same for both videos.

Gy920 commented 1 month ago

You need to reset Sam2CameraPredictor when adding new objects, and then re-add your object points in that frame. This may cause some performance overhead.

Thanks for the implementation @Gy920. I've been working on tracking objects using your code, but I noticed a drop in performance: with SAM's video predictor notebooks the segmentations are accurate, but when I compare them with your camera predictor the quality drops. I would like to know whether there is a logic change between the original SAM implementation and yours. I'm attaching the videos I obtained from the video predictor notebook and from the camera predictor.

Video Predictor: here

Camera predictor: here

As you can see, the results from the video predictor are more consistent than those from the camera predictor. Note: the prompt points are the same for both videos.

I noticed the colors are different between the two videos (R and B channels are swapped), as mentioned in this issue. The unsatisfactory result might be caused by this bug, which I'll fix later.
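In the meantime, converting the frames before they reach the predictor should work around the swapped channels (OpenCV captures in BGR while the model expects RGB); a minimal sketch:

import cv2

cap = cv2.VideoCapture(0)
ok, frame_bgr = cap.read()
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # swap the B and R channels
# pass frame_rgb (not frame_bgr) to load_first_frame / track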

mano3-1 commented 1 month ago

I converted the frames from BGR to RGB before feeding them to predictor.load_first_frame or predictor.track. The results are still the same.

Gy920 commented 1 month ago

I converted the frames from BGR to RGB before feeding them to predictor.load_first_frame or predictor.track. The results are still the same.

I suspect it might be related to the conditions in this part. I simplified the logic in this part during implementation, which could potentially lead to a decrease in tracking performance.

I'm a bit swamped at the moment, but I'll carve out some time to dig into this issue and get it sorted as soon as possible.

lihaowei1999 commented 1 month ago

@Gy920 Hi. I think I found the problem... I don't know if it is the same on your side, but it's definitely worth a try. Add

import cv2
# Round-trip the frame through a JPEG file before it reaches the predictor.
cv2.imwrite("Temp/temp.jpg", img)
img = cv2.imread("Temp/temp.jpg")

to the front of perpare_data, and the segmentation will be much more stable. Or simply apply:

# In-memory JPEG encode/decode round trip (requires numpy imported as np).
img = cv2.imdecode(np.frombuffer(cv2.imencode('.jpg', img)[1], np.uint8), cv2.IMREAD_COLOR)

which has the same effect.

I think SAM2 is sensitive to the noise introduced by JPEG compression. If you switch to PNG, the results will not be as good.

jinxianwei commented 1 month ago

You need to reset Sam2CameraPredictor when adding new objects, and then re-add your object points in that frame. This may cause some performance overhead.

Thanks for the implementation @Gy920. I've been working on tracking objects using your code, but I noticed a drop in performance: with SAM's video predictor notebooks the segmentations are accurate, but when I compare them with your camera predictor the quality drops. I would like to know whether there is a logic change between the original SAM implementation and yours. I'm attaching the videos I obtained from the video predictor notebook and from the camera predictor. Video Predictor: here Camera predictor: here As you can see, the results from the video predictor are more consistent than those from the camera predictor. Note: the prompt points are the same for both videos.

I noticed the colors are different between the two videos (R and B channels are swapped), as mentioned in this issue. The unsatisfactory result might be caused by this bug, which I'll fix later.

I think the issue results from self.condition_state['output_dict'] not saving the results during inference. I changed this and found the results to be better. All my changes are shown in the attached screenshots.
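In words, the change makes sure each tracked frame's output is written back into the state so later frames can condition on it. A standalone sketch of that bookkeeping (the key names follow the upstream SAM2 video predictor; the actual edits are in the attached screenshots):

condition_state = {
    "output_dict": {
        "cond_frame_outputs": {},      # frame_idx -> output of prompted frames
        "non_cond_frame_outputs": {},  # frame_idx -> output of tracked frames
    }
}

def save_frame_output(state, frame_idx, current_out, is_cond_frame):
    # Store this frame's output so that later frames can condition on it.
    key = "cond_frame_outputs" if is_cond_frame else "non_cond_frame_outputs"
    state["output_dict"][key][frame_idx] = current_out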

mano3-1 commented 1 month ago

Thanks @jinxianwei, your fix worked. The results are consistent now.

jellevhb commented 1 month ago

Thanks for the input everyone. I've tested all suggestions separately and in combination, and while the tracking consistency improved, the results are definitely not on par with the original video predictor yet. They are noisier and sometimes just disappear after a while.

I've also made some changes in the meantime which gave minor improvements, but still not as good as the original.

  1. Use bicubic interpolation for resizing (see the sketch after this list)

  2. Reset the frame_idx together with the rest of the state
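For change 1, the resize swap looks roughly like this (a sketch assuming torchvision-style preprocessing to the 1024x1024 model input; where exactly the resize happens differs per fork):

import torch
from torchvision.transforms import InterpolationMode, Resize

# The default preprocessing resizes bilinearly; switching to bicubic is a one-line change.
resize_bicubic = Resize((1024, 1024), interpolation=InterpolationMode.BICUBIC, antialias=True)

frame = torch.rand(3, 720, 1280)   # stand-in for a normalized camera frame (C, H, W)
resized = resize_bicubic(frame)    # tensor of shape (3, 1024, 1024)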

@Gy920 Can you elaborate on the simplifications you did? Any further tips and/or solutions are greatly appreciated.

patrick-tssn commented 1 month ago

Hi all,

I've implemented a Grounded SAM2 for real-time video, inspired by @Gy920 and the @Grounded-SAM-2 project. This implementation supports natural language queries as input. Despite some latency, I hope it will be beneficial to the community.

https://github.com/patrick-tssn/Streaming-Grounded-SAM-2

devinli123 commented 1 month ago

I have implemented a demo of building SAM2 on a real-time application, which can be found at https://github.com/Gy920/segment-anything-2-real-time. However, there are still some issues (such as only being able to set the object to be segmented in the first frame).

I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ FPS). Make sure flash-attention is installed, though.

Are you using the video predictor or just the single-image predictor? I'm also looking for a way to run SAM2 in true real time with small input samples and track the objects I want in a live video (image) stream.

Thanks! I wonder if it is possible to make some improvements based on your work so that, for example, point prompts can be added during the live video (not only in the first frame). The point prompts change over the course of the stream.

Gy920 commented 4 weeks ago

Thanks for the input everyone. I've tested all suggestions separately and in combination, and while the tracking consistency improved, the results are definitely not on par with the original video predictor yet. They are noisier and sometimes just disappear after a while.

I've also made some changes in the meantime which gave minor improvements, but still not as good as the original.

  1. Use bicubic interpolation for resizing
  2. Reset the frame_idx together with the rest of the state

@Gy920 Can you elaborate on the simplifications you did? Any further tips and/or solutions are greatly appreciated.

Sorry for the late reply. I reviewed my simplified logic again and found that it wouldn't cause any impact on single object tracking.

The main reason for the issue, as @jinxianwei mentioned, is that self.condition_state['output_dict'] was not being saved, which would cause problems in this part.

Previously, I didn't save self.condition_state['output_dict'] because I noticed it caused a continuous increase in GPU memory after adding it. I re-added it and adjusted some of the logic, and now it doesn't cause a continuous increase in GPU memory.

I'm still trying to track down other performance-killing bugs, but I'm a bit swamped right now. If you spot any, feel free to send a PR my way!

Gy920 commented 4 weeks ago

I have implemented a demo of building SAM2 on a real-time application, which can be found at https://github.com/Gy920/segment-anything-2-real-time. However, there are still some issues (such as only being able to set the object to be segmented in the first frame).

I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ FPS). Make sure flash-attention is installed, though.

Are you using the video predictor or just the single-image predictor? I'm also looking for a way to run SAM2 in true real time with small input samples and track the objects I want in a live video (image) stream.

Thanks! I wonder if it is possible to make some improvements based on your work so that, for example, point prompts can be added during the live video (not only in the first frame). The point prompts change over the course of the stream.

@devinli123 You can continuously read the live video, and then add point prompts whenever you need them (instead of adding them all in the first frame).

jellevhb commented 4 weeks ago

Sorry for the late reply. I reviewed my simplified logic again and found that it wouldn't cause any impact on single object tracking.

The main reason for the issue, as @jinxianwei mentioned, is that self.condition_state['output_dict'] was not being saved, which would cause problems in this part.

Previously, I didn't save self.condition_state['output_dict'] because I noticed it caused a continuous increase in GPU memory after adding it. I re-added it and adjusted some of the logic, and now it doesn't cause a continuous increase in GPU memory.

I'm still trying to track down other performance-killing bugs, but I'm a bit swamped right now. If you spot any, feel free to send a PR my way!

No worries, thanks again for your time and nice work. Saving the 'output_dict' and using the bicubic interpolation already results in a big improvement. I'm also not sure where the rest of the discrepancies come from. It might be the jpeg encoding, as @lihaowei1999 mentioned before, although adding their code doesn't bring the results to the same level as the original yet. I will try to do some further tests when I find the time.

bhoo-git commented 1 week ago

Hello,

Thanks @Gy920 for the great project. I'm looking to use the camera predictor in a way that lets a user dynamically add a bounding box while the video is streaming. With that, I have a couple of questions:

To prevent OOM issues, would it make sense to add logic that discards some of the values in self.condition_state?

Also, with regards to the comment about adding prompts dynamically:

@devinli123 You can continuously read the live video, and then add point prompts whenever you need them (instead of adding them all in the first frame).

I'm uncertain how the memory propagation would work in this case. Does anybody have an example of adding a prompt in the middle of a playing video?
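For the first question, the kind of pruning I have in mind would be roughly this (just a sketch; the key names follow the upstream SAM2 video predictor state, and max_history is a made-up parameter):

def prune_old_outputs(condition_state, current_frame_idx, max_history=16):
    # Drop cached outputs of non-prompted frames older than max_history frames.
    # Assumes condition_state["output_dict"]["non_cond_frame_outputs"] is a dict
    # keyed by frame index, as in the upstream SAM2 video predictor; prompted
    # (conditioning) frames are kept untouched.
    non_cond = condition_state["output_dict"]["non_cond_frame_outputs"]
    stale = [idx for idx in non_cond if idx < current_frame_idx - max_history]
    for idx in stale:
        del non_cond[idx]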