giulio333 opened 2 months ago
I have run SAM2 tiny and small on an A100 without any issue for real-time applications (25+ fps). Make sure flash-attention is installed, though.
Thanks, what is flash-attention?
Hey @thad75, can you please share how you were able to build SAM2 into a real-time application? Do you have a GitHub repo for this project?
Are you using the video predictor or just the single-image predictor? I'm also looking for a way to run SAM2 in true real time with a small input sample and track the objects I want in a live video (image) stream.
I have implemented a demo of building SAM2 on a real-time application, which can be found at https://github.com/Gy920/segment-anything-2-real-time. However, there are still some issues (such as only being able to set the object to be segmented in the first frame).
@Gy920 Thanks for implementing Sam2CameraPredictor. I have a question: I want to dynamically add object points in each frame and use Sam2CameraPredictor to track each object that I add. Can Sam2CameraPredictor meet this requirement?
You need to reset Sam2CameraPredictor when adding new objects, and then re-add your object points in that frame. This may cause some performance overhead.
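That reset-and-re-add flow can be sketched as follows. This is a minimal illustration using a stub class in place of Sam2CameraPredictor; the method names (reset, load_first_frame, add_new_prompt, track) are assumptions for illustration, not the exact API of the real-time repo:

```python
# Illustrative reset-and-re-add loop for a streaming predictor.
# StubPredictor stands in for Sam2CameraPredictor; the method names
# are hypothetical, not the real-time SAM2 repo's exact API.

class StubPredictor:
    def __init__(self):
        self.reset()

    def reset(self):
        # Drop all tracking state (prompts, memory, frame index).
        self.prompts = {}
        self.initialized = False

    def load_first_frame(self, frame):
        self.initialized = True

    def add_new_prompt(self, obj_id, points):
        self.prompts[obj_id] = points

    def track(self, frame):
        # Return one (fake) mask label per tracked object.
        return {obj_id: f"mask-{obj_id}" for obj_id in self.prompts}


def add_object_mid_stream(predictor, frame, obj_id, points):
    """When a new object appears, reset the predictor and re-add
    every prompt (old and new) on the current frame."""
    old_prompts = dict(predictor.prompts)
    predictor.reset()
    predictor.load_first_frame(frame)
    for oid, pts in old_prompts.items():
        predictor.add_new_prompt(oid, pts)
    predictor.add_new_prompt(obj_id, points)


predictor = StubPredictor()
predictor.load_first_frame("frame0")
predictor.add_new_prompt(1, [(10, 20)])
masks = predictor.track("frame1")            # tracks object 1 only

add_object_mid_stream(predictor, "frame2", 2, [(30, 40)])
masks = predictor.track("frame3")            # now tracks objects 1 and 2
```

The re-initialization on every new object is where the performance overhead mentioned above comes from.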
https://github.com/IDEA-Research/Grounded-SAM-2
It can be extended for real-time use.
https://github.com/IDEA-Research/Grounded-SAM-2/blob/main/grounded_sam2_tracking_demo.py
Really cool work by @IDEA-Research
I'm running some experiments to use it for real-time video at
https://github.com/sachinsshetty/segment-anything-2/tree/experiments
It's a WIP.
Thanks for the implementation @Gy920. I've been working on tracking objects using your code, but I noticed a drop in performance: when I use SAM's video predictor notebooks, the segmentations are accurate, but when I compare them with yours, there is a drop in quality. I would like to know if there is a logic change between the original SAM implementation and yours. Attaching the videos that I obtained from the video predictor notebook and camera SAM.
Video Predictor: here
Camera predictor: here
As you may see, the results from the video predictor are more consistent than the camera predictor's. Note: the prompt points are the same for both videos.
I noticed the colors are different between the two videos (R and B channels are swapped), as mentioned in this issue. The unsatisfactory result might be caused by this bug, which I'll fix later.
I have converted the frames from BGR to RGB before feeding them to predictor.load_first_frame or predictor.track. Still, the results are the same.
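For reference, the BGR-to-RGB conversion is just a channel reversal. A NumPy-only sketch, equivalent in effect to cv2.cvtColor(img, cv2.COLOR_BGR2RGB):

```python
import numpy as np

def bgr_to_rgb(img: np.ndarray) -> np.ndarray:
    """Swap the B and R channels of an HxWx3 image.
    Equivalent to cv2.cvtColor(img, cv2.COLOR_BGR2RGB)."""
    return img[:, :, ::-1].copy()  # copy() returns a contiguous array

# Tiny 1x2 BGR image: a pure-blue pixel and a pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
rgb = bgr_to_rgb(bgr)
```

If the colors still look swapped after this, the swap is probably happening somewhere downstream (e.g. when the masks are composited back onto BGR frames for display).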
I suspect it might be related to the conditions in this part. I simplified the logic in this part during implementation, which could potentially lead to a decrease in tracking performance.
I'm a bit swamped at the moment, but I'll carve out some time to dig into this issue and get it sorted as soon as possible.
@Gy920 Hi. I think I found the problem... I don't know if it is the same case on your side, but it's definitely worth a try. Add
cv2.imwrite("Temp/temp.jpg", img)
img = cv2.imread("Temp/temp.jpg")
to the front of prepare_data
and the segmentation will be much more stable.
Or simply apply:
img = cv2.imdecode(np.frombuffer(cv2.imencode('.jpg', img)[1], np.uint8), cv2.IMREAD_COLOR)
which does the same thing in memory.
I think SAM2 is sensitive to the noise generated by JPEG compression. If you change to PNG, the result will not be as good.
I think it results from self.condition_state['output_dict'] not saving the results during inference. I changed it and found the results better. All changes are as follows:
thanks @jinxianwei . Your fix has worked. The results are consistent now.
Thanks for the input everyone. I've tested all suggestions separately and in combination, and while the tracking consistency improved, the results are definitely not on par with the original video predictor yet. They are noisier and sometimes just disappear after a while.
I've also made some changes in the meantime which gave minor improvements, but still not as good as the original.
- Use bicubic interpolation for resizing
- Reset the frame_idx together with the rest of the state
@Gy920 Can you elaborate on the simplifications you did? Any further tips and/or solutions are greatly appreciated.
Hi all,
I've implemented a Grounded SAM2 for real-time video, inspired by @Gy920 and the Grounded-SAM-2 project. This implementation supports natural-language queries as input. Despite some latency, I hope it will be beneficial to the community.
Thanks! I wonder if it is possible to make some improvements based on your work, for example, to add point prompts during the live video (not just in the first frame). The point prompt changes through the stream.
Sorry for the late reply. I reviewed my simplified logic again and found that it wouldn't cause any impact on single object tracking.
The main reason for the issue, as @jinxianwei mentioned, is that self.condition_state['output_dict'] was not being saved, which would cause problems in this part.
Previously, I didn't save self.condition_state['output_dict'] because I noticed it caused a continuous increase in GPU memory after adding it. I re-added it and adjusted some of the logic, and now it doesn't cause a continuous increase in GPU memory.
I'm still trying to track down other performance-killing bugs, but I'm a bit swamped right now. If you spot any, feel free to send a PR my way!
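One general way to keep a per-frame output dictionary from growing GPU memory without bound is to cap it to a sliding window and evict the oldest entries. This is a hypothetical sketch of that idea, not the repo's actual fix; the class and field names are made up for illustration:

```python
from collections import OrderedDict

class BoundedFrameOutputs:
    """Keep per-frame outputs for at most `max_frames` recent frames,
    evicting the oldest entry first. A stand-in for capping the memory
    held by something like condition_state['output_dict']; not the
    real-time SAM2 repo's actual logic."""

    def __init__(self, max_frames: int = 8):
        self.max_frames = max_frames
        self._outputs = OrderedDict()

    def store(self, frame_idx: int, output):
        self._outputs[frame_idx] = output
        self._outputs.move_to_end(frame_idx)  # mark as most recent
        while len(self._outputs) > self.max_frames:
            # Evicting here is what would free the associated GPU tensors.
            self._outputs.popitem(last=False)

    def get(self, frame_idx: int):
        return self._outputs.get(frame_idx)

cache = BoundedFrameOutputs(max_frames=3)
for i in range(5):
    cache.store(i, f"masks-{i}")
# Only the outputs for frames 2, 3, and 4 remain.
```

In practice you would also want to keep conditioning frames (those with user prompts) pinned and only evict non-conditioning frames.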
@devinli123 You can continuously read the live video, and then add point prompts whenever you need them (instead of adding them all in the first frame).
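That streaming loop can be sketched like this. The predictor here is a trivial stub, and add_new_prompt / track are assumed names for illustration, not the exact API:

```python
class StubPredictor:
    """Trivial stand-in for a streaming SAM2 predictor."""
    def __init__(self):
        self.objects = {}

    def add_new_prompt(self, frame_idx, obj_id, points):
        self.objects[obj_id] = (frame_idx, points)

    def track(self, frame):
        # Return the ids of all objects being tracked on this frame.
        return sorted(self.objects)


def run_stream(frames, prompts_by_frame, predictor):
    """Read frames continuously; whenever the user supplies a prompt
    for the current frame, add it before tracking that frame."""
    history = []
    for idx, frame in enumerate(frames):
        for obj_id, points in prompts_by_frame.get(idx, []):
            predictor.add_new_prompt(idx, obj_id, points)
        history.append(predictor.track(frame))
    return history


# Object 1 prompted on frame 0; object 2 added mid-stream on frame 2.
history = run_stream(
    frames=["f0", "f1", "f2", "f3"],
    prompts_by_frame={0: [(1, [(5, 5)])], 2: [(2, [(9, 9)])]},
    predictor=StubPredictor(),
)
# history → [[1], [1], [1, 2], [1, 2]]
```

With a real webcam you would replace the frames list with a capture loop and feed prompts in from the UI thread as the user clicks.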
No worries, thanks again for your time and nice work. Saving the 'output_dict' and using the bicubic interpolation already results in a big improvement. I'm also not sure where the rest of the discrepancies come from. It might be the jpeg encoding, as @lihaowei1999 mentioned before, although adding their code doesn't bring the results to the same level as the original yet. I will try to do some further tests when I find the time.
Hello,
Thanks @Gy920 for the great project. I'm looking to use the camera predictor such that a user can dynamically add a bounding box while the video is streaming. With that, I have a couple of questions:
To prevent OOM issues, would it make sense to have logic where some of the values in self.condition_state are discarded?
Also, with regards to the comment about adding prompts dynamically:
@devinli123 You can continuously read the live video, and then add point prompts whenever you need them (instead of adding them all in the first frame).
I'm uncertain how the memory propagation would work in this case. Does anybody have an example of adding a prompt in the middle of video playback?
I am interested in using SAM2 for real-time video processing applications. Could you please let me know if this is possible with the current implementation?