How can we perform segmentation in real-time?

facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Apache License 2.0


How can we perform segmentation in real-time? #60

Open CURRY-AND-RICE opened 2 months ago

CURRY-AND-RICE commented 2 months ago

Currently, I think we can only input video as separate frames stored in a directory. However, for online applications, we need to be able to input frames sequentially as they arrive. Are there any existing solutions that make this possible? Additionally, are there plans to add such functionality in the future?

Thank you for the amazing work!

rolson24 commented 2 months ago

I opened a PR that can run directly on a video file without extracting and loading all of the frames into memory at once, but it doesn't support a video stream. It would most likely require a large refactor of this repository's codebase to support a video stream, but I know Hugging Face are working to add the model to transformers, which may be able to support running on a stream.
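To illustrate the general idea of reading a video without holding every frame in memory, here is a minimal sketch. It is not the code from the PR mentioned above and does not use any SAM 2 API. `FrameSource`, `iter_frames`, and `FakeCapture` are hypothetical names introduced for illustration; the only assumption about the decoder is an OpenCV-style `read()` method that returns a `(success, frame)` pair, which is the shape `cv2.VideoCapture.read()` has.

```python
from typing import Any, Iterator, Protocol, Tuple


class FrameSource(Protocol):
    """Hypothetical interface: anything with an OpenCV-style read() method."""

    def read(self) -> Tuple[bool, Any]: ...


def iter_frames(source: FrameSource) -> Iterator[Tuple[int, Any]]:
    """Yield (index, frame) pairs one at a time until the source is exhausted.

    Only the current frame is held by this generator, so memory use does not
    grow with the length of the video.
    """
    index = 0
    while True:
        ok, frame = source.read()
        if not ok:  # end of file, or the source stopped delivering frames
            break
        yield index, frame
        index += 1


class FakeCapture:
    """Stand-in source for demonstration; replace with a real decoder."""

    def __init__(self, frames):
        self._frames = iter(frames)

    def read(self):
        try:
            return True, next(self._frames)
        except StopIteration:
            return False, None


if __name__ == "__main__":
    for idx, frame in iter_frames(FakeCapture(["frame-a", "frame-b"])):
        print(idx, frame)
```

Because the generator pulls frames on demand, the same loop would in principle work for a file or a live source, as long as the downstream model can accept one frame at a time.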

CURRY-AND-RICE commented 2 months ago

Thank you for letting me know! I found an issue on Hugging Face for adding SAMv2, which is currently in progress. I will continue to explore ways to achieve stream inference and will keep this issue open.

Joao-Pimenta commented 2 months ago

@CURRY-AND-RICE Did you find a good implementation?

CURRY-AND-RICE commented 1 month ago

@Joao-Pimenta I haven't been able to find an implementation that matches my needs. Maybe this will help: https://github.com/facebookresearch/segment-anything-2/issues/90

heyoeyo commented 1 month ago

> Are there any existing solutions to facilitate this?

I have a basic example script that runs off videos (should work with webcams even), though it's not finalized and may be missing some features compared to the original video prediction implementation.

Edit: There's also now a UI version, which can also work with a webcam.
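For readers wondering what a per-frame (online) loop might look like in outline, here is a rough sketch. It is not the script or UI referenced above and makes no claim about how SAM 2 works internally. `stream_segment` and `step` are hypothetical names: `step` stands in for whatever model call turns one frame plus some recent context into a mask, and `memory_size=8` is an arbitrary illustrative default.

```python
from collections import deque
from typing import Any, Callable, Deque, Iterable, Iterator, List, Tuple

# Hypothetical per-frame model call: (frame, recent_context) -> (mask, context_entry)
StepFn = Callable[[Any, List[Any]], Tuple[Any, Any]]


def stream_segment(
    frames: Iterable[Any],
    step: StepFn,
    memory_size: int = 8,  # arbitrary choice for this sketch
) -> Iterator[Any]:
    """Process frames as they arrive, keeping only a bounded rolling context."""
    memory: Deque[Any] = deque(maxlen=memory_size)
    for frame in frames:
        # The step function sees the new frame and a snapshot of recent context.
        mask, entry = step(frame, list(memory))
        # Appending to a full deque silently drops the oldest entry,
        # so memory use stays constant however long the stream runs.
        memory.append(entry)
        yield mask


if __name__ == "__main__":
    # Toy step: the "mask" is the frame doubled, the context entry is the frame.
    def toy_step(frame, memory):
        return frame * 2, frame

    for mask in stream_segment([1, 2, 3], toy_step, memory_size=2):
        print(mask)
```

The bounded `deque` is the key design choice here: a directory-based workflow can look at all frames at once, whereas a stream has no known end, so whatever context the model keeps has to be capped.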