facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Plans to update model weights. #363

Open GeorgiaA opened 1 month ago

GeorgiaA commented 1 month ago

Hello,

Are there any plans to update the model weights for the SAM2 models in this repo to match the weights used in the online demo?

I spoke with someone from Meta last week at ECCV 2024 who said the SAM 2 model used in the online demo is trained on the Meta dataset plus many open-source datasets, whereas the model released on GitHub is trained only on the Meta dataset.

I have found that the online demo works well at segmenting some underwater data that I'm using. However, when using the notebooks provided in the repo, the SAM 2 model cannot segment the objects of interest. I am interested in zero-shot segmentation approaches.

Thanks.

chayryali commented 1 month ago

Hi @GeorgiaA, there are no plans to release models trained on other data (e.g. academic datasets), so that the released checkpoints remain compliant with the Apache 2.0 license.

Are you able to share any examples? What prompt and model size/version are you using?

GeorgiaA commented 1 month ago

Hi @chayryali ah ok that makes sense.

Here is an example image/video frame that I am working with: [Screenshot 2024-10-14 at 15 36 12]

I am using the SubPipe dataset. This is after the image has been preprocessed to make it clearer. I am interested in segmenting the dark grey part of the image (it is a subsea pipe).
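(As a side note for anyone reproducing this: a common way to make underwater frames clearer is CLAHE applied to the lightness channel, e.g. with OpenCV. The snippet below is just a generic sketch of that idea, not the exact preprocessing used on these images.)

```python
import cv2

# Generic underwater-frame enhancement sketch (illustrative only):
# apply CLAHE to the lightness channel so colours are left untouched.
bgr = cv2.imread("frame.png")  # hypothetical input frame
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
cv2.imwrite("frame_enhanced.png", enhanced)
```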

When I add a short video to the SAM 2 demo online and give it a couple of points, it does well at selecting the pipe in the image. [Screenshot 2024-10-14 at 15 36 57]

However, when I try the video_predictor_example.ipynb notebook supplied in this repo, using the sam2.1_hiera_l.yaml model config with the sam2.1_hiera_large.pt checkpoint weights, and pass in a couple of points, it is unsuccessful in segmenting the pipe. [Screenshot 2024-10-14 at 15 42 43]
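Concretely, the core of what I'm running follows the repo README's video-prediction example; the paths, frame index, and click coordinates below are placeholders rather than my exact values:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # the video is a directory of JPEG frames, as in the example notebook
    state = predictor.init_state(video_path="./subpipe_frames")

    # two positive clicks on the pipe in the first frame (placeholder coords)
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[420, 300], [520, 340]], dtype=np.float32),
        labels=np.array([1, 1], dtype=np.int32),  # 1 = positive click
    )

    # propagate the prompt through the rest of the video
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass  # collect / visualize masks per frame here
```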

I am looking for a zero-shot approach, both because I do not have the computational power required to fine-tune SAM or SAM 2, and because I want to build a pipeline that can segment many different underwater objects while I only have data for underwater pipes at the moment. Do you have any suggestions? It would be greatly appreciated!

heyoeyo commented 1 month ago

For what it's worth, the large model does seem to be able to segment the given image if a similar 2-point prompt is given:

[segment_example]

However, it appears as the last mask output and isn't necessarily the one ranked highest by the model's IoU prediction (especially with the v2.1 model, which predicts a significantly lower IoU here than v2). The mask can be cleaned up by re-running the same image repeatedly through the video-processing part of the model (see issue #352), and that also seems to give a mask result with a high stability score, which may be a way to help automate the selection of the mask (i.e. the 'good' mask ends up with a high stability score after repeated encoding of the image as a video; admittedly a somewhat odd processing pipeline).
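For anyone wanting to reproduce the check above: it's just the image predictor with multimask_output=True, then inspecting all returned candidates instead of only the top IoU-ranked one (the click coordinates are placeholders):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode():
    predictor.set_image(image)  # image: HxWx3 uint8 RGB numpy array
    masks, iou_scores, _ = predictor.predict(
        point_coords=np.array([[420, 300], [520, 340]], dtype=np.float32),
        point_labels=np.array([1, 1], dtype=np.int32),
        multimask_output=True,  # return all candidate masks, not just one
    )

# The pipe mask may be a lower-ranked candidate, so look at every output
# rather than trusting np.argmax(iou_scores).
for i, (mask, score) in enumerate(zip(masks, iou_scores)):
    print(f"mask {i}: predicted IoU {score:.3f}, area {mask.sum()} px")
```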

GeorgiaA commented 1 month ago

@heyoeyo Thanks for this, that is really useful.

I am trying to use the tiny model, as I want to implement this on a small device, e.g. an NVIDIA Jetson Orin (although I'm not sure how well this will perform). Since I want SAM 2 to be able to segment many underwater objects, I think fine-tuning it is going to be the way forward.

As a side note, has anyone tried fine-tuning SAM 2? I can get access to an NVIDIA A100 GPU with 80GB memory (as suggested in the training/fine-tuning README) through the cloud. However, my employer wants an estimated cost, and I do not know how long it will take to fine-tune a SAM 2 model; I would be focusing primarily on the tiny model. If anyone can share examples, it would be a big help for producing an approximate costing (I know it will depend on the size of the data and how many epochs it is trained for).
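My back-of-envelope for the costing, once I can measure a per-step time on the A100, would just be the following (every number here is a placeholder):

```python
# Rough fine-tuning cost estimate; all values are placeholders to be
# replaced with measured numbers for the tiny model and our dataset.
gpus = 1                    # A100 80GB instances
price_per_gpu_hour = 3.00   # USD/hour, varies by cloud provider
sec_per_step = 1.0          # measure over a few hundred warmed-up steps
steps_per_epoch = 5_000     # = len(dataset) / batch_size
epochs = 40

gpu_hours = gpus * epochs * steps_per_epoch * sec_per_step / 3600
print(f"~{gpu_hours:.0f} GPU-hours  ->  ~${gpu_hours * price_per_gpu_hour:.0f}")
```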