facebookresearch / sam2

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Is it possible to produce masks on images without providing prompts? #438

Open eypros opened 6 days ago

eypros commented 6 days ago

I am trying to fine-tune (or, as an alternative, train from scratch) a SAM 2 model with a custom training and eval set.

For simplicity (and to check that my model can actually overfit and, thus, work as expected) I use the same training and evaluation set (in other words I am evaluating on my training set).

I am basically following the approach of fine-tune-train_segment_anything_2_in_60_lines_of_code.

I introduced a second dataloader for the evaluation (this part is not really elaborated in that example) and split the training into epochs (at each epoch I omit the samples that do not fill the final batch); a rough sketch of this evaluation loop is shown after the two code snippets below. After each epoch I run the evaluation code to get the IoU over the whole eval set. I noticed that the training pipeline:

        predictor.set_image_batch([image]*batch_size)
        mask_input, unnorm_coords, labels, unnorm_box = predictor._prep_prompts(input_point, input_label, box=None,
                                                                                mask_logits=None,
                                                                                normalize_coords=True)
        sparse_embeddings, dense_embeddings = predictor.model.sam_prompt_encoder(points=(unnorm_coords, labels),
                                                                                 boxes=None, masks=None, )

        high_res_features = [feat_level[-1].unsqueeze(0) for feat_level in predictor._features["high_res_feats"]]
        low_res_masks, prd_scores, _, _ = predictor.model.sam_mask_decoder(
            image_embeddings=predictor._features["image_embed"],
            image_pe=predictor.model.sam_prompt_encoder.get_dense_pe(), sparse_prompt_embeddings=sparse_embeddings,
            dense_prompt_embeddings=dense_embeddings, multimask_output=True, repeat_image=False,
            high_res_features=high_res_features, )
        # Upscale the masks to the original image resolution
        prd_masks = predictor._transforms.postprocess_masks(low_res_masks, predictor._orig_hw[-1])
        prd_mask = torch.sigmoid(prd_masks[:, 0])

and the code from the inference pipeline:

            predictor.set_image(image)
            masks, _, _ = predictor.predict(point_coords=None, box=None, multimask_output=False)

produced quite different results.
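For context, the evaluation side is roughly the following sketch (train_dataset, batch_size and the IoU threshold are placeholders, and the prediction step stands in for the training snippet above):

    import torch
    from torch.utils.data import DataLoader

    # Same dataset for training and evaluation; drop_last skips the samples
    # that do not fill the final batch, as described above.
    eval_loader = DataLoader(train_dataset, batch_size=batch_size,
                             shuffle=False, drop_last=True)

    def mask_iou(pred_prob, gt_mask, thresh=0.5):
        # IoU between a thresholded probability mask and a binary GT mask
        pred = pred_prob > thresh
        gt = gt_mask > 0.5
        inter = (pred & gt).flatten(1).sum(-1).float()
        union = (pred | gt).flatten(1).sum(-1).float()
        return inter / union.clamp(min=1)

    ious = []
    with torch.no_grad():
        for image, gt_mask, input_point, input_label in eval_loader:
            prd_mask = ...  # prompt-based prediction from the training snippet above
            ious.append(mask_iou(prd_mask, gt_mask))
    epoch_iou = torch.cat(ious).mean().item()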

a) I was wondering whether this difference is only due to the provided prompts (which are present in the training pipeline but not in the inference one), or whether there are other differences as well (in the architecture or code path, for example).

b) I noticed that the prediction is highly affected by the provided prompts. I need to apply SAM 2 to unprompted images, so I cannot provide any prompts. Does this mean that I cannot use SAM 2 without prompts and should resort to a different model instead?

catalys1 commented 16 hours ago

SAM 2 does require prompts. There is an automatic mask generator available (SAM2AutomaticMaskGenerator), which just places a grid of point prompts over the image and does some post-processing, if you want to look into that.
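A minimal sketch of that route, along the lines of the automatic_mask_generator example notebook in this repo (the config and checkpoint paths below are just examples and depend on which model you downloaded):

    import numpy as np
    from PIL import Image

    from sam2.build_sam import build_sam2
    from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

    # Example paths for a SAM 2.1 Hiera-Large checkpoint; adjust to your download.
    model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
    checkpoint = "./checkpoints/sam2.1_hiera_large.pt"

    sam2 = build_sam2(model_cfg, checkpoint, device="cuda")
    mask_generator = SAM2AutomaticMaskGenerator(sam2)  # prompts a grid of points internally

    image = np.array(Image.open("your_image.jpg").convert("RGB"))
    masks = mask_generator.generate(image)
    # each entry is a dict with "segmentation", "area", "bbox", "predicted_iou", ...

Note that this is still prompt-driven under the hood (a grid of point prompts plus filtering); there is no fully prompt-free forward path.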