The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0 · 12.35k stars · 1.14k forks
Is it possible to produce masks on images without providing prompts? #438
I am trying to fine-tune a SAM 2 model (or, alternatively, train it from scratch) with a custom training and eval set.
For simplicity (and to check that my model can actually overfit and, thus, work as expected) I use the same set for training and evaluation (in other words, I am evaluating on my training set). I am basically following the approach of fine-tune-train_segment_anything_2_in_60_lines_of_code.
I introduced a second dataloader for the evaluation (nothing elaborate) and split the training into epochs (at each epoch I drop the samples that do not fill the final batch). After each epoch I run the evaluation code to get the IoU over the whole eval set. I noticed that using the train pipeline:
predictor.set_image_batch([image] * batch_size)

# Encode the point prompts
mask_input, unnorm_coords, labels, unnorm_box = predictor._prep_prompts(
    input_point, input_label, box=None, mask_logits=None, normalize_coords=True
)
sparse_embeddings, dense_embeddings = predictor.model.sam_prompt_encoder(
    points=(unnorm_coords, labels), boxes=None, masks=None,
)

# Decode masks from the cached image embeddings
high_res_features = [feat_level[-1].unsqueeze(0) for feat_level in predictor._features["high_res_feats"]]
low_res_masks, prd_scores, _, _ = predictor.model.sam_mask_decoder(
    image_embeddings=predictor._features["image_embed"],
    image_pe=predictor.model.sam_prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse_embeddings,
    dense_prompt_embeddings=dense_embeddings,
    multimask_output=True,
    repeat_image=False,
    high_res_features=high_res_features,
)

# Upscale the masks to the original image resolution
prd_masks = predictor._transforms.postprocess_masks(low_res_masks, predictor._orig_hw[-1])
prd_mask = torch.sigmoid(prd_masks[:, 0])
and the code from the inference pipeline produced quite different results.
a) I was wondering whether this is only due to the provided prompts (which are present in the training pipeline but not in the inference one), or whether there are further differences (in the architecture, for example).
b) I noticed that the prediction is highly affected by the provided prompts. I need to apply SAM 2 to unprompted images, so I cannot provide any prompts. Does this mean that I cannot use SAM 2 without prompts and should resort to a different model instead?
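For reference, the standard prompted inference path in SAM2ImagePredictor looks roughly as follows; this is a minimal sketch assuming a single foreground point prompt and the published sam2.1_hiera_large checkpoint, not necessarily the exact code used in the issue:

import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Checkpoint and config paths below assume the published SAM 2.1 Hiera-Large release.
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("example.jpg").convert("RGB"))
input_point = np.array([[500, 375]])   # (x, y) pixel coordinates of one foreground click
input_label = np.array([1])            # 1 = foreground, 0 = background

with torch.inference_mode():
    predictor.set_image(image)
    masks, scores, logits = predictor.predict(
        point_coords=input_point,
        point_labels=input_label,
        multimask_output=True,
    )
# masks: (3, H, W) candidate masks; scores: the model's predicted IoU for each candidate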
SAM 2 does require prompts. There is an automatic mask generator available, which simply places a grid of point prompts over the image and does some post-processing, if you want to look into that.
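A minimal sketch of using that generator, assuming the published sam2.1_hiera_large checkpoint (the paths and threshold values below are illustrative, not prescriptive):

import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
sam2_model = build_sam2(model_cfg, checkpoint)

# The generator prompts the model with a regular grid of points and filters the
# resulting masks by predicted IoU and stability score.
mask_generator = SAM2AutomaticMaskGenerator(
    sam2_model,
    points_per_side=32,          # density of the point-prompt grid
    pred_iou_thresh=0.8,         # drop masks the model itself scores poorly
    stability_score_thresh=0.9,  # drop masks that are unstable under thresholding
)

image = np.array(Image.open("example.jpg").convert("RGB"))
with torch.inference_mode():
    masks = mask_generator.generate(image)

# Each entry in `masks` is a dict with (among other fields) a boolean "segmentation"
# mask plus its "area", "bbox", "predicted_iou" and "stability_score".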