czg1225 / SlimSAM

SlimSAM: 0.1% Data Makes Segment Anything Slim

Batch Prompting for Multiple Boxes #14

Open Ashutosh1995 opened 3 months ago

Ashutosh1995 commented 3 months ago

Hi @czg1225

Thanks for this repo!

Could you please explain how I can do batch prompting by passing multiple boxes as prompts to the model?

Right now, the predict function takes a single NumPy array of length 4.

Also, could you clarify whether, for custom dataset training, each image file should have a corresponding JSON or whether a single JSON for the whole folder is enough? I am getting the error shown below. I have set gradsize to 100 and to 1000. My training set has 1306 images and my validation set has 293.

[Screenshot of the training error (2024-03-19 111709)]

Ashutosh1995 commented 3 months ago

Also, in the training code you are using point coordinates. What if I want to use boxes as prompts during custom dataset training?

czg1225 commented 3 months ago

Hi @Ashutosh1995 ,

  1. Our SlimSAM has the same inference workflow as the original SAM, so you can refer to the original SAM for more details about batch prompting (a minimal sketch is included at the end of this reply).
  2. Each image should have a corresponding JSON, just like the SA-1B dataset. The gradsize should be smaller than the number of training images.
  3. If you want to use box prompts during training, please refer to this issue: https://github.com/czg1225/SlimSAM/issues/11. I have helped another user solve this problem there.
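
For reference, here is a minimal sketch of batched box prompting with the SAM-style predictor, following the batched-prompt pattern from the original SAM repository (the model, image, and box coordinates below are illustrative placeholders):

```python
import torch
from segment_anything import SamPredictor

# `sam` is an already-loaded SAM/SlimSAM model and `image` is an HxWx3 RGB numpy array.
predictor = SamPredictor(sam)
predictor.set_image(image)

# Several boxes at once, each in (x0, y0, x1, y1) pixel coordinates.
input_boxes = torch.tensor([
    [100, 100, 300, 400],
    [350, 120, 500, 380],
], device=predictor.device)

# Map the boxes to the model's input resolution, then run one batched prediction.
transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks, scores, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,
)
# masks has shape (num_boxes, 1, H, W): one mask per input box.
```
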
Ashutosh1995 commented 3 months ago

Thanks a lot @czg1225 for your reply! Let me have a look at this issue and see if I can replicate it on my dataset.

Ashutosh1995 commented 3 months ago

@czg1225 thanks for your help, I was able to train the model using box prompts. My custom dataset has 1306 training images and 293 validation images. During inference, I also supplied box prompts, and my output looks like this:

[Screenshot of SlimSAM inference output (2024-03-21 143845)]

I wanted to segment the bottom card alone, but in my results I am getting the card on top as well.

Can you please suggest what should I do to improve this?

czg1225 commented 3 months ago

Hi @Ashutosh1995 , I don't know which model you used for the inference phase. If you are using your own trained SlimSAM, I recommend training for more iterations or reducing the pruning ratio. On the other hand, if you are using our pre-trained SlimSAM, it would help to supply a more precise bounding box (one that completely surrounds the bottom card). The current box prompt may make it hard to achieve the desired outcome, a difficulty that might persist even with the original SAM-H model.

Another possible solution is to use a prompt that combines a point and a box. You can find more details about combined prompts in SAM's notebook: https://github.com/facebookresearch/segment-anything/blob/main/notebooks/predictor_example.ipynb. A minimal sketch is shown below.
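
For illustration, a minimal sketch of a combined point-and-box prompt with the predictor (coordinates are placeholders; `predictor` is a SamPredictor with set_image() already called):

```python
import numpy as np

# One box around the bottom card plus one foreground point inside it.
input_box = np.array([120, 250, 480, 520])   # (x0, y0, x1, y1), illustrative values
input_point = np.array([[300, 420]])         # a point inside the target card
input_label = np.array([1])                  # 1 = foreground point

masks, scores, _ = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    box=input_box,
    multimask_output=False,
)
```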

Ashutosh1995 commented 3 months ago

Thanks for your reply @czg1225. With default settings, I trained SlimSAM on my custom dataset with a pruning ratio of 0.5. I was also thinking of reducing the pruning ratio and training for more iterations. What values would you recommend, given that my images contain multiple stacks of cards placed on top of each other?

I was also thinking that a point prompt would make more sense in this case. Let me try that as well, together with the combined prompts.

czg1225 commented 3 months ago

Before you train SlimSAM, I recommend running inference with the original SAM using the same prompt to see whether this bad case still occurs. If the original SAM performs well under the same prompt, then reducing the pruning ratio will help. Conversely, if the original SAM also underperforms, reducing the pruning ratio may not lead to improvements.

Ashutosh1995 commented 3 months ago

This is the output from the original SAM (ViT-H) using only the box prompt.

[Screenshot of SAM ViT-H output (2024-03-21 170842)]

With SlimSAM, can we use both box and point prompts to train the model, as you suggested earlier?

I also tried https://github.com/luca-medeiros/lightning-sam, but I think the same issue will persist there as well, right?

czg1225 commented 3 months ago
  1. In fact, the prompts are only used in the validation phase of the training code, so you can adjust the validation phase to run inference with a combination of box and point prompts.
  2. I don't think there will be an improvement with lightning-sam. I recommend giving the model a more accurate prompt (a box prompt that completely surrounds the bottom card, or a combination of point and box prompts).
Ashutosh1995 commented 3 months ago

Thanks a lot @czg1225 for your valuable feedback! Let me adjust the data accordingly and get back to you. Please keep this thread open!

Ashutosh1995 commented 3 months ago

@czg1225 is it possible to train with both boxes and points, run inference with just boxes, and still obtain good masks?

czg1225 commented 3 months ago

@Ashutosh1995 , the training process of SlimSAM performs knowledge distillation on the image embeddings of the image encoder, so training needs no prompts; only the validation phase needs them. As a result, you will not run into problems when inferring with either a box prompt or a point prompt.

Ashutosh1995 commented 3 months ago

@czg1225 does that mean that whether I give only the box prompt or the combined box + point prompt, the performance would remain the same?

czg1225 commented 3 months ago

@Ashutosh1995 In a nutshell, different prompts only affect the inference phase, not the training phase. If you use a more accurate prompt, you may get a better result at inference time, but you do not need to think about the prompt when you train SlimSAM.

Ashutosh1995 commented 3 months ago

@czg1225 thanks a lot for answering all my questions. Let me also try both SlimSAM and the Lightning-SAM model to see whether using both point and box prompts helps in this case!