facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

How do I change 1024*1024 image_size to something else, like 256*256 #444

Open jiangxingxian opened 1 year ago

jiangxingxian commented 1 year ago

I want to feed 256*256 images into the network for training without changing the data size. Thank you!

hegdeadithyak commented 1 year ago

    import cv2
    image = cv2.imread("Your File")          # cv2.imread, not cv2.read
    image = cv2.resize(image, (256, 256))    # resize takes a (width, height) tuple

That's it, your image is resized. If you have any doubts, feel free to ask.

onefish51 commented 1 year ago

You can see #338, but there is something wrong with it!

TemugeB commented 1 year ago

You shouldn't change the input image size, for two reasons.

  1. The positional encoding expects an input size of 1024x1024, so any other size will lead to poor results, even assuming the code is properly written to accept different-size inputs.
  2. SAM also uses relative positional encoding, which is likely less sensitive to image size, but the output is still unpredictable.

Finally, the input image size is set in build_sam.py. More specifically: https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/build_sam.py#L63

Changing this might break other parts of the code, so you would need to look through the source code.

Why are you changing the image size? Internally, SAM rescales the image to 1024x1024. If you have memory issues, have you tried the ViT-B checkpoint?
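For reference, the relevant part of `_build_sam` looks roughly like this (paraphrased from the linked file; exact names may differ between versions):

```python
# Paraphrased from segment_anything/build_sam.py (_build_sam); not the verbatim source.
prompt_embed_dim = 256
image_size = 1024                                      # hard-coded input resolution
vit_patch_size = 16
image_embedding_size = image_size // vit_patch_size    # 64x64 patch grid for 1024/16

# image_size and image_embedding_size are then passed to the image encoder and the
# prompt encoder, so both must be kept consistent if you change the value.
```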

onefish51 commented 1 year ago

> Why are you changing the image size? Internally, SAM rescales the image to 1024x1024. If you have memory issues, have you tried the ViT-B checkpoint?

Because running segment-anything on a big (5K) image to get all the masks costs too much time!

TemugeB commented 1 year ago

As I said, internally the model resizes your image to 1024x1024. After that, it returns your masks at the original size (5K?), which may be where the slow speed comes from. If so, resize your image to 1024x1024 before passing it to the model and then post-process the masks yourself. You can also pass 256x256 images to the model, which will be upsampled to 1024x1024, but the model is trained at 1024x1024 resolution, so the segmentation results will likely be poor.

This is, however, a known limitation of vision transformer models: the fixed patch size and static positional encodings force the model into a mostly static input size, so there is not much you can do there.
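A minimal sketch of that workflow, assuming the repository's `SamAutomaticMaskGenerator` API (the checkpoint and image paths below are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")   # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam)

big_image = cv2.cvtColor(cv2.imread("big_5k_image.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path
orig_h, orig_w = big_image.shape[:2]

# Downscale once yourself instead of handing the full-resolution image to the pipeline.
small = cv2.resize(big_image, (1024, 1024), interpolation=cv2.INTER_AREA)

masks = mask_generator.generate(small)   # list of dicts with a boolean "segmentation" array

# Post-process: scale each mask back up to the original resolution.
full_res_masks = [
    cv2.resize(m["segmentation"].astype(np.uint8), (orig_w, orig_h),
               interpolation=cv2.INTER_NEAREST).astype(bool)
    for m in masks
]
```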

onefish51 commented 1 year ago

> This is, however, a known limitation of vision transformer models: the fixed patch size and static positional encodings force the model into a mostly static input size.

Yes, the input size is static!

If you run inference for all masks, the input image is cropped into a list of images, point prompts are generated, and each cropped image is fed into the model. You can see:

https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/automatic_mask_generator.py#L137

So a 5K input image will take more time!
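One way to trade quality for speed is to give the automatic generator fewer crops and point prompts. A rough sketch using `SamAutomaticMaskGenerator`'s constructor arguments (the values are illustrative, not tuned):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")   # placeholder path

# Fewer point prompts and no extra crop layers mean far fewer forward passes on a large image.
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=16,    # default is 32, i.e. a 32x32 grid of point prompts
    crop_n_layers=0,       # skip the additional zoomed-in crop passes
    points_per_batch=64,   # prompts processed per forward pass
)

image = cv2.cvtColor(cv2.imread("big_5k_image.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path
masks = mask_generator.generate(image)
```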

savanth94 commented 1 year ago

> This is, however, a known limitation of vision transformer models: the fixed patch size and static positional encodings force the model into a mostly static input size.

So, is the static image size not an issue for CNN-backbone models like U-Net, or for other CNN models?

I'm new to computer vision, and my understanding is that almost all models (CNN, transformer, and hybrid) have this issue with image resolution: they are not adaptive to different resolutions. Right?
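For what it's worth, a quick toy check (plain PyTorch, nothing SAM-specific) is part of why I'm asking: purely convolutional layers seem to accept any input size.

```python
# Toy check: convolutional layers carry no fixed-size positional embedding,
# so the same weights run on different input sizes.
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
)

for size in (256, 512, 1024):
    x = torch.randn(1, 3, size, size)
    print(size, convs(x).shape)   # spatial output size follows the input size
```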

Starlento commented 1 year ago

I think the problem can be solved. I reviewed the code in modeling_sam.py in Hugging Face Transformers: its config provides use_abs_pos to control whether the pretrained positional embedding is used. If you are worried about performance, DINOv2, which also uses a ViT, provides a solution: look at the interpolate_pos_encoding function in modeling_dinov2.py, also in Hugging Face. It actually "interpolates the pre-trained position encodings" so the pretrained weights can still be used.
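A generic sketch of that interpolation idea (not the exact DINOv2 or SAM code; shapes and names here are illustrative):

```python
import torch
import torch.nn.functional as F

def interpolate_abs_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Resize a pretrained absolute position embedding of shape (1, H, W, C) to a
    new (H', W') grid, in the spirit of DINOv2's interpolate_pos_encoding."""
    x = pos_embed.permute(0, 3, 1, 2)                        # (1, C, H, W)
    x = F.interpolate(x, size=new_hw, mode="bicubic",
                      align_corners=False)                   # (1, C, H', W')
    return x.permute(0, 2, 3, 1)                             # (1, H', W', C)

# Example: SAM's ViT-B image encoder stores its absolute position embedding on a
# 64x64 patch grid (1024 / 16); a 256x256 input would need a 16x16 grid.
pretrained = torch.randn(1, 64, 64, 768)
print(interpolate_abs_pos_embed(pretrained, (16, 16)).shape)  # torch.Size([1, 16, 16, 768])
```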

ByungKwanLee commented 11 months ago

https://github.com/ByungKwanLee/Full-Segment-Anything addresses these critical issues of SAM: it supports batched input for the full-grid prompt (automatic mask generation), with post-processing that removes duplicated or small regions and holes, under flexible input image sizes.