jiangxingxian opened this issue 1 year ago
import cv2
image = cv2.imread("Your File")
image = cv2.resize(image, (256, 256))

That's it, your image is resized. If you have any doubts, feel free to ask.
You can see #338, but there is something wrong in it!
You shouldn't change the input image size for 2 reasons.
- Positional encoding expects an input size of 1024x1024. This means any other size will lead to poor results, even if the code is properly written to accept inputs of a different size.
- SAM also uses relative positional encoding. Relative positional encoding is likely less susceptible to image size, but the output is still unpredictable.
Finally, the input image size is set in build_sam.py. More specifically: https://github.com/facebookresearch/segment-anything/blob/6fdee8f2727f4506cfbbe553e23b895e27956588/segment_anything/build_sam.py#L63
Changing this might break other parts of the code, so you would need to look through the source code.
Why are you changing the image size? Internally, SAM scales the image to 1024x1024. If you have memory issues, have you tried the ViT-B checkpoint?
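For reference, here is a minimal sketch of the resizing that happens internally (my own illustration using the ResizeLongestSide transform from this repo; the 5000x5000 array is just a placeholder for a 5K image):

```python
# Minimal sketch of SAM's internal preprocessing: the longest side is resized to 1024
# (segment_anything/utils/transforms.py) and the result is padded to a fixed 1024x1024
# before it reaches the ViT image encoder (Sam.preprocess).
import numpy as np
import torch
import torch.nn.functional as F
from segment_anything.utils.transforms import ResizeLongestSide

transform = ResizeLongestSide(1024)                    # target length of the longest side

image = np.zeros((5000, 5000, 3), dtype=np.uint8)      # placeholder for a 5K image
resized = transform.apply_image(image)                 # 1024 x 1024 x 3 for a square input

x = torch.as_tensor(resized).permute(2, 0, 1).float()  # HWC -> CHW
h, w = x.shape[-2:]
x = F.pad(x, (0, 1024 - w, 0, 1024 - h))               # pad right/bottom to 1024x1024
print(x.shape)                                         # torch.Size([3, 1024, 1024])
```

So whatever size you pass in, the image encoder always sees a 1024x1024 tensor.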
Because running segment-anything to get all masks on a big image (5K) costs too much time!
As I said, internally the model resizes your image to 1024x1024. After that, it returns your masks at the original size (5K?), so that might be where the slow speed comes from. If so, resize your image to 1024x1024 before passing it to the model and then post-process the masks yourself. You can pass 256x256 images to the model, which will upsample them to 1024x1024, but the model is trained at 1024x1024 resolution, so the segmentation result will likely be poor.

This is, however, an existing problem with vision transformer models. The fixed patch size and static positional encodings force the model into a mostly static input size, so there is not much you can do there.
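If the upsampling of the masks back to 5K is indeed the bottleneck, a rough sketch of the workflow described above might look like this (the checkpoint path, file names, and the use of SamAutomaticMaskGenerator are my assumptions, not part of the original reply):

```python
# Sketch: shrink the 5K image before generating masks, then upsample the binary masks
# back to the original resolution yourself. Paths and sizes are placeholders.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("big_5k_image.jpg"), cv2.COLOR_BGR2RGB)
orig_h, orig_w = image.shape[:2]

small = cv2.resize(image, (1024, 1024))        # downscale before mask generation
masks = mask_generator.generate(small)         # masks come back at 1024x1024

# Post-process: bring each boolean mask back to the original 5K resolution.
for m in masks:
    seg = m["segmentation"].astype("uint8")
    m["segmentation"] = cv2.resize(
        seg, (orig_w, orig_h), interpolation=cv2.INTER_NEAREST
    ).astype(bool)
```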
Yes, the input size is static!
If you run inference for all masks, the input image is cropped into a list of images, grid points are generated, and each cropped image is fed into the model; you can see this in the code.
So a 5K input image will cost more time!
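For what it's worth, a rough back-of-the-envelope count of the work involved (this is my own sketch based on the default crop and point-grid scheme in the automatic mask generator, not something stated in this thread): each crop costs one image-encoder pass, and every crop gets its own grid of prompt points that the mask decoder processes in batches.

```python
# Rough estimate of the work done by full-grid ("segment everything") mask generation.
# Assumptions: default points_per_side and points_per_batch, and the layered cropping
# scheme where layer i adds a (2**(i+1)) x (2**(i+1)) grid of crops on top of the full image.
points_per_side = 32      # default: 32*32 = 1024 prompt points per crop
points_per_batch = 64     # default: prompts per mask-decoder batch
crop_n_layers = 1         # example value; the default is 0 (whole image only)

n_crops = sum((2 ** (i + 1)) ** 2 for i in range(crop_n_layers)) + 1   # 4 + 1 = 5
n_prompts = n_crops * points_per_side ** 2                             # 5 * 1024 = 5120
n_decoder_batches = -(-n_prompts // points_per_batch)                  # ceil -> 80

# Each crop is one expensive ViT image-encoder pass; the prompt batches are cheaper.
print(f"{n_crops} encoder passes, {n_prompts} prompts, {n_decoder_batches} decoder batches")
```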
So, the static image size is not an issue for CNN backbone models like U-Net, or for any other CNN model?
I'm new to computer vision, and my understanding is that almost all models (CNN, Transformer, and hybrid) have this issue with image resolution; they are not adaptive to different resolutions. Right?
I think the problem can be solved. I reviewed the code in modeling_sam.py in the Hugging Face transformers library; its config provides use_abs_pos to choose whether or not to use the pretrained positional embedding.
If you are worried about performance, DINOv2, which also uses a ViT, provides a solution. You can look at the interpolate_pos_encoding function in modeling_dinov2.py, also in Hugging Face transformers: it actually "interpolates the pre-trained position encodings" so the pretrained weights can still be used.
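To illustrate the idea (a generic sketch, not the actual Hugging Face implementation, which also handles the class token and other details): the pretrained patch-grid positional embedding is resampled to the new grid size with bicubic interpolation, so the pretrained weights can still be reused at a different input resolution.

```python
# Generic sketch of DINOv2-style positional-embedding interpolation.
# pos_embed is assumed to be the patch-token embedding of shape (1, grid*grid, dim).
import torch
import torch.nn.functional as F

def interpolate_patch_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resample a (1, old_grid*old_grid, dim) positional embedding to a new grid size."""
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) so image-style interpolation applies
    pe = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Example: a ViT pretrained at 1024x1024 with patch size 16 has a 64x64 token grid;
# a 256x256 input with the same patch size needs a 16x16 grid.
pretrained = torch.randn(1, 64 * 64, 768)
adapted = interpolate_patch_pos_embed(pretrained, old_grid=64, new_grid=16)
print(adapted.shape)  # torch.Size([1, 256, 768])
```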
https://github.com/ByungKwanLee/Full-Segment-Anything addresses the critical issues of SAM: it supports batched input for the full-grid prompt (automatic mask generation) with post-processing (removing duplicated or small regions and holes) under flexible input image sizes.
I want to feed 256x256 images into the network for training without changing the data size. Thank you!