neko-para opened this issue 6 months ago
stderr of the processing:
size: (1280, 720, 3)
{ pts: [ [ 707.1875, 55.8046875 ] ] }
sizp: 1280 720
(1, 1, 1280, 720)
The SAM model internally scales the input image to fit inside a 1024x1024 resolution and uses padding to fill out the missing space, which would be 'to the right' of your image in this case (to fill in the narrow side of the image). The mask decoder is supposed to remove this padding, which requires knowing the size of the original image (through the orig_im_size input, most likely).
In this case, it looks like the cropping has been flipped: the mask looks cropped at the bottom (judging by the misalignment of the mask with the search bar part of the image) instead of removing the padding on the right (which is why it looks like there's a gap on the right). The width of the displayed mask result also seems to confirm this. The original image would have taken up 56.25% (720/1280) of the width of the internal padded image and 56.25% of your displayed image (432px) is 243px, very close to the maximum expected width of the resulting mask if the padding isn't removed.
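To put numbers on that (just a sketch of the arithmetic, assuming the usual longest-side-to-1024 preprocessing):
# Original image size (height, width) from the log above
orig_h, orig_w = 1280, 720
# SAM scales the longest side to 1024 and pads the short side out to 1024
scale = 1024 / max(orig_h, orig_w)        # 0.8
scaled_h = round(orig_h * scale)          # 1024
scaled_w = round(orig_w * scale)          # 576, so 448px of right-side padding
# Fraction of the padded width actually covered by the image
width_fraction = scaled_w / 1024          # 0.5625, same as 720/1280
print(scaled_h, scaled_w, width_fraction * 432)   # 1024 576 243.0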
As for fixing it, I'm not very familiar with the onnx side of things, but maybe the orig_im_size input needs to be made dynamic? Otherwise I would try flipping the order of the height & width values given as the orig_im_size input.
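For reference, the official segment-anything ONNX example builds orig_im_size as a float32 [height, width] tensor, so the swap experiment is a one-value change (this is only a sketch; input names/dtypes may differ for other exports):
import numpy as np
# The log above shows an image shape of (1280, 720, 3), i.e. H=1280, W=720
h, w = 1280, 720
# [height, width], matching np.array(image.shape[:2], dtype=np.float32)
orig_im_size = np.array([h, w], dtype=np.float32)
# The experiment suggested above: swap the two values and see whether
# the mask alignment changes
orig_im_size_swapped = np.array([w, h], dtype=np.float32)
print(orig_im_size, orig_im_size_swapped)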
@heyoeyo After switching orig_im_size, it seems that SAM truncates the longer side at the same ratio. I've also tried passing [h, h] or [w, w], which doesn't work either.
switch width & height
pass height & height
pass width & width
Weird! It definitely seems like there's something wrong with the cropping and/or scaling of the mask result to remove the input padding. Swapping the width & height at least seems to fix the removal of the right-side padding (judging from the fact that the mask is horizontally aligned correctly), but it's clearly messing up the scaling still. Though looking at the onnx version of the model, the post-processing code looks ok to me...
As a sanity check, it might be worth manually handling the scaling/padding removal (using the low_res_logits output of the model), just to be sure that the correct transformations are being done. The basic steps are:
1. Scale the low_res_logits output up to 1024x1024
2. Crop out the padding using the pre-padding input size (the predictor should have stored this as predictor.input_size; in this case I think it should be 1024x576)
3. Scale the cropped result back to the original image size
Assuming the masks come out as an np.array, I think something like this should work:
import cv2
import numpy as np

# Show low-res mask result after upscaling
result_uint8 = np.uint8((low_res_logits.squeeze() > 0) * 255)
scaled_uint8 = cv2.resize(result_uint8, dsize=(1024,1024))
cv2.imshow("Scaled low-res result", result_uint8)
cv2.waitKey(250)
# Show result after removing padding
cropped_uint8 = scaled_uint8[0:1024, 0:576]
cv2.imshow("Cropped result", cropped_uint8)
cv2.waitKey(250)
# Show final mask scaled back to original size
final_uint8 = cv2.resize(cropped_uint8, dsize=(720,1280))
cv2.imshow("Final result", final_uint8)
# Show windows until a keypress occurs, then close them all
cv2.waitKey(0)
cv2.destroyAllWindows()
This should pop up a bunch of windows showing the intermediate results. The mask will look worse, since the thresholding (>0 check) happens before scaling, but it should at least give a sense of whether the mask is being cropped/scaled properly, or if something is wrong with the sizes.
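If the blocky look makes it hard to judge, a variant that resizes the float logits first and only thresholds at the end (same sizes as above, just a sketch) gives a cleaner mask to compare against:
import cv2
import numpy as np

# Resize the raw (float) logits first, then threshold, to avoid the blocky
# look that comes from upscaling an already-binarized mask
logits_2d = low_res_logits.squeeze().astype(np.float32)
scaled_logits = cv2.resize(logits_2d, dsize=(1024, 1024))
cropped_logits = scaled_logits[0:1024, 0:576]                  # remove right-side padding
final_logits = cv2.resize(cropped_logits, dsize=(720, 1280))   # cv2 dsize is (width, height)
final_mask_uint8 = np.uint8((final_logits > 0) * 255)
cv2.imshow("Final result (threshold after resize)", final_mask_uint8)
cv2.waitKey(0)
cv2.destroyAllWindows()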
Currently, my program works perfectly on the demo pictures (e.g. images/truck.jpg), but when I switch to my own PNGs, the result seems to be scaled incorrectly.
For instance, the displayed size of the image below is 432x770, while the resulting mask only seems to cover about 245x770 of it. The image being processed: