NVIDIA-AI-IOT / nanosam

A distilled Segment Anything (SAM) model capable of running real-time with NVIDIA TensorRT
Apache License 2.0

Image encoder input size is too large; can it be reduced from 1024 to 640/480 for acceleration? #10

Open · Mediumcore opened 11 months ago

Mediumcore commented 11 months ago

The image encoder input size is too large. Can it be reduced from 1024 to 640/480 for acceleration?

jaybdub commented 10 months ago

Hi @Mediumcore ,

This would require distilling a new model. You may be able to follow these steps:

Disclaimer: I haven't tested these, so let me know if you run into issues.

Step 1 - Register a new model

Register a new model, and ensure that it outputs features of shape 256x64x64 for your desired input resolution (the SAM mask decoder expects 256x64x64 image embeddings). For example, for an input of size 512x512, your model must have an output stride of 8, since 512 / 8 = 64.

You can register the model similar to here:

https://github.com/NVIDIA-AI-IOT/nanosam/blob/653633614b2eb93b06ba3be9adb2aeffb117bd72/nanosam/models/timm_image_encoder.py#L78

You could try registering a new model with a different stride, and setting the student size to a lower resolution (i.e., 512x512), as sketched below.
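As a rough illustration (untested, and the 1x1 projection here is a stand-in for the neck that TimmImageEncoder actually builds), timm can construct a backbone with a reduced output stride directly:

```python
import timm
import torch

# Sketch: build a backbone whose final feature map is 64x64 for a
# 512x512 input (output stride 8). timm's ResNet family supports an
# `output_stride` argument, implemented with dilated convolutions.
backbone = timm.create_model(
    "resnet18",
    pretrained=True,
    features_only=True,
    output_stride=8,    # 512 / 8 = 64
    out_indices=(4,),   # keep only the last feature stage
)

x = torch.randn(1, 3, 512, 512)
feat = backbone(x)[0]
print(feat.shape)  # torch.Size([1, 512, 64, 64])

# A 1x1 projection brings the channels to the 256 that the SAM mask
# decoder expects, giving the required 256x64x64 feature shape.
neck = torch.nn.Conv2d(feat.shape[1], 256, kernel_size=1)
print(neck(feat).shape)  # torch.Size([1, 256, 64, 64])
```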

Step 2 - Train the distilled model

Next, you'll need to train the model on unlabeled images. Follow the training instructions in the README, but set the "student_size" parameter to the desired size (512).

https://github.com/NVIDIA-AI-IOT/nanosam/blob/653633614b2eb93b06ba3be9adb2aeffb117bd72/nanosam/tools/train.py#L33
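Mechanically, "student_size" controls the resolution the student sees during feature distillation. Here is a minimal sketch of that step (the function signature and the Huber loss are my assumptions, not the actual code in train.py):

```python
import torch
import torch.nn.functional as F

# Illustrative distillation step: the teacher encodes the full-resolution
# image, the student encodes a downscaled copy, and the student is trained
# to reproduce the teacher's 256x64x64 features.
def distill_step(student, teacher, images, optimizer, student_size=512):
    with torch.no_grad():
        target = teacher(images)  # features from the 1024x1024 input

    small = F.interpolate(
        images, size=(student_size, student_size),
        mode="bilinear", align_corners=False,
    )
    pred = student(small)         # must also come out as 256x64x64

    loss = F.huber_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```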

Step 3 - Evaluate the distilled model

Follow the evaluation instructions in the README to compare the accuracy for small / medium / large objects.
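The comparison boils down to per-mask IoU bucketed by object size. A minimal sketch of the metric (the helper names are mine; the size thresholds follow the COCO convention):

```python
import numpy as np

# Per-mask IoU between a predicted and a ground-truth binary mask.
def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

# COCO-style size buckets by ground-truth mask area.
def size_bucket(gt: np.ndarray) -> str:
    area = gt.astype(bool).sum()
    if area < 32 ** 2:
        return "small"
    return "medium" if area < 96 ** 2 else "large"
```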

As a note: distillation only applies to the image encoder. It's worth benchmarking the mask decoder before investing in this, since image encoding speed is approaching decoding speed and may no longer be the performance bottleneck.
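For a quick sanity check before committing to retraining, something like the following (an illustrative PyTorch timing loop, not the repo's benchmark; assumes both modules run on the GPU) can tell you whether the encoder still dominates:

```python
import time
import torch

# Rough latency measurement: average milliseconds per forward pass.
# If decoder latency already rivals encoder latency, a smaller encoder
# input buys little end-to-end speedup.
@torch.no_grad()
def benchmark_ms(module, example_inputs, iters=100, warmup=10):
    for _ in range(warmup):
        module(*example_inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        module(*example_inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3
```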

Hope this helps. If we end up releasing a lower-resolution model I will update this thread, but there are no plans at the moment.

John

Mediumcore commented 10 months ago

Understood, thank you very much for the reply.

limjh16 commented 1 month ago

> Register a new model, and ensure that it outputs features of shape 256x64x64 for your desired input resolution. For example, for an input of size 512x512, your model must have an output stride of 8.
>
> You can register the model similar to here:
>
> https://github.com/NVIDIA-AI-IOT/nanosam/blob/653633614b2eb93b06ba3be9adb2aeffb117bd72/nanosam/models/timm_image_encoder.py#L78
>
> You could try registering a new model with a different stride, and setting the student size to a lower resolution (i.e., 512x512).

Hey there, could I check which particular convolution layer the output stride of 8 should be applied to? Also, where is the setting for the student-size resolution? I am trying to get a model with an input size of 512x512.

https://github.com/NVIDIA-AI-IOT/nanosam/blob/653633614b2eb93b06ba3be9adb2aeffb117bd72/nanosam/models/timm_image_encoder.py#L42-L57