Usage - Performance - Setup - Examples - Training - Evaluation - Acknowledgment - See also
NanoSAM is a Segment Anything (SAM) model variant that is capable of running in real time on NVIDIA Jetson Orin platforms with NVIDIA TensorRT.
NanoSAM is trained by distilling the MobileSAM image encoder on unlabeled images. For an introduction to knowledge distillation, we recommend checking out this tutorial.
Using NanoSAM from Python looks like this:

import numpy as np
import PIL.Image
from nanosam.utils.predictor import Predictor

predictor = Predictor(
    image_encoder="data/resnet18_image_encoder.engine",
    mask_decoder="data/mobile_sam_mask_decoder.engine"
)

image = PIL.Image.open("dog.jpg")
predictor.set_image(image)

# (x, y) is a foreground point prompt; label 1 marks it as foreground
mask, _, _ = predictor.predict(np.array([[x, y]]), np.array([1]))
Follow the setup instructions below to build the engine files.
NanoSAM runs in real time on the Jetson Orin Nano.
| Model † | Jetson Orin Nano: Image Encoder (ms) | Jetson Orin Nano: Full Pipeline (ms) | Jetson AGX Orin: Image Encoder (ms) | Jetson AGX Orin: Full Pipeline (ms) | mIoU ‡: All | mIoU: Small | mIoU: Medium | mIoU: Large |
|---|---|---|---|---|---|---|---|---|
| MobileSAM | TBD | 146 | 35 | 39 | 0.728 | 0.658 | 0.759 | 0.804 |
| NanoSAM (ResNet18) | TBD | 27 | 4.2 | 8.1 | 0.706 | 0.624 | 0.738 | 0.796 |
NanoSAM is fairly easy to get started with.
Install the dependencies
Install PyTorch
Install torch2trt
Install NVIDIA TensorRT
(optional) Install TRTPose - For the pose example.
git clone https://github.com/NVIDIA-AI-IOT/trt_pose
cd trt_pose
python3 setup.py develop --user
(optional) Install the Transformers library - For the OWL-ViT example.
python3 -m pip install transformers
Install the NanoSAM Python package
git clone https://github.com/NVIDIA-AI-IOT/nanosam
cd nanosam
python3 setup.py develop --user
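Before building any engines, a quick sanity check like the one below (a minimal sketch, not part of the NanoSAM tooling) can confirm that the core dependencies import cleanly and that a CUDA device is visible:

# Quick dependency check: verifies PyTorch, torch2trt, and TensorRT import
# and that PyTorch can see a CUDA device.
import torch
import torch2trt  # noqa: F401
import tensorrt as trt

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("tensorrt:", trt.__version__)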
Build the TensorRT engine for the mask decoder
Export the MobileSAM mask decoder ONNX file (or download directly from here)
python3 -m nanosam.tools.export_sam_mask_decoder_onnx \
--model-type=vit_t \
--checkpoint=assets/mobile_sam.pt \
--output=data/mobile_sam_mask_decoder.onnx
Build the TensorRT engine
trtexec \
--onnx=data/mobile_sam_mask_decoder.onnx \
--saveEngine=data/mobile_sam_mask_decoder.engine \
--minShapes=point_coords:1x1x2,point_labels:1x1 \
--optShapes=point_coords:1x1x2,point_labels:1x1 \
--maxShapes=point_coords:1x10x2,point_labels:1x10
This assumes the mask decoder ONNX file is located at data/mobile_sam_mask_decoder.onnx.
Build the TensorRT engine for the NanoSAM image encoder
Download the image encoder: resnet18_image_encoder.onnx
Build the TensorRT engine
trtexec \
--onnx=data/resnet18_image_encoder.onnx \
--saveEngine=data/resnet18_image_encoder.engine \
--fp16
Run the basic usage example
python3 examples/basic_usage.py \
--image_encoder=data/resnet18_image_encoder.engine \
--mask_decoder=data/mobile_sam_mask_decoder.engine
This outputs a result to data/basic_usage_out.jpg.
That's it! From here, you can read the example code to see how to use NanoSAM from Python, or try running the more advanced examples below.
NanoSAM can be applied in many creative ways.
This example uses a known image with a fixed bounding box to control NanoSAM segmentation.
To run the example, call
python3 examples/basic_usage.py \
--image_encoder="data/resnet18_image_encoder.engine" \
--mask_decoder="data/mobile_sam_mask_decoder.engine"
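For reference, a bounding-box prompt can be expressed with the same Predictor API shown earlier. The sketch below follows SAM's convention of encoding a box as its top-left corner (label 2) and bottom-right corner (label 3); the image path and box values are illustrative, and examples/basic_usage.py remains the authoritative reference.

# Sketch: segment the contents of a bounding box (illustrative values).
# SAM prompt convention: label 2 = box top-left corner, label 3 = bottom-right corner.
import numpy as np
import PIL.Image
from nanosam.utils.predictor import Predictor

predictor = Predictor(
    image_encoder="data/resnet18_image_encoder.engine",
    mask_decoder="data/mobile_sam_mask_decoder.engine"
)

image = PIL.Image.open("dog.jpg")
predictor.set_image(image)

x0, y0, x1, y1 = 100, 100, 850, 760          # hypothetical box corners
points = np.array([[x0, y0], [x1, y1]])
labels = np.array([2, 3])
mask, _, _ = predictor.predict(points, labels)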
This example demonstrates using OWL-ViT to detect objects from one or more text prompts, and then segmenting those objects using NanoSAM.
To run the example, call
python3 examples/segment_from_owl.py \
--prompt="A tree" \
--image_encoder="data/resnet18_image_encoder.engine" \
--mask_decoder="data/mobile_sam_mask_decoder.engine"
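As a rough illustration of the flow (not the project's implementation in examples/segment_from_owl.py), the detection step can be done with the transformers zero-shot object detection pipeline and the resulting box fed to NanoSAM as a box prompt. The model name, image, and text prompt below are assumptions.

# Sketch: detect with OWL-ViT via the transformers pipeline, then segment the
# highest-scoring detection with NanoSAM. Model, image, and prompt are illustrative.
import numpy as np
import PIL.Image
from transformers import pipeline
from nanosam.utils.predictor import Predictor

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
predictor = Predictor(
    image_encoder="data/resnet18_image_encoder.engine",
    mask_decoder="data/mobile_sam_mask_decoder.engine"
)

image = PIL.Image.open("dog.jpg")
detections = detector(image, candidate_labels=["a dog"])
box = max(detections, key=lambda d: d["score"])["box"]   # xmin/ymin/xmax/ymax

predictor.set_image(image)
points = np.array([[box["xmin"], box["ymin"]], [box["xmax"], box["ymax"]]])
mask, _, _ = predictor.predict(points, np.array([2, 3]))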
This example demonstrates how to use human pose keypoints from TRTPose to control NanoSAM segmentation.
To run the example, call
python3 examples/segment_from_pose.py
This will save an output figure to data/segment_from_pose_out.png.
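Conceptually, the pose example turns detected keypoints into point prompts: keypoints on the region of interest become foreground points (label 1) and other keypoints become background points (label 0). The coordinates below are hypothetical; examples/segment_from_pose.py shows the real keypoint selection.

# Sketch: pose keypoints as NanoSAM point prompts (coordinates are hypothetical).
# Label 1 = foreground point, label 0 = background point.
import numpy as np

foreground = [(410, 220), (520, 230), (430, 400), (510, 405)]   # e.g. shoulders and hips
background = [(465, 120)]                                        # e.g. nose

points = np.array(foreground + background)
labels = np.array([1] * len(foreground) + [0] * len(background))
# mask, _, _ = predictor.predict(points, labels)   # predictor set up as in the basic usage example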
This example demonstrates how to use human pose to control segmentation on a live camera feed. This example requires an attached display and camera.
To run the example, call
python3 examples/demo_pose_tshirt.py
This example demonstrates rudimentary segmentation tracking with NanoSAM. This example requires an attached display and camera.
To run the example, call
python3 examples/demo_click_segment_track.py <image_encoder_engine> <mask_decoder_engine>
Once the example is running, double click an object you want to track.
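A minimal sketch of the click handling (OpenCV only, not the full tracking loop in examples/demo_click_segment_track.py): a double-click records a point that can then be used as a foreground prompt for the predictor.

# Sketch: capture a double-click with OpenCV and keep it as a prompt point.
import cv2

clicked_point = None

def on_mouse(event, x, y, flags, param):
    global clicked_point
    if event == cv2.EVENT_LBUTTONDBLCLK:
        clicked_point = (x, y)    # later passed to predictor.predict as a label-1 point

cv2.namedWindow("demo")
cv2.setMouseCallback("demo", on_mouse)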
You can train NanoSAM on a single GPU.
Download and extract the COCO 2017 train images
mkdir -p data/coco
cd data/coco
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
cd ../..
Build the MobileSAM image encoder (used as the teacher model)
Export to ONNX
python3 -m nanosam.tools.export_sam_image_encoder_onnx \
--checkpoint="assets/mobile_sam.pt" \
--output="data/mobile_sam_image_encoder_bs16.onnx" \
--model_type=vit_t \
--batch_size=16
Build the TensorRT engine with batch size 16
trtexec \
--onnx=data/mobile_sam_image_encoder_bs16.onnx \
--shapes=image:16x3x1024x1024 \
--saveEngine=data/mobile_sam_image_encoder_bs16.engine
Train the NanoSAM image encoder by distilling MobileSAM
python3 -m nanosam.tools.train \
--images=data/coco/train2017 \
--output_dir=data/models/resnet18 \
--model_name=resnet18 \
--teacher_image_encoder_engine=data/mobile_sam_image_encoder_bs16.engine \
--batch_size=16
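The training objective itself is a straightforward feature regression: the frozen MobileSAM encoder produces target embeddings and the ResNet18 student is trained to match them. The sketch below only illustrates that idea; the actual loop, data loading, and loss live in nanosam.tools.train.

# Minimal sketch of feature distillation (illustration only; the real training
# loop is in nanosam.tools.train). Assumes `teacher` and `student` both map a
# batch of images to feature maps of identical shape.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, optimizer):
    with torch.no_grad():
        target = teacher(images)            # frozen teacher (MobileSAM) features
    pred = student(images)                  # student (ResNet18) features
    loss = F.huber_loss(pred, target)       # regress student features onto the teacher's
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()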
Export the trained NanoSAM image encoder to ONNX
python3 -m nanosam.tools.export_image_encoder_onnx \
--model_name=resnet18 \
--checkpoint="data/models/resnet18/checkpoint.pth" \
--output="data/resnet18_image_encoder.onnx"
You can then build the TensorRT engine as detailed in the getting started section.
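Optionally, before building the engine, you can inspect the exported ONNX file; the check below assumes onnxruntime is installed (it is not a NanoSAM dependency).

# Optional check of the exported encoder (assumes onnxruntime is installed).
# Prints input/output names and shapes; the encoder should take a 3x1024x1024 image.
import onnxruntime as ort

session = ort.InferenceSession("data/resnet18_image_encoder.onnx",
                               providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape)
for out in session.get_outputs():
    print("output:", out.name, out.shape)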
You can reproduce the accuracy results above by evaluating against COCO ground-truth masks.
Download and extract the COCO 2017 validation set.
mkdir -p data/coco  # skip if it already exists
cd data/coco
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
cd ../..
Compute the IoU of NanoSAM mask predictions against the ground-truth COCO mask annotations.
python3 -m nanosam.tools.eval_coco \
--coco_root=data/coco/val2017 \
--coco_ann=data/coco/annotations/instances_val2017.json \
--image_encoder=data/resnet18_image_encoder.engine \
--mask_decoder=data/mobile_sam_mask_decoder.engine \
--output=data/resnet18_coco_results.json
This uses the COCO ground-truth bounding boxes as inputs to NanoSAM.
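For reference, the metric computed per instance is a plain mask IoU; a minimal sketch (not the evaluation tool's code) looks like this:

# Sketch: IoU between a predicted binary mask and a ground-truth mask (HxW arrays).
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)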
Compute the average IoU over a selected category or size
python3 -m nanosam.tools.compute_eval_coco_metrics \
data/resnet18_coco_results.json \
--size="all"
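If you want to aggregate the results yourself, the sketch below shows the idea, under the assumption (hypothetical, check the tool's output) that the results JSON stores one record per annotation with "iou" and "area" fields; the size buckets follow COCO's standard area thresholds.

# Hedged sketch: size-bucketed mean IoU from a results file. The "iou"/"area"
# record format is an assumption; nanosam.tools.compute_eval_coco_metrics is
# the supported way to compute these numbers.
import json
import numpy as np

with open("data/resnet18_coco_results.json") as f:
    results = json.load(f)

def bucket(area):
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

for size in ("all", "small", "medium", "large"):
    ious = [r["iou"] for r in results if size == "all" or bucket(r["area"]) == size]
    print(size, round(float(np.mean(ious)), 3))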
This project is enabled by the great projects below.