This is our official implementation of running mask prediction in the browser with the ONNX model, multithreading, and a precomputed image embedding. It should have ~50 ms latency: https://github.com/facebookresearch/segment-anything/tree/main/demo
Please see the README in the demo folder for more details.
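For anyone who wants to exercise the same split (precomputed embedding fed to the lightweight decoder) outside the browser, here is a minimal Python sketch using onnxruntime. It is not the demo code: the file names are placeholders, it assumes the decoder was exported with scripts/export_onnx_model.py, and the input names below are the ones used by that export.

```python
# Minimal sketch (not the demo code): run the exported SAM mask decoder with
# onnxruntime in Python, given a precomputed image embedding.
# "sam_onnx.onnx" and "image_embedding.npy" are placeholder file names; the
# embedding is the (1, 256, 64, 64) array from SamPredictor.get_image_embedding().
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession("sam_onnx.onnx")
embedding = np.load("image_embedding.npy").astype(np.float32)

# One foreground point prompt; a padding point with label -1 is appended
# because no box prompt is provided.
feeds = {
    "image_embeddings": embedding,
    "point_coords": np.array([[[320.0, 240.0], [0.0, 0.0]]], dtype=np.float32),
    "point_labels": np.array([[1.0, -1.0]], dtype=np.float32),
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    "orig_im_size": np.array([480.0, 640.0], dtype=np.float32),
}
masks, iou_predictions, low_res_masks = session.run(None, feeds)
print(masks.shape, iou_predictions)
```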
Yes, the problem is the embedding. On my 3080 Ti, computing the embedding for a 640x480 image takes ~1.5 seconds. The FAQ on the Segment Anything site says the embedding should take 0.15 seconds on an A100 (suggesting an A100 runs the encoder roughly 10x faster than a 3080).
This isn't called out very clearly in the paper, but it does mention a few times that the image encoder is "heavyweight". It appears it could be replaced with a "cheaper" encoder that outputs a CxWxH embedding, but I assume you would have to retrain the model and the resulting quality might be much worse. Personally, I wish the embedding time were called out more explicitly in the paper.
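For anyone who wants to reproduce these numbers, here is a minimal timing sketch that isolates the image-encoder (embedding) step. It assumes a local ViT-H checkpoint and a CUDA device; the checkpoint and image file names are placeholders.

```python
# Minimal sketch: time only the heavyweight image-encoder (embedding) step.
# Assumes segment-anything is installed; file names are placeholders.
import time
import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("test_640x480.jpg"), cv2.COLOR_BGR2RGB)

torch.cuda.synchronize()
start = time.perf_counter()
predictor.set_image(image)          # runs the ViT image encoder
torch.cuda.synchronize()
print(f"embedding: {time.perf_counter() - start:.3f} s")

embedding = predictor.get_image_embedding()  # (1, 256, 64, 64) tensor
print(embedding.shape)
```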
I followed that web demo (https://github.com/facebookresearch/segment-anything/tree/main/demo) and still cannot get close to 50 ms. Running the ONNX model consistently takes between 90 and 100 ms. I timed it with the following to isolate just the model inference latency:
console.time("run model")
const results = await model.run(feeds);
console.timeEnd("run model")
I'm using the quantized model, and I can see that ONNX Runtime is loading ort-wasm-simd-threaded.wasm, which I believe confirms that SharedArrayBuffers (and therefore multithreading) are in use.
Is the demo missing something that would get it down to 50 ms?
Hi, we have proposed a method for rapid "segment anything", trained using just 2% of the SA-1B dataset. It achieves precision comparable to SAM in edge detection (AP, .794 vs .793) and proposal generation (mask AR@1000, 49.7 vs 51.8 for SAM-H E32), and our model is 50 times faster than SAM-H E32. The model is very simple, primarily adopting the YOLOv8-seg structure. We welcome everyone to try it out. GitHub: https://github.com/CASIA-IVA-Lab/FastSAM, arXiv: https://arxiv.org/pdf/2306.12156.pdf
The paper states: "The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ∼50ms. This runtime performance enables seamless, real-time interactive prompting of our model."
But while testing the "automatic_mask_generator_example" script with the "sam_vit_b_01ec64.pth" checkpoint and GPU inference, it takes about 1.65 s on average. Would it be possible to make the inference pipeline take only 50 ms?
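Note that the ~50 ms figure in that quote covers only the prompt encoder and mask decoder after the image embedding has already been computed, whereas the automatic mask generator runs the full pipeline (embedding plus a dense grid of prompts). A minimal sketch of the interactive path, where only the per-prompt step is timed (checkpoint path, image path, and prompt point are placeholders):

```python
# Minimal sketch (not the repo's example script): separate the one-time
# embedding step from the per-prompt decoding that the ~50 ms claim refers to.
import time
import cv2
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # heavyweight image encoder, run once per image

torch.cuda.synchronize()
start = time.perf_counter()
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)  # lightweight prompt encoder + mask decoder, run once per prompt
torch.cuda.synchronize()
print(f"per-prompt decode: {(time.perf_counter() - start) * 1000:.1f} ms")
```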