Closed ankandrew closed 1 month ago
Haven't used it on CPU, but got some numbers on GPU inference
Running TensorRT on 3090 gives me around 3400 infers/second for 140x70. Benchmarking was done on Triton Inference Server via gRPC, no post-processing was done in that case. Running natively will probably result in higher throughput.
ONNX CUDA execution takes around 0.8 ms on the 3090, resulting in around 1,200 infers/second, according to the benchmark provided by fast-plate-ocr.
upd: I could run a benchmark on CPU execution with Triton Inference Server if someone's interested. The Threadripper 2990wx would probably give some interesting results.
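For reference, the latency and throughput figures quoted above are two views of the same number when requests are handled strictly one at a time. A minimal sketch of the conversion follows; the helper names are made up for illustration, and with concurrent or batched requests (which a Triton setup may use) the real throughput can be higher than `1000 / latency` suggests.

```python
def latency_ms_to_per_second(latency_ms: float) -> float:
    """Inferences per second, assuming strictly sequential requests."""
    return 1000.0 / latency_ms


def per_second_to_latency_ms(per_second: float) -> float:
    """Average time per inference (ms) implied by a throughput figure."""
    return 1000.0 / per_second


if __name__ == "__main__":
    # Figures quoted in the comment above.
    print(f"TensorRT via Triton: {per_second_to_latency_ms(3400):.3f} ms per inference")
    print(f"ONNX CUDA: {latency_ms_to_per_second(0.8):.0f} infers/second")
```

So 3400 infers/second corresponds to roughly 0.294 ms per inference, and 0.8 ms per inference to roughly 1,250 infers/second.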
@VitalyVaryvdin Great numbers w/ TensorRT! Some questions:

1. Running the `fast-plate-ocr` benchmark with `TensorrtExecutionProvider`, did it give similar results to the ≈ 0.294 ms you experienced above (on the 3090)? i.e. running:

```python
from fast_plate_ocr import ONNXPlateRecognizer

m = ONNXPlateRecognizer('argentinian-plates-cnn-model', device='auto')
m.benchmark(include_processing=False)
```

2. Did you try FP16 precision on the GPU, and did it give any better results?
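
Before comparing providers, it can help to check which execution providers the installed `onnxruntime` build actually exposes, since the TensorRT one only shows up when the GPU build and its TensorRT dependencies are present. A minimal sketch, where `pick_providers` and `available_providers` are hypothetical helper names and only `onnxruntime.get_available_providers()` is the library's own function:

```python
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in priority order."""
    chosen = [name for name in preferred if name in available]
    # Fall back to CPU if nothing from the preferred list is reported.
    return chosen or ["CPUExecutionProvider"]


def available_providers():
    """Ask onnxruntime what it supports; empty list if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return ort.get_available_providers()


if __name__ == "__main__":
    print(pick_providers(available_providers()))
```

The resulting list is in the shape that `onnxruntime.InferenceSession(..., providers=...)` accepts.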
If you want to try NCNN on your `Threadripper 2990wx` you can try and share the following:
<details>
<summary>NCNN quick benchmark</summary>
1. Get the NCNN model files from [ncnn_model.zip](https://github.com/ankandrew/fast-plate-ocr/files/15323713/ncnn_model.zip).
2. Install ncnn with `pip install ncnn` in the virtual env that has the fast-plate-ocr package.
3. Run the following code:
```python
import cv2
import ncnn

from fast_plate_ocr.common.utils import measure_time

# TODO: Replace with your proper paths
SAMPLE_IMG = "/Users/anka/PycharmProjects/fast-plate-ocr/assets/benchmark/imgs/morning_1710_41.png"
NCNN_PARAM = "/Users/anka/Downloads/float_arg_cnn_ocr-sim-opt.param"
NCNN_MODEL = "/Users/anka/Downloads/float_arg_cnn_ocr-sim-opt.bin"


class CnnOcrNCNN:
    def __init__(self, height=70, width=140, num_threads=1, use_gpu=False):
        self.height = height
        self.width = width
        self.num_threads = num_threads
        self.use_gpu = use_gpu
        self.net = ncnn.Net()
        self.net.opt.use_vulkan_compute = self.use_gpu
        self.net.opt.num_threads = self.num_threads
        self.net.load_param(NCNN_PARAM)
        self.net.load_model(NCNN_MODEL)

    def __del__(self):
        self.net = None

    def __call__(self, img):
        img_h = img.shape[0]
        img_w = img.shape[1]
        # Build the grayscale input, resized to the model's input size
        mat_in = ncnn.Mat.from_pixels_resize(
            img,
            ncnn.Mat.PixelType.PIXEL_GRAY,
            img_w,
            img_h,
            self.width,
            self.height,
        )
        ex = self.net.create_extractor()
        # Time only the forward pass, not the input conversion above
        with measure_time() as time_taken:
            ex.input("input", mat_in)
            _, _ = ex.extract("concatenate")
        return time_taken()


if __name__ == "__main__":
    import statistics

    m = cv2.imread(SAMPLE_IMG, cv2.IMREAD_GRAYSCALE)
    # Average inference time for 1 to 6 threads
    for num_thread in range(1, 7):
        net = CnnOcrNCNN(height=70, width=140, num_threads=num_thread, use_gpu=False)
        times = [net(m) for _ in range(10_000)]
        print(f"With num_threads={num_thread} -> {statistics.mean(times)}")
```

</details>

And it would be interesting to see how your CPU performs with ONNX CPU accelerator!
Long read ahead!
I've actually used fp16 model with Triton, haven't tried fp32. Forgot to mention my complete model config is as follows: 9 slots, 70x140
Got some measurements on ONNX via fast-plate-ocr benchmark using default settings
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Executor ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CPUExecutionProvider │ 1.8616 │ 537.1779 │
└──────────────────────┴────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Executor ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CUDAExecutionProvider │ 0.9503 │ 1052.2600 │
└───────────────────────┴────────────┴───────────────┘
I wasn't able to test TRT execution with ONNX since I'm having a hard time installing all the required TensorRT dependencies. NVIDIA at its best once again: apart from cluttered documentation scattered across different sites and pages, it seems to be impossible to pip install TensorRT 8.6.1, which is the latest version currently supported by ONNX Runtime. It just results in a bunch of missing dependencies. The only way is the tarball distribution.
In order to use CUDAExecutionProvider with CU12 I have to install onnxruntime-gpu from the Azure index; onnxruntime-gpu from PyPI wants CU11.
Latest ONNX Runtime version:

| ONNX Runtime | TensorRT | CUDA |
|---|---|---|
| 1.17 | 8.6 | 11.8, 12.2 |

I could install CU11, but ONNX Runtime wants 11.8 and pip only installs 11.7, which also means the tarball distribution seems to be the only option. And even in that case, after installing all that manually, there's no way to install the missing Python dependencies wanted by TensorRT 8.6.1.
So, I guess I have to wait until ONNX Runtime 1.18 is released.
Until then, I can only provide measurements done by trtexec (no processing):
fp32 - 0.223706 ms
fp16 - 0.191187 ms
These are the lowest possible numbers on my hardware so far
I've also run the benchmark on the default `argentinian-plates-cnn-model` model:
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Executor ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CPUExecutionProvider │ 1.6551 │ 604.1800 │
└──────────────────────┴────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Executor ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CUDAExecutionProvider │ 0.8609 │ 1161.5131 │
└───────────────────────┴────────────┴───────────────┘
trtexec measurements:
fp32 - 0.205371 ms
fp16 - 0.176831 ms
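
To put the two sets of trtexec numbers side by side, here is a small sketch that computes the fp16 speedup over fp32 and the throughput implied by the fp16 latency. The latencies are copied from the measurements above; the model labels and function names are my own, and the throughput assumes one inference at a time.

```python
# trtexec latencies in ms, copied from the measurements above.
TRTEXEC_MS = {
    "first model (9 slots, 70x140)": {"fp32": 0.223706, "fp16": 0.191187},
    "argentinian-plates-cnn-model": {"fp32": 0.205371, "fp16": 0.176831},
}


def fp16_speedup(fp32_ms: float, fp16_ms: float) -> float:
    """How many times faster fp16 is than fp32."""
    return fp32_ms / fp16_ms


def summarize(results):
    """Return (model, fp16 speedup, fp16 inferences/second) per model."""
    rows = []
    for model, ms in results.items():
        speedup = round(fp16_speedup(ms["fp32"], ms["fp16"]), 2)
        per_second = round(1000.0 / ms["fp16"])
        rows.append((model, speedup, per_second))
    return rows


if __name__ == "__main__":
    for model, speedup, per_second in summarize(TRTEXEC_MS):
        print(f"{model}: fp16 is {speedup}x faster, about {per_second} infers/second")
```

On these numbers fp16 comes out roughly 1.16x to 1.17x faster than fp32 for both models.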
I wasn't able to use ncnn; the script you shared above results in `module 'ncnn' has no attribute 'Net'`
ncnn version: 1.0.20240410
Had no more time to tinker with it today, sorry.
Thanks for sharing those numbers! The TensorRT FP16 numbers are very promising, and yeah, installation for that accelerator with ONNX can be a bit of a pain.
Regarding NCNN, I couldn't reproduce the error with ncnn version 1.0.20240410, so I created this colab. Anyway, don't mind it much, idk if it's worth implementing an NCNN backend for inference. I don't like their documentation much.
I don't think it is worth spending time on anything besides ONNX tbh. It provides all possible acceleration paths out of the box, or the model can easily be converted for use in a different framework. Most of the time a CPU workload would probably end up being run on either OpenVINO/oneDNN/default anyway.
Also, with latencies as low as 0.2 ms with TensorRT there's not much to gain from going lower, since pre & post processing would take much more time. I assume lower latencies would be noticeable only at really high scale.
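
The diminishing-returns argument above can be sketched with a quick calculation. The 1.0 ms pre/post-processing cost below is a hypothetical placeholder, not a measured value, and the function name is made up; requests are assumed to run one at a time.

```python
def end_to_end_per_second(inference_ms: float, overhead_ms: float) -> float:
    """Plates per second when each plate costs inference + fixed overhead."""
    return 1000.0 / (inference_ms + overhead_ms)


if __name__ == "__main__":
    OVERHEAD_MS = 1.0  # hypothetical pre + post-processing cost, NOT measured
    for inference_ms in (0.9, 0.2, 0.1):
        rate = end_to_end_per_second(inference_ms, OVERHEAD_MS)
        print(f"inference {inference_ms} ms -> {rate:.0f} plates/second end to end")
```

Under that assumed overhead, halving inference from 0.2 ms to 0.1 ms moves the end-to-end rate by under 10%, whereas the earlier drop from roughly 0.9 ms to 0.2 ms matters a lot more.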
I agree with what you are saying. I had NCNN more in mind to accelerate CPU inference, but it might consume more memory and vary across different ecosystems. I will stay with ONNX for now as the inference backend. Closing this, and thanks for providing those numbers!
Feel free to suggest any other framework to use for doing the inference. I've considered NCNN; with `num_threads>=2` it seems to give better results on my Mac M1 CPU than `onnxruntime` with `CPUExecutionProvider`.
ONNX
Using the default settings for the benchmark, as pointed out here.
NCNN
So with `NCNN`, in the last case it could process approx. 1382 plates/second. Let me know if this is useful, for some extra ms boost.