ML Frameworks for Inference

ankandrew commented 1 month ago

Feel free to suggest any other framework for doing the inference. I've considered NCNN: with num_threads>=2 it seems to give better results on my Mac M1 CPU than onnxruntime with CPUExecutionProvider.

ONNX

Using the default settings for the benchmark, as pointed out here.

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃       Executor       ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CPUExecutionProvider │   1.9000   │   526.3271    │
└──────────────────────┴────────────┴───────────────┘

NCNN

| num_threads | Inference time (ms) |
| --- | --- |
| 1 | 2.393375668108638 |
| 2 | 1.3126576988041052 |
| 3 | 1.0058014461610583 |
| 4 | 0.827979252829391 |
| 5 | 0.8164887356135295 |
| 6 | 0.7231321594546898 |

So with NCNN, in the last case (6 threads) it could process approx. 1382 plates/second. Let me know if this is useful for getting some extra ms boost.
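The plates/second figure is just the reciprocal of the mean latency. A minimal sketch of that arithmetic with the numbers above (`plates_per_second` is a name made up here, not part of fast-plate-ocr or ncnn):

```python
def plates_per_second(latency_ms: float) -> float:
    """Convert a mean per-plate latency in milliseconds to throughput."""
    return 1000.0 / latency_ms


# Mean latencies (ms) reported above.
onnx_cpu_ms = 1.9
ncnn_6_threads_ms = 0.7231321594546898

print(f"ONNX CPU: {plates_per_second(onnx_cpu_ms):.1f} plates/s")
print(f"NCNN (6 threads): {plates_per_second(ncnn_6_threads_ms):.1f} plates/s")
print(f"Speedup: {onnx_cpu_ms / ncnn_6_threads_ms:.2f}x")
```

The ONNX table shows a slightly different plates/second value than `1000 / 1.9` because the displayed average is rounded.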

VitalyVaryvdin commented 1 month ago

Haven't used it on CPU, but got some numbers on GPU inference

Running TensorRT on 3090 gives me around 3400 infers/second for 140x70. Benchmarking was done on Triton Inference Server via gRPC, no post-processing was done in that case. Running natively will probably result in higher throughput.

ONNX CUDA execution takes around 0.8 ms on the 3090, resulting in around 1,200 infers/second, according to the benchmark provided by fast-plate-ocr.

upd: I could run a benchmark on CPU execution with Triton Inference Server if someone's interested. The Threadripper 2990wx would probably give some interesting results.

ankandrew commented 1 month ago

@VitalyVaryvdin Great numbers w/ TensorRT! Some questions:

1. Running the fast-plate-ocr benchmark with TensorrtExecutionProvider, did it give similar results to the ≈ 0.294 ms you measured above (on the 3090)? i.e. running:
   ```
   from fast_plate_ocr import ONNXPlateRecognizer

   m = ONNXPlateRecognizer('argentinian-plates-cnn-model', device='auto')
   m.benchmark(include_processing=False)
   ```
2. Did you try FP16 precision on the GPU, and did it give any better results?
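To check which accelerator a given install can actually use, something along these lines might help. It is only a sketch: `pick_providers` and `PREFERRED` are hypothetical names, and the only library call used is `onnxruntime.get_available_providers()`.

```python
# Preference order: fastest accelerator first, CPU as the fallback.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in order."""
    return [p for p in preferred if p in available]


try:
    import onnxruntime as ort

    print(pick_providers(ort.get_available_providers()))
except ImportError:
    print("onnxruntime is not installed")
```

The resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`, which tries the providers in the given order.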

If you want to try NCNN on your `Threadripper 2990wx`, you can run the following and share the results:

<details>
  <summary>NCNN quick benchmark</summary>

  1. Get the NCNN model files from [ncnn_model.zip](https://github.com/ankandrew/fast-plate-ocr/files/15323713/ncnn_model.zip).
  2. Run `pip install ncnn` in the virtual env that has the fast-plate-ocr package installed.
  3. Run the following code:
  ```python
  import cv2
  import ncnn

  from fast_plate_ocr.common.utils import measure_time

  # TODO: Replace with your proper paths
  SAMPLE_IMG = "/Users/anka/PycharmProjects/fast-plate-ocr/assets/benchmark/imgs/morning_1710_41.png"
  NCNN_PARAM = "/Users/anka/Downloads/float_arg_cnn_ocr-sim-opt.param"
  NCNN_MODEL = "/Users/anka/Downloads/float_arg_cnn_ocr-sim-opt.bin"

  class CnnOcrNCNN:
      def __init__(self, height=70, width=140, num_threads=1, use_gpu=False):
          self.height = height
          self.width = width
          self.num_threads = num_threads
          self.use_gpu = use_gpu

          self.net = ncnn.Net()
          self.net.opt.use_vulkan_compute = self.use_gpu
          self.net.opt.num_threads = self.num_threads

          self.net.load_param(NCNN_PARAM)
          self.net.load_model(NCNN_MODEL)

      def __del__(self):
          self.net = None

      def __call__(self, img):
          img_h = img.shape[0]
          img_w = img.shape[1]

          mat_in = ncnn.Mat.from_pixels_resize(
              img,
              ncnn.Mat.PixelType.PIXEL_GRAY,
              img_w,
              img_h,
              self.width,
              self.height,
          )
          ex = self.net.create_extractor()
          with measure_time() as time_taken:
              ex.input("input", mat_in)
              _, _ = ex.extract("concatenate")
          return time_taken()

  if __name__ == "__main__":
      import statistics

      m = cv2.imread(SAMPLE_IMG, cv2.IMREAD_GRAYSCALE)
      for num_thread in range(1, 7):
          net = CnnOcrNCNN(height=70, width=140, num_threads=num_thread, use_gpu=False)
          times = [net(m) for _ in range(10_000)]
          print(f"With num_threads={num_thread} -> {statistics.mean(times)}")
  ```

</details>

And it would be interesting to see how your CPU performs with ONNX CPU accelerator!
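For comparing backends on equal footing, a framework-agnostic timing loop could be used. This is a rough sketch, not what fast-plate-ocr's benchmark does internally; `benchmark`, `n_iter` and `warmup` are names assumed here for illustration.

```python
import statistics
import time


def benchmark(fn, n_iter=1000, warmup=50):
    """Time `fn()` and return (mean_ms, median_ms) over `n_iter` calls."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(n_iter):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.median(times_ms)


# Stand-in workload; replace with a call into the backend under test.
mean_ms, median_ms = benchmark(lambda: sum(range(1000)), n_iter=200, warmup=10)
print(f"mean={mean_ms:.4f} ms, median={median_ms:.4f} ms")
```

Reporting the median next to the mean makes occasional slow iterations (scheduler noise, thermal throttling) easier to spot.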

VitalyVaryvdin commented 1 month ago

Long read ahead!

I've actually used an fp16 model with Triton, haven't tried fp32. Forgot to mention that my complete model config is as follows: 9 slots, 70x140.

Got some measurements on ONNX via the fast-plate-ocr benchmark using default settings:

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃       Executor       ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CPUExecutionProvider │   1.8616   │   537.1779    │
└──────────────────────┴────────────┴───────────────┘

┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃       Executor        ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CUDAExecutionProvider │   0.9503   │   1052.2600   │
└───────────────────────┴────────────┴───────────────┘

I wasn't able to test TRT execution with ONNX since I had a hard time installing all the required TensorRT dependencies. NVIDIA at its best once again: apart from cluttered documentation scattered across different sites and pages, it seems to be impossible to pip install TensorRT 8.6.1, which is the latest version currently supported by ONNX Runtime. It just results in a bunch of missing dependencies. The only way is the tarball distribution.

In order to use CUDAExecutionProvider with CU12 I have to install onnxruntime-gpu from the Azure index; onnxruntime-gpu from PyPI wants CU11.

Latest ONNX Runtime version:

| ONNX Runtime | TensorRT | CUDA |
| --- | --- | --- |
| 1.17 | 8.6 | 11.8, 12.2 |

I could install CU11, but ONNX Runtime wants 11.8 and pip only installs 11.7, which means the tarball distribution seems to be the only option there too. And even in that case, after installing all of that manually, there's no way to install the missing Python dependencies wanted by TensorRT 8.6.1.

So, I guess I have to wait until ONNX Runtime 1.18 is released.
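The version constraint above can be written down as a small lookup. This is only a sketch: it encodes the single compatibility row quoted in this comment, and `ORT_REQUIREMENTS` and `is_supported` are hypothetical names, not part of any library.

```python
# Compatibility row quoted above; other releases are intentionally omitted.
ORT_REQUIREMENTS = {
    "1.17": {"tensorrt": "8.6", "cuda": ("11.8", "12.2")},
}


def is_supported(ort_version, tensorrt_version, cuda_version):
    """Check major.minor versions against the table; unknown ORT versions fail."""
    req = ORT_REQUIREMENTS.get(ort_version)
    if req is None:
        return False
    return tensorrt_version == req["tensorrt"] and cuda_version in req["cuda"]


print(is_supported("1.17", "8.6", "11.7"))  # CUDA 11.7 is not in the row above
print(is_supported("1.17", "8.6", "12.2"))
```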

Until then, I can only provide measurements done by trtexec (no processing):

fp32 - 0.223706 ms
fp16 - 0.191187 ms

These are the lowest possible numbers on my hardware so far

I've also run the benchmark on the default argentinian-plates-cnn-model model:

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃       Executor       ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CPUExecutionProvider │   1.6551   │   604.1800    │
└──────────────────────┴────────────┴───────────────┘

┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃       Executor        ┃ Average ms ┃ Plates/second ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ CUDAExecutionProvider │   0.8609   │   1161.5131   │
└───────────────────────┴────────────┴───────────────┘

trtexec measurements:

fp32 - 0.205371 ms
fp16 - 0.176831 ms
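To put these latencies side by side, the ratios can be computed directly. A small sketch using only the numbers reported above for the default model (`speedup` is a helper name introduced here); the figures come from different tools, so the ratios are indicative rather than a strict comparison.

```python
# Latencies in ms for the default argentinian-plates-cnn-model, as reported above.
latency_ms = {
    "onnx_cpu": 1.6551,
    "onnx_cuda": 0.8609,
    "trt_fp32": 0.205371,
    "trt_fp16": 0.176831,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return latency_ms[baseline] / latency_ms[candidate]


print(f"FP16 vs FP32 (trtexec): {speedup('trt_fp32', 'trt_fp16'):.2f}x")
print(f"trtexec FP16 vs ONNX CUDA: {speedup('onnx_cuda', 'trt_fp16'):.2f}x")
print(f"trtexec FP16 vs ONNX CPU: {speedup('onnx_cpu', 'trt_fp16'):.2f}x")
```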

I wasn't able to use ncnn: the script you shared above fails with `module 'ncnn' has no attribute 'Net'` (ncnn version: 1.0.20240410). I had no more time to tinker with it today, sorry.

ankandrew commented 1 month ago

Thanks for sharing those numbers! The TensorRT FP16 numbers are very promising, and yeah, installing that accelerator with ONNX can be a bit of a pain.

Regarding NCNN, I couldn't reproduce the error with ncnn version 1.0.20240410; I created this colab. Anyway, don't worry much about it, idk if it's worth implementing an NCNN backend for inference. I don't like their documentation much.

VitalyVaryvdin commented 1 month ago

I don't think it is worth spending time on anything else besides ONNX tbh. It provides all possible ways of acceleration out of the box, or can be easily converted for use in a different framework. Most of the time, a CPU workload would probably end up being run on either OpenVINO/oneDNN/default anyway.

Also, with latencies as low as 0.2 ms with TensorRT there's not much to gain from going lower, since pre- and post-processing would take much more time. I assume lower latencies would be noticeable only at really high scales.
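A quick way to see this effect is to add a fixed pre/post-processing cost to both inference latencies. The sketch below uses the measured inference numbers from this thread, but the `overhead_ms` value is purely hypothetical (no pre/post-processing time was measured here), and `end_to_end_speedup` is a name introduced for illustration.

```python
def end_to_end_speedup(infer_before_ms, infer_after_ms, overhead_ms):
    """Speedup of the whole pipeline when only the inference step gets faster."""
    return (overhead_ms + infer_before_ms) / (overhead_ms + infer_after_ms)


# Inference latencies from the thread (ONNX CUDA vs trtexec FP16).
cuda_ms, trt_fp16_ms = 0.8609, 0.176831

# HYPOTHETICAL pre- plus post-processing cost per plate; not measured in this thread.
overhead_ms = 2.0

print(f"inference only: {cuda_ms / trt_fp16_ms:.2f}x")
print(f"end to end: {end_to_end_speedup(cuda_ms, trt_fp16_ms, overhead_ms):.2f}x")
```

The larger the fixed overhead, the closer the end-to-end ratio gets to 1, regardless of how fast the model itself runs.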

ankandrew commented 1 month ago

I agree with what you are saying. I had NCNN in mind more for accelerating CPU inference, but it might consume more memory and vary across different ecosystems. I will stay with ONNX as the inference backend for now. Closing this, and thanks for providing those numbers!

ankandrew / fast-plate-ocr

ML Frameworks for Inference #13
