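(The following are my research notes, kept for future reference. 📝)

Research results (1)

Paper overview

This is a lightweight OCR system from Baidu, built on CRNN and related techniques.
It pays a great deal of attention to the balance among speed, model size, and accuracy.
Rather than algorithmic novelty, the paper shares how to tune a system for practical use; it is a well-grounded and very valuable piece of work.
The pipeline is [① text detection] → [② text direction classification / correction] → [③ text recognition].
The following techniques are used:
- CRNN
- Connectionist Temporal Classification (CTC) loss
- light backbone (MobileNetV3)
- data augmentation (base data augmentation + RandAugment + TIA)
  - base data augmentation (BDA): rotation, perspective distortion, motion blur, Gaussian noise, etc.
- PACT quantization (modified to suit the hard swish activation in MobileNetV3)
- cosine learning rate decay, learning rate warm-up
- tuning of the input image resolution and the network size
- FPGM pruner (pruning)
- Remove SE (reportedly, SE contributed almost nothing to accuracy on this task)
- the optimizer is Adam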
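As a rough illustration of the "cosine learning rate decay, learning rate warm-up" item in the list above, the schedule can be sketched as below. The function name `lr_at_step` and all numeric values (base rate, warm-up length, step count) are assumptions chosen for illustration, not the settings used in the paper.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=500):
    """Linear warm-up followed by cosine decay (illustrative values only)."""
    if step < warmup_steps:
        # ramp up linearly from base_lr / warmup_steps to base_lr
        return base_lr * (step + 1) / warmup_steps
    # fraction of the decay phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for s in (0, 499, 500, 5000, 10000):
    print(s, lr_at_step(s, total_steps=10000))
```

The rate rises during the first `warmup_steps` steps, then follows half a cosine period down to zero at `total_steps`.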
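About the datasets:
- Datasets for Chinese and English recognition
  - for detection: 97,000 images
  - for direction classification: 600,000 images
  - for text recognition: 17,900,000 images
- The approach was also validated on French, Korean, Japanese, and German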

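How to run the repository

Following QUICK INSTALLATION → Quick start of Chinese OCR model, in that order, first gets "Chinese + English" recognition working.
- I tried GPU inference on Google Colab, but hit a Paddle-specific error that I could not resolve
- CPU inference on Google Colab worked, and OCR ran correctly
Japanese OCR works by slightly modifying the "Chinese + English" procedure above according to the following article: https://zenn.dev/shimat/articles/6ac851fbba2e0bae05c8 (Note: the repository link in the article points to a slightly old branch, so that one point needs adjusting.)
- The following is the result of running OCR on an image bundled in the repository
Several detection models are available (accuracy-oriented, speed-oriented, and so on), so it would be good to be able to switch between them depending on the purpose. https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.0-rc1-0/doc/doc_en/models_list_en.md

ONNX conversion

It looks like this can be done with the following repository. https://github.com/PaddlePaddle/paddle2onnx

There is also a sample notebook that goes from ONNX export through to running inference. https://github.com/PaddlePaddle/Paddle2ONNX/blob/develop/examples/tutorial_dygraph2onnx.ipynb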

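Research results (2)

In conclusion, I found that converting PaddleOCR to ONNX directly is difficult. The notes below are loosely organized, but they record what I found. I would appreciate it if we could use them as a basis for discussing countermeasures.

PaddleOCR models use operators that Paddle2ONNX does not support

The Paddle2ONNX README is here: https://github.com/PaddlePaddle/Paddle2ONNX#paddle2onnx

According to the release notes, Paddle has a static graph mode and a dynamic graph mode. These presumably correspond to define-and-run and define-by-run. Static graph mode appears to be recommended when building for deployment. https://www.paddlepaddle.org.cn/documentation/docs/en/release_note_en.html

The PaddleOCR models are listed at the links below. They are saved in static graph mode. https://github.com/PaddlePaddle/PaddleOCR#pp-ocr-20-series-model-listupdate-on-dec-15 https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/models_list_en.md

Downloading the inference model files needed for Japanese OCR gives the three models below, laid out as in the following directory tree.

- ch_ppocr_mobile_v2.0_det_infer: the detection model, which outputs the bounding boxes of text
- ch_ppocr_mobile_v2.0_cls_infer: the model that estimates the direction of each detected text bounding box (by default a 2-class decision between 0° and 180°; when the latter is predicted, the bounding box image is rotated by 180°)
- japan_mobile_v2.0_rec_infer: the model that recognizes the text inside each bounding box image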

.
├── ch_ppocr_mobile_v2.0_cls_infer
│   ├── inference.pdiparams
│   ├── inference.pdiparams.info
│   └── inference.pdmodel
├── ch_ppocr_mobile_v2.0_det_infer
│   ├── inference.pdiparams
│   ├── inference.pdiparams.info
│   └── inference.pdmodel
└── japan_mobile_v2.0_rec_infer
    ├── inference.pdiparams
    ├── inference.pdiparams.info
    └── inference.pdmodel

Next, I applied Paddle2ONNX to these models, following the README. https://github.com/PaddlePaddle/Paddle2ONNX#static-computational-graph

Of the three target models, two failed to convert.

ch_ppocr_mobile_v2.0_det_infer: detection model, × failed

!paddle2onnx --model_dir ./PaddleOCR/inference/ch_ppocr_mobile_v2.0_det_infer \
             --model_filename inference.pdmodel \
             --params_filename inference.pdiparams \
             --save_file ./onnx/ch_ppocr_mobile_v2.0_det_infer.onnx \
             --opset_version 9 \
             --enable_onnx_checker False
▼
Traceback (most recent call last):
  File "/opt/anaconda3/envs/dev38/bin/paddle2onnx", line 33, in <module>
    sys.exit(load_entry_point('paddle2onnx==0.4', 'console_scripts', 'paddle2onnx')())
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/command.py", line 133, in main
    program2onnx(
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/command.py", line 106, in program2onnx
    p2o.program2onnx(
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/convert.py", line 74, in program2onnx
    export_onnx(paddle_graph, save_file, opset_version, enable_onnx_checker)
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/convert.py", line 30, in export_onnx
    onnx_graph = ONNXGraph.build(paddle_graph, opset_version, verbose)
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/graph/onnx_graph.py", line 133, in build
    OpMapper.check_support_status(paddle_graph, opset_version)
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/op_mapper/op_mapper.py", line 123, in check_support_status
    raise NotImplementedError(error_info)
NotImplementedError: 
There's 1 ops are not supported yet
=========== nearest_interp_v2 ===========

ch_ppocr_mobile_v2.0_cls_infer: direction classification model, ○ succeeded

!paddle2onnx --model_dir ./PaddleOCR/inference/ch_ppocr_mobile_v2.0_cls_infer \
             --model_filename inference.pdmodel \
             --params_filename inference.pdiparams \
             --save_file ./onnx/ch_ppocr_mobile_v2.0_cls_infer.onnx \
             --opset_version 9 \
             --enable_onnx_checker False
▼
2021-01-23 15:16:35 [INFO]  ONNX model saved in ./onnx/ch_ppocr_mobile_v2.0_cls_infer.onnx

japan_mobile_v2.0_rec_infer: text recognition model, × failed

!paddle2onnx --model_dir ./PaddleOCR/inference/japan_mobile_v2.0_rec_infer \
             --model_filename inference.pdmodel \
             --params_filename inference.pdiparams \
             --save_file ./onnx/japan_mobile_v2.0_rec_infer.onnx \
             --opset_version 11 \
             --enable_onnx_checker False
▼
Traceback (most recent call last):
  File "/opt/anaconda3/envs/dev38/bin/paddle2onnx", line 33, in <module>
    sys.exit(load_entry_point('paddle2onnx==0.4', 'console_scripts', 'paddle2onnx')())
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/command.py", line 133, in main
    program2onnx(
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/command.py", line 106, in program2onnx
    p2o.program2onnx(
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/convert.py", line 74, in program2onnx
    export_onnx(paddle_graph, save_file, opset_version, enable_onnx_checker)
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/convert.py", line 30, in export_onnx
    onnx_graph = ONNXGraph.build(paddle_graph, opset_version, verbose)
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/graph/onnx_graph.py", line 133, in build
    OpMapper.check_support_status(paddle_graph, opset_version)
  File "/opt/anaconda3/envs/dev38/lib/python3.8/site-packages/paddle2onnx/op_mapper/op_mapper.py", line 123, in check_support_status
    raise NotImplementedError(error_info)
NotImplementedError: 
There's 2 ops are not supported yet
=========== rnn ===========
=========== fill_constant_batch_size_like ===========

The conversions failed because the following Paddle operators are reported as unsupported by Paddle2ONNX.

nearest_interp_v2
fill_constant_batch_size_like
rnn

The list of operators that Paddle2ONNX supports is here: https://github.com/PaddlePaddle/Paddle2ONNX/blob/develop/docs/en/op_list.md

According to that list, nearest_interp_v2 and fill_constant_batch_size_like are supported, yet the error persisted across the several Paddle2ONNX versions I tried. rnn is not in the supported list.

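Checking whether others have hit the same problem

Searching the PaddleOCR repository for ONNX-related issues turned up several statements that ONNX will not be supported for the time being. Paddle has its own mechanism for deploying to mobile and similar targets, Paddle Lite, and that appears to take priority.

- No support for now; they will consider it https://github.com/PaddlePaddle/PaddleOCR/issues/150
- The detection model could be converted to ONNX with X2Paddle, but the recognition model could not https://github.com/PaddlePaddle/PaddleOCR/issues/213
- Conversion to ONNX with X2Paddle succeeded, but a runtime error followed https://github.com/PaddlePaddle/PaddleOCR/issues/373
- The same ONNX conversion error that I hit https://github.com/PaddlePaddle/PaddleOCR/issues/549
- ONNX is not supported, so please use Paddle Lite https://github.com/PaddlePaddle/PaddleOCR/issues/854
- A statement that ONNX export is possible from the dynamic graph mode version of PaddleOCR (however, the pretrained inference models are still in static graph mode)

While going through the ONNX-related issues in the PaddleOCR repository, the keyword X2Paddle kept coming up. It turns out to be a tool that converts models from several deep learning frameworks into Paddle. It apparently also used to include a Paddle → ONNX function, but that has since been moved to Paddle2ONNX.

Trying to use the Paddle → ONNX function of X2Paddle only prints a pointer to Paddle2ONNX; the function itself has been disabled.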

!x2paddle -f paddle2onnx \
          -m ./PaddleOCR/inference/ch_ppocr_mobile_v2.0_cls_infer/ \
          -s ./onnx
▼
paddle.__version__ = 2.0.0-rc1
Paddle to ONNX tool has been migrated to the new github: https://github.com/PaddlePaddle/paddle2onnx

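Changing the Paddle version to 1.8.x and the like produced the same message.

I also went through the OCR-related issues in the Paddle2ONNX repository, but found no way to resolve the failed ONNX conversion.

I searched the web more broadly as well, but could not find information that both resolves the PaddleOCR ONNX conversion error and is reliable. Converting the PaddleOCR inference models to ONNX directly looks difficult.

ONNX conversion appears to be possible by going through PaddleOCR2Pytorch

Looking for an indirect route instead, I found the following repository. https://github.com/frotms/PaddleOCR2Pytorch

It is a repository dedicated to converting PaddleOCR to PyTorch, and it helpfully includes the conversion code. It was implemented only recently and has few stars, but the code looks clean (my impression is that it converts the original PaddleOCR code without restructuring it much), so I will give it a try.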

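Research results (3)

In conclusion, I was able to export PaddleOCR to ONNX by using the PaddleOCR2Pytorch repository. PaddleOCR is first converted to PyTorch, and the ONNX model file is then exported with torch.onnx.

About PaddleOCR2Pytorch

PaddleOCR is provided in a static graph version and a dynamic graph version. Fortunately, there is a repository that ports the latter, the dynamic graph version, to PyTorch: PaddleOCR2Pytorch. It reproduces the functionality of PaddleOCR in PyTorch. In my own checks, the outputs appeared to match almost exactly. (This was not a rigorous check; if an equivalence test is needed, please let me know.)

The repository contains a converter program. For the basic PaddleOCR models, running it converts them into PyTorch model files (".pth").

The basic PaddleOCR models are the three stages of the pipeline in the figure below, "① text detection", "② text direction classification", and "③ text recognition", and specifically their Chinese versions.

To recognize a language other than Chinese, for example Japanese, the PaddleOCR README says the following pipeline is enough for the time being.

"① Chinese text detection" → "② Chinese text direction classification" → "③' **Japanese** text recognition"

Converters exist for "① Chinese text detection", "② Chinese text direction classification", and "③ Chinese text recognition", but not for "③' Japanese text recognition". The converter for ③' therefore has to be implemented based on the converter for ③. The network architecture is identical between the two, however; the only difference is the number of characters to recognize. "③ text recognition" is implemented with a CRNN, so changing the number of output classes is all that is needed.

What the converter does is walk through the Paddle dynamic graph in order and copy the weights into the PyTorch dynamic graph. In other words, PaddleOCR2Pytorch already defines, in PyTorch, architectures with the same structure as the models defined in PaddleOCR: Backbone, Neck, Head, and so on. Paddle and PyTorch also seem to manage weights in the same way (presumably the former was modeled on the latter), so the weights can be ported with code as simple as the following.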

self.net.state_dict()[k].copy_(torch.Tensor(para_state_dict[ppname])) # paddle -> pytorch
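To make the copy step above more concrete, here is a minimal NumPy-only sketch of the name and layout mapping such a converter typically needs before the `copy_` call. Everything in it is an assumption for illustration: the helper `paddle_to_torch_state`, the rename table, and the demo parameter names are hypothetical, and it relies on my understanding that Paddle stores BatchNorm statistics as `_mean` / `_variance` and fully connected weights as [in_features, out_features]. It is not the code from PaddleOCR2Pytorch.

```python
import numpy as np

# Hypothetical rename table: Paddle BatchNorm statistics -> PyTorch names.
BN_SUFFIXES = {"._mean": ".running_mean", "._variance": ".running_var"}

def paddle_to_torch_state(paddle_state, linear_weight_keys=()):
    """Return a dict of arrays keyed by PyTorch-style parameter names.

    paddle_state: dict mapping Paddle parameter names to numpy arrays.
    linear_weight_keys: names of fully connected weights, assumed to be
    stored as [in_features, out_features] and transposed to [out, in].
    """
    torch_state = {}
    for name, value in paddle_state.items():
        new_name = name
        for src, dst in BN_SUFFIXES.items():
            if name.endswith(src):
                new_name = name[: -len(src)] + dst
        array = np.asarray(value)
        if name in linear_weight_keys:
            array = array.T
        torch_state[new_name] = array.copy()
    return torch_state

# Demo with made-up parameter names and shapes.
demo = {
    "head.fc.weight": np.zeros((200, 2), dtype=np.float32),
    "backbone.bn._mean": np.zeros(8, dtype=np.float32),
}
converted = paddle_to_torch_state(demo, linear_weight_keys={"head.fc.weight"})
print({k: v.shape for k, v in converted.items()})
```

Each converted array could then be copied into the matching entry of the PyTorch `state_dict()`, as in the one-line snippet above.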

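Code that runs inference with the converted PyTorch weights is also provided. This whole set, from conversion through inference, is what the PaddleOCR2Pytorch repository implements.

Incidentally, PaddleOCR2Pytorch is said to be modeled on the structure of a repository called PytorchOCR. The implementation converts the PaddleOCR models and fits them into the PytorchOCR structure.

OCR algorithms come with terminology that differs somewhat from ordinary object detection. The following article carefully explains much of it, and I found it very useful as a reference when an unfamiliar algorithm name comes up. https://qiita.com/yoyoyoyoyo/items/96098354a1c0af18450d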

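Input formats of each PaddleOCR pipeline stage, and dynamic_axes for ONNX export

Overall pipeline

As mentioned earlier, PaddleOCR inference is implemented as the pipeline in the figure below.

When exporting to ONNX, the input image size has to be made variable in a way that matches the specification of each pipeline stage.

Text Detection (① text detection)

In PaddleOCR, one of the following three detection algorithms can be selected by option.

(1) DB (Real-time Scene Text Detection with Differentiable Binarization)
- an algorithm that detects text positions based on semantic segmentation

(2) EAST (An Efficient and Accurate Scene Text Detector)
- an algorithm that detects text positions based on a fully convolutional network

(3) SAST (A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning)
- an algorithm that detects text positions based on semantic segmentation

For PaddleOCR, (1) appears to be the main choice. (2) seems to aim at higher speed and (3) at higher accuracy. However, the models for (2) and (3) do not seem mature at this point. For (3), an issue mentioned that only a dictionary for the Latin alphabet exists.

In PaddleOCR2Pytorch, only (1) has been ported so far; (2) and (3) have not. No modules defining the architectures of (2) and (3) exist in the repository, and the code locations where they would go carry TBC-like comments. As a result, only algorithm (1) can be used.

In text detection, preprocessing constrains the width of the input image by either a maximum or a minimum number of pixels. The minimum can be set freely in the config. The height is adjusted to follow the width while keeping the aspect ratio.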
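The width constraint described above can be sketched as follows. This is a hypothetical helper, not PaddleOCR's actual preprocessing: the name `det_resize_shape`, the default limit of 960, and the rounding of both sides to a multiple of 32 are assumptions. The multiple of 32 is motivated by the model summary later in this section, where the backbone reduces a 960 × 1280 input to 30 × 40.

```python
def det_resize_shape(height, width, limit_width=960, limit_type="max", multiple=32):
    """Return (new_height, new_width) for the detection input.

    limit_type="max": shrink when the width exceeds limit_width.
    limit_type="min": enlarge when the width is below limit_width.
    The aspect ratio is kept, then both sides are rounded to `multiple`.
    """
    if limit_type == "max":
        ratio = limit_width / width if width > limit_width else 1.0
    elif limit_type == "min":
        ratio = limit_width / width if width < limit_width else 1.0
    else:
        raise ValueError("limit_type must be 'max' or 'min'")
    new_h = max(multiple, int(round(height * ratio / multiple)) * multiple)
    new_w = max(multiple, int(round(width * ratio / multiple)) * multiple)
    return new_h, new_w

print(det_resize_shape(1080, 1920))                                    # (544, 960)
print(det_resize_shape(300, 400, limit_width=800, limit_type="min"))   # (608, 800)
```

Because the resulting height and width depend on the input image, neither spatial axis of the detection model can be fixed at export time.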

Because both the height and the width vary with the input image, dynamic_axes has to cover both when exporting to ONNX. The code is as follows.

# Input to the model
x = torch.randn(1, 3, 960, 1280, requires_grad=True)

# Export the model
torch.onnx.export(converter.net,             # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "./onnx/ch_ppocr_server_v2.0_det_train.onnx",  # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=10,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input' : {0 : 'batch_size', 
                                           2 : 'height_size', 
                                           3 : 'width_size'},    # variable length axes
                                'output' : {0 : 'batch_size', 
                                            2 : 'height_size', 
                                            3 : 'width_size'}})

For reference, displaying the PyTorch model before export with torch-summary gives the following.

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
├─ResNet: 1-1                            [-1, 64, 240, 320]        --
|    └─ConvBNLayer: 2-1                  [-1, 32, 480, 640]        --
|    |    └─Conv2d: 3-1                  [-1, 32, 480, 640]        864
|    |    └─BatchNorm2d: 3-2             [-1, 32, 480, 640]        64
|    |    └─Activation: 3-3              [-1, 32, 480, 640]        --
|    └─ConvBNLayer: 2-2                  [-1, 32, 480, 640]        --
|    |    └─Conv2d: 3-4                  [-1, 32, 480, 640]        9,216
|    |    └─BatchNorm2d: 3-5             [-1, 32, 480, 640]        64
|    |    └─Activation: 3-6              [-1, 32, 480, 640]        --
|    └─ConvBNLayer: 2-3                  [-1, 64, 480, 640]        --
|    |    └─Conv2d: 3-7                  [-1, 64, 480, 640]        18,432
|    |    └─BatchNorm2d: 3-8             [-1, 64, 480, 640]        128
|    |    └─Activation: 3-9              [-1, 64, 480, 640]        --
|    └─MaxPool2d: 2-4                    [-1, 64, 240, 320]        --
|    └─ModuleList: 2                     []                        --
|    |    └─Sequential: 3-10             [-1, 64, 240, 320]        152,192
|    |    └─Sequential: 3-11             [-1, 128, 120, 160]       525,568
|    |    └─Sequential: 3-12             [-1, 256, 60, 80]         2,099,712
|    |    └─Sequential: 3-13             [-1, 512, 30, 40]         8,393,728
├─DBFPN: 1-2                             [-1, 256, 240, 320]       --
|    └─Conv2d: 2-5                       [-1, 256, 30, 40]         131,072
|    └─Conv2d: 2-6                       [-1, 256, 60, 80]         65,536
|    └─Conv2d: 2-7                       [-1, 256, 120, 160]       32,768
|    └─Conv2d: 2-8                       [-1, 256, 240, 320]       16,384
|    └─Conv2d: 2-9                       [-1, 64, 30, 40]          147,456
|    └─Conv2d: 2-10                      [-1, 64, 60, 80]          147,456
|    └─Conv2d: 2-11                      [-1, 64, 120, 160]        147,456
|    └─Conv2d: 2-12                      [-1, 64, 240, 320]        147,456
├─DBHead: 1-3                            [[-1, 1, 960, 1280]]      --
|    └─Head: 2-13                        [-1, 1, 960, 1280]        --
|    |    └─Conv2d: 3-14                 [-1, 64, 240, 320]        147,456
|    |    └─BatchNorm2d: 3-15            [-1, 64, 240, 320]        128
|    |    └─Activation: 3-16             [-1, 64, 240, 320]        --
|    |    └─ConvTranspose2d: 3-17        [-1, 64, 480, 640]        16,448
|    |    └─BatchNorm2d: 3-18            [-1, 64, 480, 640]        128
|    |    └─Activation: 3-19             [-1, 64, 480, 640]        --
|    |    └─ConvTranspose2d: 3-20        [-1, 1, 960, 1280]        257
==========================================================================================
Total params: 12,199,969
Trainable params: 12,199,969
Non-trainable params: 0
Total mult-adds (G): 42.87
==========================================================================================
Input size (MB): 14.06
Forward/backward pass size (MB): 1233.40
Params size (MB): 46.54
Estimated Total Size (MB): 1294.00
==========================================================================================

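Detection Boxes Rectify (② text direction classification)

The input format for direction classification is very simple: the image is resized to a fixed height and width. Each region (bounding box) extracted by the preceding text detection step is cropped and resized to that fixed size. The default input size is as follows.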

dc['cls_image_shape'] = '3, 48, 192'

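That is 3 RGB channels, 48 pixels high, and 192 pixels wide.

Since the spatial size is fixed, dynamic_axes only needs to cover the batch dimension when exporting to ONNX. The code is as follows.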

# Input to the model
x = torch.randn(1, 3, 48, 192, requires_grad=True)

# Export the model
torch.onnx.export(converter.net,             # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "./onnx/ch_ppocr_mobile_v2.0_cls_train.onnx",  # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=10,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                                'output' : {0 : 'batch_size'}})
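How the two-class output of this model is used, rotating a crop by 180° when the "180" class is predicted, can be sketched as below. The function name `correct_direction`, the label order, and the confidence threshold of 0.9 are assumptions for illustration.

```python
import numpy as np

def correct_direction(crop, probs, labels=("0", "180"), threshold=0.9):
    """Rotate an H x W x C crop by 180 degrees when the classifier
    predicts the '180' label with probability above `threshold`."""
    idx = int(np.argmax(probs))
    if labels[idx] == "180" and probs[idx] > threshold:
        # two successive 90-degree turns in the image plane
        return np.rot90(crop, 2).copy()
    return crop

crop = np.arange(6).reshape(2, 3, 1)          # tiny stand-in for a cropped box
flipped = correct_direction(crop, np.array([0.02, 0.98]))
kept = correct_direction(crop, np.array([0.40, 0.60]))
print(flipped[..., 0])
print(kept[..., 0])
```

A crop whose "180" probability does not clear the threshold is passed on to recognition unchanged.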

For reference, displaying the PyTorch model before export with torch-summary gives the following.

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
├─MobileNetV3: 1-1                       [-1, 200, 1, 48]          --
|    └─ConvBNLayer: 2-1                  [-1, 8, 24, 96]           --
|    |    └─Conv2d: 3-1                  [-1, 8, 24, 96]           216
|    |    └─BatchNorm2d: 3-2             [-1, 8, 24, 96]           16
|    |    └─Activation: 3-3              [-1, 8, 24, 96]           --
|    └─Sequential: 2-2                   [-1, 32, 2, 96]           --
|    |    └─ResidualUnit: 3-4            [-1, 8, 12, 96]           290
|    |    └─ResidualUnit: 3-5            [-1, 8, 6, 96]            712
|    |    └─ResidualUnit: 3-6            [-1, 8, 6, 96]            944
|    |    └─ResidualUnit: 3-7            [-1, 16, 3, 96]           2,280
|    |    └─ResidualUnit: 3-8            [-1, 16, 3, 96]           9,382
|    |    └─ResidualUnit: 3-9            [-1, 16, 3, 96]           9,382
|    |    └─ResidualUnit: 3-10           [-1, 16, 3, 96]           3,322
|    |    └─ResidualUnit: 3-11           [-1, 16, 3, 96]           4,172
|    |    └─ResidualUnit: 3-12           [-1, 32, 2, 96]           13,610
|    |    └─ResidualUnit: 3-13           [-1, 32, 2, 96]           38,914
|    |    └─ResidualUnit: 3-14           [-1, 32, 2, 96]           38,914
|    └─ConvBNLayer: 2-3                  [-1, 200, 2, 96]          --
|    |    └─Conv2d: 3-15                 [-1, 200, 2, 96]          6,400
|    |    └─BatchNorm2d: 3-16            [-1, 200, 2, 96]          400
|    |    └─Activation: 3-17             [-1, 200, 2, 96]          --
|    └─MaxPool2d: 2-4                    [-1, 200, 1, 48]          --
├─ClsHead: 1-2                           [-1, 2]                   --
|    └─AdaptiveAvgPool2d: 2-5            [-1, 200, 1, 1]           --
|    └─Linear: 2-6                       [-1, 2]                   402
==========================================================================================
Total params: 129,356
Trainable params: 129,356
Non-trainable params: 0
Total mult-adds (M): 2.22
==========================================================================================
Input size (MB): 0.11
Forward/backward pass size (MB): 0.87
Params size (MB): 0.49
Estimated Total Size (MB): 1.47
==========================================================================================

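Text Recognition (③ text recognition)

The input format for text recognition is a little more involved. The model is a CRNN. By default the height is fixed at 32 pixels, while the width is variable. Text lines vary in length, and the CRNN is what makes it possible to handle that.

Recognition is run once per detected text region. PaddleOCR sorts the regions by aspect ratio, splits them into groups of a fixed batch size, and resizes every image in a group to the largest width in that group.

For example, suppose the detection results (bounding boxes) are as follows and the batch size is 2.

Detection 1: height 32 pixels, width 100 pixels
Detection 2: height 32 pixels, width 110 pixels
Detection 3: height 32 pixels, width 120 pixels
Detection 4: height 32 pixels, width 130 pixels
Detection 5: height 64 pixels, width 280 pixels

The inputs to the CRNN recognition model are then grouped as follows. (Detection 5 is scaled to a height of 32 pixels, which makes it 140 pixels wide.)

Batch 1: input shape = (2, 3, 32, 110)
Batch 2: input shape = (2, 3, 32, 130)
Batch 3: input shape = (1, 3, 32, 140)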
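The grouping above can be reproduced with a short sketch. The function name `recognition_batches` and its parameters are hypothetical; it only computes the resulting input shapes and does not resize any image.

```python
import math

def recognition_batches(boxes_hw, batch_size=2, target_height=32, channels=3):
    """Sort boxes by aspect ratio, group them, and return each batch's input shape."""
    ratios = sorted(w / h for h, w in boxes_hw)
    shapes = []
    for start in range(0, len(ratios), batch_size):
        group = ratios[start:start + batch_size]
        # every image in the group is resized to the widest one
        width = int(math.ceil(target_height * max(group)))
        shapes.append((len(group), channels, target_height, width))
    return shapes

boxes = [(32, 100), (32, 110), (32, 120), (32, 130), (64, 280)]
batch_shapes = recognition_batches(boxes)
print(batch_shapes)
# → [(2, 3, 32, 110), (2, 3, 32, 130), (1, 3, 32, 140)]
```

Both the batch dimension and the width change from batch to batch, while the channel count and the height stay fixed.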

Because the width differs from batch to batch, dynamic_axes has to cover the width when exporting to ONNX. The code is as follows.

# Input to the model
x = torch.randn(1, 3, 32, 320, requires_grad=True)

# Export the model
torch.onnx.export(converter.net,             # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "./onnx/japan_mobile_v2.0_rec_infer.onnx",  # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=10,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input' : {0 : 'batch_size', 
                                           3 : 'width_size'},    # variable length axes
                                'output' : {0 : 'batch_size', 
                                            1 : 'width_size'}})
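The exported recognition model returns one score vector per horizontal position, which is why axis 1 of the output is declared dynamic together with the input width. Turning that output into text is usually done with greedy CTC decoding; a minimal sketch follows. The function name `ctc_greedy_decode`, the blank index 0, and the offset of 1 between class index and character are assumptions, not values read from the repository.

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Greedy CTC decoding of one sample.

    probs: array of shape (time_steps, num_classes).
    charset: characters for class indices 1..num_classes-1 (index 0 = blank).
    """
    best = np.argmax(probs, axis=1)
    chars = []
    prev = None
    for idx in best:
        # collapse repeated predictions, then drop blanks
        if idx != prev and idx != blank:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)

# One-hot scores for the index sequence 1,1,0,1,2,2,0,3 with charset "abc".
scores = np.eye(4)[[1, 1, 0, 1, 2, 2, 0, 3]]
decoded = ctc_greedy_decode(scores, "abc")
print(decoded)  # → aabc
```

The blank between the two runs of class 1 is what lets the decoder keep both "a" characters instead of merging them.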

For reference, displaying the PyTorch model before export with torch-summary gives the following.

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
├─MobileNetV3: 1-1                       [-1, 200, 1, 48]          --
|    └─ConvBNLayer: 2-1                  [-1, 8, 24, 96]           --
|    |    └─Conv2d: 3-1                  [-1, 8, 24, 96]           216
|    |    └─BatchNorm2d: 3-2             [-1, 8, 24, 96]           16
|    |    └─Activation: 3-3              [-1, 8, 24, 96]           --
|    └─Sequential: 2-2                   [-1, 32, 2, 96]           --
|    |    └─ResidualUnit: 3-4            [-1, 8, 12, 96]           290
|    |    └─ResidualUnit: 3-5            [-1, 8, 6, 96]            712
|    |    └─ResidualUnit: 3-6            [-1, 8, 6, 96]            944
|    |    └─ResidualUnit: 3-7            [-1, 16, 3, 96]           2,280
|    |    └─ResidualUnit: 3-8            [-1, 16, 3, 96]           9,382
|    |    └─ResidualUnit: 3-9            [-1, 16, 3, 96]           9,382
|    |    └─ResidualUnit: 3-10           [-1, 16, 3, 96]           3,322
|    |    └─ResidualUnit: 3-11           [-1, 16, 3, 96]           4,172
|    |    └─ResidualUnit: 3-12           [-1, 32, 2, 96]           13,610
|    |    └─ResidualUnit: 3-13           [-1, 32, 2, 96]           38,914
|    |    └─ResidualUnit: 3-14           [-1, 32, 2, 96]           38,914
|    └─ConvBNLayer: 2-3                  [-1, 200, 2, 96]          --
|    |    └─Conv2d: 3-15                 [-1, 200, 2, 96]          6,400
|    |    └─BatchNorm2d: 3-16            [-1, 200, 2, 96]          400
|    |    └─Activation: 3-17             [-1, 200, 2, 96]          --
|    └─MaxPool2d: 2-4                    [-1, 200, 1, 48]          --
├─ClsHead: 1-2                           [-1, 2]                   --
|    └─AdaptiveAvgPool2d: 2-5            [-1, 200, 1, 1]           --
|    └─Linear: 2-6                       [-1, 2]                   402
==========================================================================================
Total params: 129,356
Trainable params: 129,356
Non-trainable params: 0
Total mult-adds (M): 2.22
==========================================================================================
Input size (MB): 0.11
Forward/backward pass size (MB): 0.87
Params size (MB): 0.49
Estimated Total Size (MB): 1.47
==========================================================================================

A general approach for future PaddlePaddle to PyTorch conversions

From what I found, converting a PaddlePaddle model directly to ONNX looks difficult. When a Paddle model needs to be converted to ONNX in the future, it will probably be necessary to convert it to PyTorch first and then export it with torch.onnx. For the Paddle to PyTorch step, the following references describe the approach. The PaddleOCR2Pytorch repository basically follows the same policy. https://blog.csdn.net/qq_22764813/article/details/108019285 https://github.com/maomaoyuchengzi/paddlepaddle_param_to_pyotrch

In short, it comes down to implementing the same network structure as the Paddle model in PyTorch and copying the weights over.

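Research results (4)

Removing "from shapely.geometry import Polygon" and "import pyclipper"

The same situation may come up again, so I am recording it here just in case.

When Paddle OCR moves from text detection to text recognition, it performs the following processing.

[Figure: illustration of the processing between text detection and text recognition]

The geometric calculations for this processing use the libraries shapely.geometry and pyclipper. Neither is listed in the requirements.txt of ailia, and both are niche libraries, so I want to replace them with calculation logic based on numpy.

shapely.geometry is used for the following calculation.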

import numpy as np
from shapely.geometry import Polygon

unclip_ratio = 1.6

box = np.array([[1043., 118.], 
                [1267., 118.],
                [1267., 141.],
                [1043., 141.]])

poly = Polygon(box)
distance = poly.area * unclip_ratio / poly.length

print('distance =', distance)

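For this box the area is 224 × 23 = 5152 and the perimeter is 494, so the printed value is distance ≈ 16.69.

shapely.geometry is used only to obtain the area and the perimeter of the bbox, so I was able to replace it with the following logic.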

import numpy as np

unclip_ratio = 1.6

box = np.array([[1043., 118.], 
                [1267., 118.],
                [1267., 141.],
                [1043., 141.]])

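poly_area = (np.sqrt(np.sum((box[0, :] - box[1, :])**2)) *
             np.sqrt(np.sum((box[0, :] - box[3, :])**2)))
poly_length = (np.sqrt(np.sum((box[0, :] - box[1, :])**2)) +
               np.sqrt(np.sum((box[0, :] - box[3, :])**2))) * 2
distance = poly_area * unclip_ratio / poly_length

print('distance =', distance)

For this box, the result matches the shapely version (distance ≈ 16.69). Note that this replacement takes the product of two adjacent side lengths as the area, so it assumes the box is a rectangle.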
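If a box is not an exact rectangle, the area can be computed with the shoelace formula and the perimeter by summing the edge lengths, still with numpy only. The following is a sketch of that variant, not the code adopted here; `unclip_distance` is a hypothetical helper name.

```python
import numpy as np

def unclip_distance(box, unclip_ratio=1.6):
    """area * unclip_ratio / perimeter for a simple polygon given as (N, 2) vertices."""
    box = np.asarray(box, dtype=np.float64)
    nxt = np.roll(box, -1, axis=0)  # next vertex of each edge
    # shoelace formula for the polygon area
    area = 0.5 * abs(np.sum(box[:, 0] * nxt[:, 1] - nxt[:, 0] * box[:, 1]))
    perimeter = np.sum(np.linalg.norm(nxt - box, axis=1))
    return area * unclip_ratio / perimeter

box = np.array([[1043., 118.],
                [1267., 118.],
                [1267., 141.],
                [1043., 141.]])
print('distance =', unclip_distance(box))
```

For the rectangular box used above this gives the same value as the two-side product, and it stays valid when the four corners form a general quadrilateral.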

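The resulting distance is used to generate a rounded rectangle that encloses the detected bbox one size larger. Specifically, it is used as the radius of the rounded corners.

The corner radius is the concept shown in the figure below. A larger radius makes the corners rounder, and a smaller radius makes them sharper.

pyclipper is used to create that rounded enclosing rectangle.

For example, the code looks like this.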

# import...
import pyclipper
import numpy as np
import matplotlib.pyplot as plt

# set value...
box = np.array([[100, 150], 
                [300, 150], 
                [300, 100], 
                [100, 100]])

# set param...
distance = 20.0

# ----------------------------------------------------------------
# target process
pco = pyclipper.PyclipperOffset()

pco.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)

expanded = pco.Execute(distance)
expanded = np.array(expanded[0])
# ----------------------------------------------------------------

# show...
plt.figure(figsize=(10, 10), dpi=100)
plt.scatter(box[:, 0], box[:, 1], s=300, alpha=0.5)
plt.plot(np.concatenate([box[:, 0], box[[0], 0]]), 
         np.concatenate([box[:, 1], box[[0], 1]]), linewidth=5, alpha=0.5)
plt.scatter(expanded[:, 0], expanded[:, 1], s=300, alpha=0.5)
plt.plot(np.concatenate([expanded[:, 0], expanded[[0], 0]]), 
         np.concatenate([expanded[:, 1], expanded[[0], 1]]), linewidth=5, alpha=0.5)
plt.grid(True)
plt.axis('equal')
plt.gca().invert_yaxis()
plt.show()

The result is as follows.

When the original bbox is tilted, the code and the result are as follows.

# import...
import pyclipper
import numpy as np
import matplotlib.pyplot as plt

# set value...
box = np.array([[563.15510, 306.31964],
                [952.10626, 352.72858],
                [946.29425, 401.43854],
                [557.34310, 355.02960]])

# set param...
distance = 20.0

# ----------------------------------------------------------------
# target process
pco = pyclipper.PyclipperOffset()

pco.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)

expanded = pco.Execute(distance)
expanded = np.array(expanded[0])
# ----------------------------------------------------------------

# show...
plt.figure(figsize=(10, 10), dpi=100)
plt.scatter(box[:, 0], box[:, 1], s=300, alpha=0.5)
plt.plot(np.concatenate([box[:, 0], box[[0], 0]]), 
         np.concatenate([box[:, 1], box[[0], 1]]), linewidth=5, alpha=0.5)
plt.scatter(expanded[:, 0], expanded[:, 1], s=300, alpha=0.5)
plt.plot(np.concatenate([expanded[:, 0], expanded[[0], 0]]), 
         np.concatenate([expanded[:, 1], expanded[[0], 1]]), linewidth=5, alpha=0.5)
plt.grid(True)
plt.axis('equal')
plt.gca().invert_yaxis()
plt.show()

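Implementing the above from scratch with numpy gives the following algorithm.

(1) Rotate the blue bbox so that its upper edge is parallel to the x axis
(2) Place the coordinates of a circle of the given radius at each of the four corners of the blue bbox
(3) Of the points from (2), keep only those that do not fall inside the rounded rectangle

Written as a program, it looks like the following. Some parts of the code may look slightly irregular; those are the places I adjusted so that the output matches pyclipper.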

# import...
import numpy as np
import matplotlib.pyplot as plt

# set value...
box = np.array([[563.15510, 306.31964],
                [952.10626, 352.72858],
                [946.29425, 401.43854],
                [557.34310, 355.02960]])

# set param...
distance = 20.0

# ----------------------------------------------------------------
# target process

# calc angle between upper side of bbox with x axis
# (arccos yields an unsigned angle, in degrees)
u = box[1] - box[0]
v = box[1] - box[0]
v[1] = 0
i = np.inner(u, v)
n = np.linalg.norm(u) * np.linalg.norm(v)
c = i / n
angle = np.rad2deg(np.arccos(np.clip(c, -1.0, 1.0)))

# rotate coordinate
def xyrotate(coord_xy, angle, center_xy):
    # rotate each point by -angle (degrees) around center_xy
    # make variable for output
    coord_xy_rotated = np.zeros(np.shape(coord_xy))
    # loop of coordinate
    for coord_i in range(len(coord_xy)):
        # set x, y
        coord_x_tmp = coord_xy[coord_i, 0]
        coord_y_tmp = coord_xy[coord_i, 1]
        # slide to suit center of rotation
        coord_x_tmp -= center_xy[0]
        coord_y_tmp -= center_xy[1]
        # exec rotation
        coord_xy_tmp        = np.array([coord_x_tmp, coord_y_tmp])[:, np.newaxis]
        rotation_matrix_tmp = np.array([[np.cos(-angle/180*np.pi), 
                                         -np.sin(-angle/180*np.pi)], 
                                        [np.sin(-angle/180*np.pi), 
                                         np.cos(-angle/180*np.pi)]])
        coord_xy_tmp        = rotation_matrix_tmp @ coord_xy_tmp
        # re-slide to suit center of rotation
        coord_xy_tmp     = coord_xy_tmp.reshape(-1)
        coord_xy_tmp[0] += center_xy[0]
        coord_xy_tmp[1] += center_xy[1]
        # stock
        coord_xy_rotated[coord_i, :] = coord_xy_tmp

    return coord_xy_rotated

# exec coordinates rotation 
box_ = xyrotate(coord_xy=box, angle=angle, center_xy=np.mean(box, axis=0))

# calculate circle coordinates
pitch = 10
x_upper = np.cos(np.arange(1, 0, (-1/pitch)) * np.pi) * distance
y_upper = -np.sqrt(distance**2 - x_upper**2)
x_lower = np.cos(np.arange(0, 1, (1/pitch)) * np.pi) * distance
y_lower = np.sqrt(distance**2 - x_lower**2)
x = np.concatenate([x_upper, x_lower])
y = np.concatenate([y_upper, y_lower])
circle = np.concatenate([x[:, np.newaxis], y[:, np.newaxis]], axis=1)

# calculate circle coordinates around four corners
expanded = []
for box_tmp in box_:
    expanded.append(circle + box_tmp)
expanded = np.array(expanded).reshape(-1, 2)

# narrow down circle coordinates to outside 
expanded = expanded[[25, 26, 27, 28, 29, 30, 50, 51, 52, 53, 54, 55, 
                     75, 76, 77, 78, 79, 60,  0,  1,  2,  3,  4,  5]]

# exec coordinates re-rotation 
expanded = xyrotate(coord_xy=expanded, angle=-angle, center_xy=np.mean(box_, axis=0))

# ----------------------------------------------------------------

# show...
plt.figure(figsize=(10, 10), dpi=100)
plt.scatter(box[:, 0], box[:, 1], s=300, alpha=0.5)
plt.plot(np.concatenate([box[:, 0], box[[0], 0]]), 
         np.concatenate([box[:, 1], box[[0], 1]]), linewidth=5, alpha=0.5)
plt.scatter(expanded[:, 0], expanded[:, 1], s=300, alpha=0.5)
plt.plot(np.concatenate([expanded[:, 0], expanded[[0], 0]]), 
         np.concatenate([expanded[:, 1], expanded[[0], 1]]), linewidth=5, alpha=0.5)
plt.grid(True)
plt.axis('equal')
plt.gca().invert_yaxis()
plt.show()

expanded_ = expanded

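The result is as follows.

Comparing the output values gave the following. For this case, there did not appear to be any problem.

However, when I then checked equivalence on several other cases, small differences appeared in the order of the points, the criteria by which points are selected, and the number of points generated. I confirmed that these differences stay within a range that does not affect the actual processing by taking the difference of the results of the subsequent processing: the differences are only about 1% or less of the scale of the values. Just in case, the version used for the equivalence check is kept in the GitHub history (33e8ccc). The final results showed no problems either.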
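One way to compare two expanded outlines that is insensitive to point order and point count is to compare their bounding extents relative to the coordinate scale. The sketch below is only an illustration of that idea with made-up points; it is not necessarily the check that was run here, and the helper names `extent` and `relative_extent_diff` are hypothetical.

```python
import numpy as np

def extent(points):
    """Return [x_min, y_min, x_max, y_max] of an (N, 2) point set."""
    p = np.asarray(points, dtype=np.float64)
    return np.concatenate([p.min(axis=0), p.max(axis=0)])

def relative_extent_diff(points_a, points_b):
    """Largest difference between the two extents, relative to the coordinate scale."""
    ea, eb = extent(points_a), extent(points_b)
    scale = np.max(np.abs(ea))
    return np.max(np.abs(ea - eb)) / scale

# Made-up outlines: different order, different number of points.
outline_a = [[80, 80], [320, 80], [320, 170], [80, 170]]
outline_b = [[320.5, 170], [80, 80.2], [200, 80.2], [80, 170], [320.5, 80.2]]
diff = relative_extent_diff(outline_a, outline_b)
print(diff, diff < 0.01)
```

A value below 0.01 corresponds to the "about 1% or less of the scale" criterion mentioned above.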

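That concludes my notes on what was done to remove the libraries.

The discussion of how to train with PaddleOCR continues at https://github.com/axinc-ai/retrain-paddle-ocr/issues/1. It aims to improve the accuracy of the text recognition models for Japanese and English.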

axinc-ai / ailia-models

ADD paddle ocr #310
