dhkim0225 / 1day_1paper

Read one paper every day (weekdays only)


[81] Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting #110

Open dhkim0225 opened 2 years ago


Is everyone with "Bai" in their name good at OCR? (~following Xiang Bai-sensei..~)

Proposes a pretraining strategy for OCR tasks.

INTRO

The paper illustrates three pipelines in a figure:

OCR pipeline
Vision-Language Pretraining (VLP) pipeline
proposed pipeline

Essentially the VL pipeline taken as is. The difference is the character-level text encoding. Image-text pair annotations are required. Only the backbone is transferred; the decoder and the text encoder are discarded entirely.

Methodology

Character-Aware Text Encoder

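The character embedding (ce) is computed as follows.
n is the number of text instances (indexed by i).
t_i is a particular word, and c^i_j is each character inside t_i.
W_c is the character embedding matrix.
PE is a learnable positional encoding (not sinusoidal).

They also trained the pipeline above with a commonly used text encoder instead of the character encoder. The paper visualizes the attention of the visual encoder, which is what eventually gets transferred, for both variants. The character encoder gives the better result.

During pretraining, only text of up to 25 characters is used as input. The recognizer batch is capped at 3 per image. In other words, only 3 words per image are used for training.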
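As a rough illustration of the character embedding described above, the sketch below looks up each character in an embedding table and adds a learned position vector, roughly $ce^i_j = W_c \cdot c^i_j + PE(j)$, which is one reading of the definitions in these notes. The lowercase vocabulary, the embedding size of 8, and the helper names `make_table` and `embed_word` are assumptions for illustration, not the paper's code; random values stand in for learned parameters.

```python
import random


def make_table(rows, dim, rng):
    """Random (rows x dim) table standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed_word(word, vocab, W_c, PE, max_len=25):
    """Return one vector per character: W_c[char] + PE[position]."""
    out = []
    for j, ch in enumerate(word[:max_len]):
        char_vec = W_c[vocab[ch]]  # row of the character embedding matrix
        pos_vec = PE[j]            # learnable (not sinusoidal) position vector
        out.append([c + p for c, p in zip(char_vec, pos_vec)])
    return out


rng = random.Random(0)
vocab = {ch: k for k, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
dim = 8  # hypothetical embedding size
W_c = make_table(len(vocab), dim, rng)
PE = make_table(25, dim, rng)

ce = embed_word("text", vocab, W_c, PE)
```

Here the first and last "t" in "text" share the same row of `W_c` but get different position vectors, so their embeddings differ.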
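A minimal sketch of the two limits above (25 characters, 3 words per image). How the paper picks the 3 words and whether over-length words are truncated or dropped is not stated in these notes; random sampling and truncation are assumptions here, and `select_words` is a hypothetical helper.

```python
import random

MAX_CHARS = 25       # max-length used in pretraining
WORDS_PER_IMAGE = 3  # recognizer batch per image


def select_words(words, rng, max_words=WORDS_PER_IMAGE, max_chars=MAX_CHARS):
    """Pick at most `max_words` words for one image and cap their length."""
    if len(words) > max_words:
        # Assumption: uniform random sampling without replacement.
        words = rng.sample(words, max_words)
    # Assumption: longer words are truncated rather than discarded.
    return [w[:max_chars] for w in words]


rng = random.Random(0)
annotations = ["OPEN", "24", "HOURS", "PHARMACY", "x" * 40]
picked = select_words(annotations, rng)
```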

Visual-Textual Decoder

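Six stacked decoder layers.
Since only 3 words per image are used for training, there are 3 outputs as well.
Each of the 25 character-encoder outputs goes in as a query, so the output shape is, if you insist, (B, 25, 3).

Masking is simply one mask per word. There is no analysis of the masking ratio or anything like that.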
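The one-mask-per-word scheme can be sketched as below. The `[MASK]` token string, uniform choice of position, and the helper name `mask_one_char` are assumptions for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_one_char(word, rng):
    """Replace exactly one character of `word` with the mask token.

    Returns (masked character list, masked position, original character).
    """
    if not word:
        raise ValueError("cannot mask an empty word")
    chars = list(word)
    pos = rng.randrange(len(chars))  # assumption: position chosen uniformly
    target = chars[pos]
    chars[pos] = MASK
    return chars, pos, target


rng = random.Random(0)
masked, pos, target = mask_one_char("HOURS", rng)
```

The decoder is then asked to predict `target` at position `pos` from the image features and the remaining characters.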

Network Optimization

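The CLS loss is a cross-entropy (CE) loss computed only at the masked positions.

The CL loss is borrowed more or less from CLIP: an image-side contrastive loss and a text-side contrastive loss, added together.

The final loss combines the two. No scaling weights are applied.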
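A small sketch of a cross-entropy that only counts masked positions. Averaging over the masked positions and the function names are assumptions; the notes only say the loss is restricted to the masked locations.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_ce(logits, targets, mask):
    """Mean cross-entropy over positions where mask is True.

    logits:  list of per-position logit lists, shape [L][V]
    targets: list of target class indices, length L
    mask:    list of bools, length L (True = masked character)
    """
    losses = [
        -log_softmax(lg)[t]
        for lg, t, m in zip(logits, targets, mask)
        if m
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Position 0 is masked (uniform logits); position 1 is ignored.
loss = masked_ce([[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 1.0, 1.0]], [2, 0], [True, False])
```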
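The CLIP-style symmetric loss can be sketched in plain Python as follows. The temperature value of 0.07, L2 normalization, and the name `clip_style_loss` are assumptions taken from common CLIP practice, not details confirmed by these notes.

```python
import math


def _log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Image-to-text plus text-to-image contrastive loss over one batch.

    Row i of each input is assumed to be a matched image/text pair.
    """
    img = [_normalize(v) for v in image_feats]
    txt = [_normalize(v) for v in text_feats]
    n = len(img)
    # Scaled cosine similarities: sim[i][j] = <img_i, txt_j> / temperature
    sim = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    loss_i2t = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text side
    loss_t2i = -sum(_log_softmax(sim_t[j])[j] for j in range(n)) / n
    return loss_i2t + loss_t2i  # plain sum of the two directions


aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired features (`aligned`) yield a lower loss than mismatched ones (`shuffled`), which is the signal that pulls the visual backbone toward the text content.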

Note

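Wouldn't adopting the TCL (#104) formulation bring further gains? (~Of course, the TCL approach does add a lot of parameters.~)
Attaching ELECTRA-style pretraining at the back end would probably work well.
max-length 25 may be a disadvantage for Japanese.
- Of course, it can be increased. But Japanese has many instances longer than 25 characters.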

Impl. Detail

Pretraining

Encoder Backbone: ResNet-50
input image: 512x512 resized
optimizer: AdamW
scheduler
- init-LR: 1e-4
- cosine 1cycle
V100 * 8
640 batchsize
max-length 25
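For reference, a sketch of a single-cycle cosine schedule starting from the listed init-LR. "cosine 1cycle" is read here as one cosine decay with no restarts and no warmup; that reading, the final LR of 0, and the function name `cosine_lr` are assumptions.

```python
import math


def cosine_lr(step, total_steps, init_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` for one cosine decay from init_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (init_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle, and end of a hypothetical 100-step run.
schedule = [cosine_lr(s, 100) for s in (0, 50, 100)]
```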

Finetuning

Follows each model's own settings:

PSENet
DB
FCENet
TextBPN
MTSv3

Results

ICDAR19-LSVT Detection

‘+Ours’ == model pretrained on the 400,000 IC19-LSVT images.

ICDAR19-LSVT E2E

‘+Ours’ == model pretrained on the 400,000 IC19-LSVT images.
NED == Normalized Edit Distance

Performance by pretraining data portion

PSENet (Synth pretrain + TotalText finetune)
‘+Ours’ == synthtext pretrained model
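For readers unfamiliar with NED, a minimal sketch of one common definition: Levenshtein distance divided by the length of the longer string. The exact normalization and how the benchmark aggregates it over a dataset are not given in these notes, so treat both as assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred, gt):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(pred), len(gt))
    return edit_distance(pred, gt) / longest if longest else 0.0


ned = normalized_edit_distance("kitten", "sitting")
```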

Comparison with other pretraining techniques

The pretraining set is synthtext for all methods.
‘+Ours’ == synthtext pretrained model

CTW 1500 Detection

‘+Ours’ == synthtext pretrained model

TotalText Detection

‘+Ours’ == synthtext pretrained model

IC15 Detection

‘+Ours’ == synthtext pretrained model

IC15 & TotalText E2E

‘+Ours’ == synthtext pretrained model

Ablation

PSENet (Synth pretrain + TotalText finetune)

CAE == Character Aware Encoder
VTD == Visual Textual Decoder
BCL == Batch-level Contrastive Loss