baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
https://huggingface.co/spaces/baudm/PARSeq-OCR
Apache License 2.0

Thinner pre-trained model of tiny version #107

Closed — clrk7 closed this issue 10 months ago

clrk7 commented 1 year ago

Hello There!

First of all, I'm really impressed by how PARSeq works on my machine. I get a lot more correct outputs than I did with EasyOCR or PP-OCR. Since I only detect numbers of up to 3 digits on one specific type of sign, I was wondering whether someone has a model even smaller than the tiny model, to decrease inference time further. Currently I'm reading 2 to 4 images of 40x40 pixels in 25 ms. As I'm quite new to ML and training neural networks, maybe someone could share or help me find a "nano" version for my use case.

baudm commented 1 year ago

Thank you for your kind words.

The decoder is the most important part of PARSeq. A vanilla ViT was used for the encoder mainly for its simplicity and elegant fit with the overall model (and for all the advantages that Transformers have to offer).

The majority of the performance impact (memory, inference speed) comes from the encoder. For Transformers, given enough parallel processing cores, inference time becomes mainly a function of model depth. You may experiment with more lightweight backbones such as the MobileNet family of models, MobileViT, etc.; see the sketch below.
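
One way such an encoder swap could look (just a sketch under assumptions, not the repo's actual API): wrap a small CNN from timm so that its last feature map is flattened into the token sequence the PARSeq decoder cross-attends to. The class name `LightweightEncoder`, the choice of `mobilenetv3_small_100`, and the embedding width are illustrative assumptions.

```python
# Hypothetical sketch, not part of the PARSeq repo: a lightweight CNN backbone
# whose last feature map is flattened into tokens for the decoder to attend to.
import timm
import torch
import torch.nn as nn


class LightweightEncoder(nn.Module):
    def __init__(self, embed_dim: int = 192):
        super().__init__()
        # features_only=True returns intermediate feature maps; we keep the last one.
        # In practice you would set pretrained=True to start from ImageNet weights.
        self.backbone = timm.create_model('mobilenetv3_small_100',
                                          pretrained=False, features_only=True)
        last_channels = self.backbone.feature_info.channels()[-1]
        # 1x1 conv projects the CNN channels to the decoder's embedding width.
        self.proj = nn.Conv2d(last_channels, embed_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)[-1]             # (B, C, H', W')
        feat = self.proj(feat)                  # (B, embed_dim, H', W')
        return feat.flatten(2).transpose(1, 2)  # (B, H' * W', embed_dim)


if __name__ == '__main__':
    enc = LightweightEncoder()
    tokens = enc(torch.randn(1, 3, 32, 128))    # a 32x128 crop
    print(tokens.shape)                         # torch.Size([1, 4, 192])
```

Note that the token count then depends on the backbone's output stride rather than on a ViT patch size, so small crops yield only a handful of tokens.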

If you still want to use the existing architecture, just modify the hyperparameters, particularly enc_depth, enc_mlp_ratio, img_size, and patch_size. The first two modify the architecture itself, while the last two determine the number of tokens (more tokens, higher compute requirements).
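
For concreteness, here is a hypothetical set of "nano" hyperparameters written as a plain Python dict. The keys mirror the ones named above plus a few related model-config entries; the values are illustrative guesses, not a tested or official configuration.

```python
# Illustrative "nano" hyperparameters -- assumed values, not an official PARSeq variant.
nano_hparams = dict(
    img_size=(32, 64),    # smaller input than the default 32x128 -> fewer patch tokens
    patch_size=(4, 8),    # (32/4) * (64/8) = 8 * 8 = 64 tokens per image
    embed_dim=128,        # narrower embedding than the tiny variant
    enc_num_heads=2,      # head count must divide embed_dim
    enc_mlp_ratio=3,      # smaller MLP hidden size in each encoder block
    enc_depth=6,          # fewer encoder layers -> lower latency when depth-bound
    dec_num_heads=4,
    dec_mlp_ratio=4,
    dec_depth=1,          # PARSeq already uses a single decoder layer
)
```

With the default 32x128 input and 4x8 patches the encoder processes (32/4) x (128/8) = 128 tokens, so halving the image width alone already halves the token count. Any change to these hyperparameters of course requires retraining the model.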