baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
https://huggingface.co/spaces/baudm/PARSeq-OCR
Apache License 2.0

Thinner pre-trained model of tiny version #107

Closed — clrk7 closed this issue 10 months ago

clrk7 commented 1 year ago

Hello There!

First of all, I'm really impressed by how PARSeq works on my machine. I get a lot more correct outputs than I did with EasyOCR or PP-OCR. Since I only detect numbers of up to 3 digits on one specific type of sign, I was wondering whether someone has a model even smaller than the tiny model, to decrease inference time further. Currently I'm reading 2 to 4 images of 40x40 pixels in 25 ms. As I'm quite new to ML and training neural networks, maybe someone could share or help me find a "nano" version for my use case.

baudm commented 1 year ago

Thank you for your kind words.

The decoder is the most important part of PARSeq. A vanilla ViT was used for the encoder mainly for its simplicity and elegant fit with the overall model (and for all the advantages that Transformers have to offer).

The majority of the performance impact (memory, inference speed) comes from the encoder. For Transformers, given enough parallel processing cores, inference time becomes mainly a function of model depth. You may experiment with more lightweight backbones such as the MobileNet family of models, MobileViT, etc.; see the sketch below.
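
One way such an encoder swap could look (just a sketch under assumptions, not the repo's actual API): wrap a small CNN from timm so that its last feature map is flattened into the token sequence the PARSeq decoder cross-attends to. The class name `LightweightEncoder`, the choice of `mobilenetv3_small_100`, and the embedding width are illustrative assumptions.

```python
# Hypothetical sketch, not part of the PARSeq repo: a lightweight CNN backbone
# whose last feature map is flattened into tokens for the decoder to attend to.
import timm
import torch
import torch.nn as nn


class LightweightEncoder(nn.Module):
    def __init__(self, embed_dim: int = 192):
        super().__init__()
        # features_only=True returns intermediate feature maps; we keep the last one.
        # In practice you would set pretrained=True to start from ImageNet weights.
        self.backbone = timm.create_model('mobilenetv3_small_100',
                                          pretrained=False, features_only=True)
        last_channels = self.backbone.feature_info.channels()[-1]
        # 1x1 conv projects the CNN channels to the decoder's embedding width.
        self.proj = nn.Conv2d(last_channels, embed_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)[-1]             # (B, C, H', W')
        feat = self.proj(feat)                  # (B, embed_dim, H', W')
        return feat.flatten(2).transpose(1, 2)  # (B, H' * W', embed_dim)


if __name__ == '__main__':
    enc = LightweightEncoder()
    tokens = enc(torch.randn(1, 3, 32, 128))    # a 32x128 crop
    print(tokens.shape)                         # torch.Size([1, 4, 192])
```

Note that the token count then depends on the backbone's output stride rather than on a ViT patch size, so small crops yield only a handful of tokens.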

If you still want to use the existing architecture, just modify the hyperparameters, particularly enc_depth, enc_mlp_ratio, img_size, and patch_size. The first two modify the architecture itself, while the last two determine the number of tokens (more tokens, higher compute requirements).
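
For concreteness, here is a hypothetical set of "nano" hyperparameters written as a plain Python dict. The keys mirror the ones named above plus a few related model-config entries; the values are illustrative guesses, not a tested or official configuration.

```python
# Illustrative "nano" hyperparameters -- assumed values, not an official PARSeq variant.
nano_hparams = dict(
    img_size=(32, 64),    # smaller input than the default 32x128 -> fewer patch tokens
    patch_size=(4, 8),    # (32/4) * (64/8) = 8 * 8 = 64 tokens per image
    embed_dim=128,        # narrower embedding than the tiny variant
    enc_num_heads=2,      # head count must divide embed_dim
    enc_mlp_ratio=3,      # smaller MLP hidden size in each encoder block
    enc_depth=6,          # fewer encoder layers -> lower latency when depth-bound
    dec_num_heads=4,
    dec_mlp_ratio=4,
    dec_depth=1,          # PARSeq already uses a single decoder layer
)
```

With the default 32x128 input and 4x8 patches the encoder processes (32/4) x (128/8) = 128 tokens, so halving the image width alone already halves the token count. Any change to these hyperparameters of course requires retraining the model.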