Could training Parseq directly on image data containing text content sampled from the 400 million images used to train CLIP potentially yield better results?

VamosC / CLIP4STR

An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".

Apache License 2.0

90 stars 12 forks source link

Closed Apostatee closed 1 month ago