An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".
Apache License 2.0
90 stars, 12 forks
Could training PARSeq directly on text-containing images sampled from the 400 million images used to train CLIP yield better results? #16