Hi,
The DAN encoder is first trained on text-line recognition inside another model architecture, using adaptive max pooling and the CTC loss. It is then reused, via transfer learning, to train the DAN on document images.
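Just to make that concrete, here is a minimal PyTorch sketch of what such a line-level pretraining stage could look like. All the names here (`LinePretrainingModel`, the stand-in encoder, the vocabulary size) are hypothetical illustrations under my reading of the paper, not the actual repo code:

```python
import torch
import torch.nn as nn

class LinePretrainingModel(nn.Module):
    """Wraps a CNN encoder for line-level CTC pretraining: the 2D feature
    map is collapsed to a 1D sequence with adaptive max pooling, then
    projected to per-frame character logits."""

    def __init__(self, encoder: nn.Module, feat_channels: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                        # reused by the DAN afterwards
        self.pool = nn.AdaptiveMaxPool2d((1, None))   # collapse height, keep width
        self.head = nn.Linear(feat_channels, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, images):                # images: (B, 3, H, W)
        feats = self.encoder(images)          # (B, C, H', W')
        feats = self.pool(feats).squeeze(2)   # (B, C, W')
        feats = feats.permute(2, 0, 1)        # (T=W', B, C), the layout CTCLoss expects
        return self.head(feats).log_softmax(-1)

# Toy usage with a stand-in encoder (the real one is a deeper FCN):
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
model = LinePretrainingModel(encoder, feat_channels=64, vocab_size=80)
log_probs = model(torch.randn(2, 3, 64, 512))        # (T, 2, 81)
ctc = nn.CTCLoss(blank=80)                           # blank is the extra class
targets = torch.randint(0, 80, (2, 20))
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), log_probs.size(0)),
           target_lengths=torch.full((2,), 20))
# After convergence, model.encoder initializes the DAN encoder (transfer learning).
```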
DAN could recognize such complex examples (newspapers) given appropriate training samples and an additional token for pictures.
It seems to be a really interesting work. At first sight, as with other self-supervised learning (SSL) approaches, my worries are about the nature of the task.
SSL is mainly carried out for image classification; there, the model can learn object representations through transformation techniques because there is, broadly, one object of interest per sample. For HTR, a single example contains multiple characters, and each one must be recognized, so I think it could confuse the representation learning of the individual characters.
1) Training samples must be as varied as possible in terms of content (character sequences) and/or layout if the unseen data have an unconstrained layout. The reading order must also be consistent across all the training samples for the model to really learn what it means to read a document.
2) I have only tested the model on the two HTR datasets presented in the paper.
I think your idea is close to this work.
So yes, it should work too.
Have a good day.
@FactoDeepLearning Hi there,
I had a look at the published paper; correct me if I am wrong:
So DAN is trained on text-line recognition, and based on those learned features it is able to detect and locate those same features in document images?
Can DAN recognize text in complex document images that contain multiple fonts, font sizes, text orientations, slopes, and also pictures? An example would be a newspaper.