Bartzi / stn-ocr

Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition
https://arxiv.org/abs/1707.08831
GNU General Public License v3.0
499 stars, 137 forks

2 details questions #29

Closed: caoyangcr7 closed this issue 5 years ago

caoyangcr7 commented 5 years ago

@Bartzi Sorry to bother you again, I have two more questions. The first is still about N: because different training images may contain different numbers of words or characters, will N change during training? When I looked at the source code, I found that N is set by the num_time_steps param. If N stays the same during training, what should we do when N is larger than the number of words or characters? The second question is about the recognition network. When we get N text regions from the original image after the sample network, how do we find the corresponding label for each text region during training? For example, if we get 2 text regions '16' and '18', and we have 2 labels '16' and '18', how can we choose label '16' for text region '16' instead of '18' during network training? Looking forward to your reply, thanks.

caoyangcr7 commented 5 years ago

N is the number of STN transformation matrices in the paper.

Bartzi commented 5 years ago

Hi,

N does not change during training. It is set to the maximum number of text regions you want your network to extract. Naturally, it will happen that some words are shorter than N. In this case you must make sure that these extra timesteps are labelled with the blank label, so that the network learns to predict the correct number of characters/words in the image.
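A minimal sketch of that padding step, assuming labels are stored as integer class ids and that the blank label has id 0. Both the helper name `pad_labels` and the blank id are hypothetical and chosen for illustration; the actual value depends on how the training data and loss are configured.

```python
from typing import List

BLANK_LABEL = 0  # assumption: id reserved for the blank label


def pad_labels(labels: List[int], num_timesteps: int,
               blank: int = BLANK_LABEL) -> List[int]:
    """Pad a label sequence with the blank label up to num_timesteps (N)."""
    if len(labels) > num_timesteps:
        raise ValueError("more labels than timesteps; increase N")
    # Every timestep without a ground-truth label gets the blank label.
    return list(labels) + [blank] * (num_timesteps - len(labels))


# Two labelled words, but the network always makes N = 3 predictions.
print(pad_labels([16, 18], num_timesteps=3))  # [16, 18, 0]
```

With N = 3 and only two ground-truth labels, the third timestep is trained to predict blank.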

Let's have a look at your example: Let's assume the 16 is in the top-left corner of the image and the 18 is in the bottom-right corner of the image. Since we always assume that we read from left to right and top to bottom, we would say that the first label is 16 and the second label is 18. With this way of defining our labels, we tell the network to always put the first prediction close to the top-left-most word and to place all other predictions following the reading direction.
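As an illustration of this label ordering, here is a rough sketch that sorts ground-truth annotations into reading order before they are written out as labels. The `(text, x, y)` annotation format and the function name are assumptions made for this example, and sorting by `(y, x)` is a simplification that only works well when words on the same line share roughly the same y coordinate.

```python
from typing import List, Tuple

# Hypothetical annotation format: (text, x, y) of the word's top-left corner.
Annotation = Tuple[str, float, float]


def labels_in_reading_order(annotations: List[Annotation]) -> List[str]:
    """Order labels top to bottom, then left to right."""
    ordered = sorted(annotations, key=lambda a: (a[2], a[1]))
    return [text for text, _, _ in ordered]


# '18' sits in the bottom-right corner, '16' in the top-left corner.
annotations = [("18", 200.0, 150.0), ("16", 10.0, 20.0)]
print(labels_in_reading_order(annotations))  # ['16', '18']
```

Because the labels are always stored in this order, the first prediction is consistently trained against the top-left-most word.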

caoyangcr7 commented 5 years ago

@Bartzi, thanks for your reply. Maybe I didn't explain question 1 clearly. As for question 2, I got it. About N, it seems that you explain it the way CTC loss works. My question is this: say N is set to 3, which means we get 3 predicted text regions. If there are only 2 ground-truth labels, like "16" and "18" as above, do you mean that we must make sure the label of the extra predicted region is blank? Am I right?

Bartzi commented 5 years ago

Yes, that's it!

caoyangcr7 commented 5 years ago

@Bartzi Thanks a lot! Hope everything goes well with you!