ground truth format.

qnkhuat commented 6 years ago

First of all, thank you for sharing the project.

I want to train a model with my own images and wonder what the format of the ground truth is.

BTW, my text looks like "vn324kl21lsfda". Which version do you suggest I use: FSNS, SVHN, or text recognition?

Bartzi commented 6 years ago

Hi,

the format of the groundtruth nearly always adheres to the following structure:

- a csv file using tabs as separator (so actually a tsv file)
- the first line must consist of two numbers
  1. the first number states the number of text regions that the localization network should detect (please note that this should be the maximum number of text regions, so if some samples have fewer text regions, the model will have to learn to cope with that)
  2. the second number states the number of characters in each text region (this is also the maximum number of characters in each text region)
- from then on you will have to list the name of each file and the corresponding labels
  1. the first column gives the absolute path to the input image
  2. the other columns provide the labels for each character

Consider the following image (not shown here) and the following lines from the groundtruth:
```
3 4
<path to image> 7 5 10 10 2 10 10 10 10 10 10 10
```
If you look closely you can see that:
  - all images can have a maximum of three text lines, and each text line can contain up to 4 characters
  - the image only has two text lines and a maximum of two characters per text region
  - that is why the first text region is labeled with 7 5 10 10: this label contains the number as such (75) and also two blank labels telling the network to predict the "no character" class after the 75. The same is true for the second line. The third line is completely empty, so the network will learn to predict "no character" for this line.
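
The padding scheme described above can be sketched in a few lines of Python. This is only an illustration, not code from the repository: the helper names `pad_labels` and `write_groundtruth` are made up here, and the blank class `10` is taken from the example (your own char map may use a different value).

```python
import csv

BLANK = 10  # "no character" class used in the example above (an assumption for other datasets)


def pad_labels(regions, max_regions, max_chars, blank=BLANK):
    """Flatten per-region labels, padding short and missing regions with blanks."""
    if len(regions) > max_regions:
        raise ValueError("more text regions than the header allows")
    flat = []
    for i in range(max_regions):
        region = list(regions[i]) if i < len(regions) else []
        if len(region) > max_chars:
            raise ValueError("more characters in a region than the header allows")
        flat.extend(region + [blank] * (max_chars - len(region)))
    return flat


def write_groundtruth(path, samples, max_regions, max_chars):
    """Write a tab-separated groundtruth file: header line first, then one row per image."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([max_regions, max_chars])
        for image_path, regions in samples:
            writer.writerow([image_path] + pad_labels(regions, max_regions, max_chars))


# The sample from above: two text lines ("75" and "2"), at most 3 lines of 4 characters.
print(pad_labels([[7, 5], [2]], max_regions=3, max_chars=4))
# → [7, 5, 10, 10, 2, 10, 10, 10, 10, 10, 10, 10]
```

Calling `write_groundtruth("gt.csv", [("/abs/path/to/image.png", [[7, 5], [2]])], 3, 4)` would then produce a file with the same two lines as the example, separated by tabs.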

The method/version you should choose depends on what your images look like. Are the text lines already cropped from an image? Then you should use text recognition. Is it an unprocessed input image with some text lines? Then you should have a look at FSNS/SVHN and try to work with that code, adapting it to your use case. Each method actually uses the same network architecture in the background. The main differences are the sizes of the filters and the preprocessing that needs to be done.

qnkhuat commented 6 years ago

To be sure: my data consists of cropped images, and each usually has 17 chars. E.g. one image has the text vn324kl21lsfda, so my gt file is:

3 17
v n 3 2 4 k l 2 1 l s f d a 10 10 10 10 10 10 10 10

Am I right here?

Bartzi commented 6 years ago

You are nearly right. For the text recognition model it needs to look like this:

1 17
/home/name/pictures/image_1.jpg vn324kl21lsfda

The code loading the image will prepare the labels for you in this case.
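
As a rough idea of what such label preparation could look like (this is a sketch, not the repository's actual loading code): each character is mapped to a class index and the result is padded with the blank label up to the maximum length. The function `encode_label` and the char map built below are hypothetical; the real char map is whatever JSON file you pass via `--char-map`.

```python
import string


def encode_label(text, char_to_class, max_length, blank_label=0):
    """Map each character to its class index and pad with blanks to max_length."""
    if len(text) > max_length:
        raise ValueError(f"label {text!r} is longer than {max_length} characters")
    try:
        classes = [char_to_class[c] for c in text]
    except KeyError as err:
        raise ValueError(f"character {err.args[0]!r} is not in the char map") from None
    return classes + [blank_label] * (max_length - len(classes))


# Hypothetical char map: class 0 is the blank label, 1..36 are digits and lowercase letters.
alphabet = string.digits + string.ascii_lowercase
char_to_class = {c: i + 1 for i, c in enumerate(alphabet)}

labels = encode_label("vn324kl21lsfda", char_to_class, max_length=17)
print(len(labels))   # 17: 14 characters plus 3 blank labels
print(labels[-3:])   # [0, 0, 0]
```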

qnkhuat commented 6 years ago

Thank you so much. I will try it and let you know the result.

qnkhuat commented 6 years ago

Hi, sorry to bother you again. This is my gt file:

1 17
/home/dev02/see/datasets/vin/images/background1.png 3j21323123jfdsf

This is my command:

python train_text_recognition.py /home/dev2/see/datasets/vin/path.json /home/dev2/see/datasets/vin/log --char-map ../datasets/fsns/fsns_char_map.json -r ../datasets/model/model_190000.npz --blank-label 0 -b 1

and I've got this error

I've checked and I think that my gt file is wrong.

Bartzi commented 6 years ago

Yes you are right! I'm sorry, it should be 17 1 in the first line instead of 1 17, because you are actually predicting 17 text regions with one character each.

But this will probably not completely fix your problem. The code is complaining that the input shape is not okay (which so far makes sense), but the second number in "1 != 15" should be 17 in your case, not 15. I'm not sure why this happens, but it might be that the labels are not correctly padded after loading. Maybe you need to investigate this, too.
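
One way to start investigating is to compare the label lengths in the gt file with the header. The label in the file above is 15 characters long, which would be consistent with the padding guess. The following is a small standalone sketch, assuming the text recognition layout of one path and one label string per row; `check_groundtruth` is a made-up helper, not part of the repository.

```python
import csv


def check_groundtruth(path):
    """Return the expected label length and a list of (line number, message) findings."""
    report = []
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        num_regions, num_chars = int(header[0]), int(header[1])
        expected = num_regions * num_chars
        for line_no, row in enumerate(reader, start=2):
            if not row:
                continue  # skip empty lines
            if len(row) != 2:
                report.append((line_no, "expected <path><TAB><label>"))
                continue
            length = len(row[1])
            if length > expected:
                report.append((line_no, f"label has {length} characters, more than {expected}"))
            elif length < expected:
                report.append((line_no, f"label has {length} characters, needs padding to {expected}"))
    return expected, report
```

Rows that show up as "needs padding" are not wrong in themselves, but they are the ones to watch if the loaded label array ends up shorter than the header says.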

qnkhuat commented 6 years ago

Thanks for your help. I can train now, but while training I encountered another problem. I've created a new issue here: #11

Bartzi / see

ground truth format. #9