Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

Issue when predicting numbers with text detector pre-trained model #49

Open glefundes opened 5 years ago

glefundes commented 5 years ago

Hello,

I'm experimenting with the text detector for license plate reading. When using the provided pre-trained model, the predictions almost always get the letters right, but fail completely on the numbers (example below). I suppose this is because the mjsynth dataset consists of words and has few or no examples with numbers. Is there any way to circumvent this problem? What is the best strategy? I suppose I could use a separate dataset for transfer learning with numbers, but I'm not sure how this could be done with this model.

Example image

Result:

OrderedDict([('NJBTITZ', [OrderedDict([('bottom_right', (58.11630630493164, 64.0)), ('top_left', (0.0, 4.503129959106445))]), OrderedDict([('bottom_right', (79.9063949584961, 64.0)), ('top_left', (25.79473304748535, 3.4267578125))]), OrderedDict([('bottom_right', (101.3300552368164, 64.0)), ('top_left', (48.9509391784668, 3.6787776947021484))]), OrderedDict([('bottom_right', (124.90372467041016, 64.0)), ('top_left', (71.83382415771484, 2.9951610565185547))]), OrderedDict([('bottom_right', (147.37278747558594, 64.0)), ('top_left', (93.49320220947266, 2.8387718200683594))]), OrderedDict([('bottom_right', (171.1338348388672, 64.0)), ('top_left', (114.92671966552734, 2.347005844116211))]), OrderedDict([('bottom_right', (194.53089904785156, 64.0)), ('top_left', (136.35043334960938, 2.496673583984375))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (156.9724578857422, 4.3870086669921875))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (173.5450439453125, 7.3213653564453125))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (188.04563903808594, 12.494049072265625))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (198.81907653808594, 16.197669982910156))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 18.550167083740234))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 19.862049102783203))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 20.653759002685547))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 21.19257164001465))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 21.57888412475586))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 21.864521026611328))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 22.08028793334961))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 22.24560546875))]), OrderedDict([('bottom_right', (200.0, 64.0)), 
('top_left', (200.0, 22.373645782470703))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 22.47317886352539))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 22.550762176513672))]), OrderedDict([('bottom_right', (200.0, 64.0)), ('top_left', (200.0, 22.611305236816406))])])])

Thank you in advance

EDIT: Ok, I found the --resume option. Sorry about that. Now, I have a couple of questions here (mainly due to the fact that I'm a beginner):

  1. I have an artificial dataset with upwards of 100k images. Would the performance be best using transfer learning on the provided model, or training a new one from scratch?
  2. What's the maximum/optimal batch size I should use with this network? Hardware-wise I can do with some pretty large batches, but as I understand, there's a point when large batch sizes hurt the predicted model.

Bartzi commented 5 years ago

Hi,

I think you are experiencing the problem with the numbers for two reasons:

  1. The network has not been trained with words that contain many numbers, so it has more or less overfitted to not predicting numbers.
  2. As we are using an LSTM for predicting the characters, the implicit language model in the LSTM is not used to predicting numbers in the way that would be necessary for correct predictions on your data.

So basically what you can do is retraining or fine-tuning, as you are already considering. I think fine-tuning makes more sense for your amount of data. Training from scratch might be quite difficult. Regarding the batch size: I'm not sure what the best/worst batch size is. I think going higher than 128 makes no sense and might also not work, because it uses too much space. A very good number is always something around 32, and if possible do not go below 20, because that is bad for BatchNorm. Apart from that I cannot really give you good advice, because we did not do any hyperparameter tuning on the batch size.

I hope that helps ;)

glefundes commented 5 years ago

Thank you so much. As an experiment I trained two models:

  1. From scratch: 20k iterations until the loss was around 0.15 and wouldn't go any lower, then more iterations with just the localization weights, as you suggested in another post. This got me to around 90% accuracy, but the model was still not very robust to image variations such as rotated examples.

  2. Transfer learning on the provided model for 15 epochs with a learning rate of 1e-4 (around 20k iterations as well). This was much more effective indeed! The already trained weights could locate characters more accurately (with spatial variations as well) and the recognition network picked up the numbers pretty well too.

Now, I have found a curious problem:

Whenever detecting words with repeated characters (such as ABC-1222), the output of the demo script cuts out the repeated characters and leaves only the first one (the output of the given example would be 'ABC-12'). You can see a real example here:

KFW-2444

OrderedDict([('KFW24', [OrderedDict([('top_left', (0.0, 14.073637008666992)), ('bottom_right', (49.4610710144043, 58.26472473144531))]), OrderedDict([('top_left', (14.820426940917969, 14.722162246704102)), ('bottom_right', (67.1063232421875, 57.84544372558594))]), OrderedDict([('top_left', (36.11481857299805, 15.273567199707031)), ('bottom_right', (88.75239562988281, 57.91472625732422))]), OrderedDict([('top_left', (57.102508544921875, 14.94267463684082)), ('bottom_right', (112.17118835449219, 58.878387451171875))]), OrderedDict([('top_left', (79.5019302368164, 13.340383529663086)), ('bottom_right', (132.2039794921875, 58.51348114013672))]), OrderedDict([('top_left', (103.36588287353516, 12.3841552734375)), ('bottom_right', (154.1340789794922, 58.5925178527832))]), OrderedDict([('top_left', (126.54853057861328, 12.413801193237305)), ('bottom_right', (176.0569305419922, 59.98551940917969))])])])

Bartzi commented 5 years ago

Nice that it worked that well =)

The problem you are experiencing right now is likely due to the fact that we use CTC-Loss for training the recognition network. This is a known problem with CTC because it tries to collapse multiple predictions of the same letter into one letter as long as there is no blank label prediction in between those predictions. You could try to circumvent this by using beam search decoding instead of greedy decoding, or by using independent softmax classifiers for each possible time step.
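To make the collapsing rule concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not code from this repository; the blank index `0` and the label sequences are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(labels, blank=BLANK):
    """Collapse runs of identical labels, then drop blanks (standard CTC rule)."""
    collapsed = [label for label, _ in groupby(labels)]
    return [label for label in collapsed if label != blank]


# Hypothetical per-time-step argmax labels; 0 is the blank, 2 stands for the digit '2'.
without_blanks = [1, 2, 2, 2]      # three 2s predicted back to back
with_blanks = [1, 2, 0, 2, 0, 2]   # a blank between each repetition

print(ctc_greedy_decode(without_blanks))  # [1, 2]
print(ctc_greedy_decode(with_blanks))     # [1, 2, 2, 2]
```

Repeated characters only survive decoding when the network emits a blank between them, which is why a sequence like 1222 can come out as 12.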

glefundes commented 5 years ago

Hm, I understand now. As far as I know, Chainer has no native beam search, so that would be a little more demanding to implement. Can you give me any pointers on how I would go about changing the code to use independent classifiers? As I said, I'm a beginner, so any light you could shed would be much appreciated.

Bartzi commented 5 years ago

Yes, Chainer does not have a native beam search implementation. I also found this unfortunate already^^ The funny thing is that I was apparently wrong in my last post. The code for training the text recognition model already uses independent softmax classifiers (see this line).

I think your problem lies in this line. The text recognition demo assumes that you used CTC to train the model, but if you did not, the code will still strip all repeated character occurrences. So if you comment out this line, it should work as expected.
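A small sketch of the effect described above, assuming one prediction per character slot. The `CHAR_MAP`, the `decode` helper and the label values are hypothetical and only mimic what the demo's post-processing appears to do; they are not the repository's code.

```python
from itertools import groupby

# Hypothetical mapping from class index to character; 0 is the blank/padding label.
CHAR_MAP = {0: "", 1: "K", 2: "F", 3: "W", 4: "2", 5: "4"}


def decode(labels, collapse_repeats):
    """Turn per-position class labels into a string."""
    if collapse_repeats:
        # CTC-style post-processing: keep only the first label of each run.
        labels = [label for label, _ in groupby(labels)]
    return "".join(CHAR_MAP[label] for label in labels)


# One prediction per character slot for the plate "KFW-2444" (dash not predicted).
predicted = [1, 2, 3, 4, 5, 5, 5, 0, 0]

print(decode(predicted, collapse_repeats=True))   # KFW24
print(decode(predicted, collapse_repeats=False))  # KFW2444
```

With independent classifiers each slot is already a separate character, so skipping the collapse step keeps the repeated digits.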

As a nice experiment, you could try it with the CTC loss and check what difference it makes. (use this class).

glefundes commented 5 years ago

You were right. Commenting the line and getting "classifications[0]" instead worked perfectly. Thank you! Finally, I'm trying to evaluate my model using the provided script, and while doing so I noticed that while the accuracy is quite satisfactory for optimal cases, the bounding boxes are not the best:

Examples

Are there steps I could take to improve bounding box accuracy/aspect ratio with respect to my data?

Edit: To better illustrate why this is a problem: the box can sometimes catch two characters at once, which results in wrong predictions because the wrong one is recognized.

Bartzi commented 5 years ago

Nice that it worked!

You are mentioning a problem that I also faced; so far I have not been able to resolve it in a satisfying manner. One of my first thoughts was to use "Inverse Compositional Spatial Transformers", but I abandoned the idea because it takes too long and too much memory, although it seems to work. This problem is still an open research question, maybe you can find a way?

santoshmo commented 5 years ago

@glefundes Are you using the train_text_recognition.py script with the model_190000.npz provided by Bartzi?

glefundes commented 5 years ago

Yes. I created my own dataset and used the provided model as a base for transfer learning.

harshalcse commented 5 years ago

@glefundes How did you create the ground truth CSV file?

glefundes commented 5 years ago

I based mine on the one provided by the author. All you need is a column for image paths and a column for ground truths. Don't forget to specify the max number of characters/words in the first row, as mentioned in the README (step 3 of training preparations).

You can write a simple python script using the csv package to do this automatically, by parsing whatever dataset/annotations you're using.

@Bartzi I don't suppose you've had any new ideas on how to handle the bounding box localization limitation we talked about previously? Just got back to the project I was implementing and I'm looking for a possible solution.

One thought I had was to handle this after the localization step but before the recognition, implementing some low-level image processing filter to refine the bounding boxes to whatever pattern I'm looking for (using CCA, histogram analysis or something like that to detect whether subregions were interesting or not) before passing them on to the recognition net. I don't know if it's possible since the networks are fused, but I'm a real newbie when it comes to chainer hahah.

harshalcse commented 5 years ago

@glefundes I have images in the following format: IMG_0541

and a ground truth file like this:

1 1
IMG_0082.JPG OMRHRW2850KP06041
IMG_0089.JPG MRHRW2850KP060420
IMG_0090.JPG XMRHR XMRHRW2850K
IMG_0299.JPG MRHRW18TOKP083013
IMG_0304.JPG MRHRW MRHRW1870KP
IMG_0308.JPG MRHRW1870KP083918
IMG_0315.JPG MRHRW1870KP083921
IMG_0319.JPG MRHRW1870KP083921
IMG_0320.JPG MRHRW1870KP083923
IMG_0324.JPG RHRU5830KP0602090
IMG_0327.JPG MRHRU5830KP060210
IMG_0330.JPG INRHRU5S3OKP06020

Is it correct?

Bartzi commented 5 years ago

@glefundes I'm still thinking and working on a better way for that. I think your idea can only work if you can do this bbox adjustment in a differentiable way, such that you are able to backpropagate the gradients from the recognition network to the localization network. But you could also just use this image processing as a post processing step after you run the network. But I think the best way would be to have an additional network or something that helps you with refining the box proposal.

@harshalcse Your GT file is not correct. Think about the following:

  1. Find the length of the longest word you have in your dataset and remember that number (let x be that number)
  2. Create the first line of your GT file like this: x 1
    • You may ask why: I said in the README that the first line should provide the following information:
      1. The first number gives the number of text lines or words (in this case we will handle each character as an independent word, although this is not the case in reality)
      2. the second number gives the number of characters per word/line. As we use each character as its own word, we only have one character per word, hence we write 1
  3. The rest should be okay like this.
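The steps above can be sketched with Python's csv module. This is only an illustration: the tab delimiter, the `write_gt_file` helper and the sample paths are assumptions, so compare the result with the GT file shipped with the repository before training.

```python
import csv
import io


def write_gt_file(handle, samples):
    """Write a tab-separated ground-truth file for (image_path, text) pairs."""
    max_length = max(len(text) for _, text in samples)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header row: number of "words" (longest label, one word per character)
    # and number of characters per word (always 1 in this setup).
    writer.writerow([max_length, 1])
    for image_path, text in samples:
        writer.writerow([image_path, text])


# Hypothetical samples; in practice these come from parsing your own annotations.
samples = [
    ("/data/plates/0.jpg", "ABC1222"),
    ("/data/plates/1.jpg", "KFW2444"),
]

buffer = io.StringIO()  # stands in for open("gt_word.csv", "w", newline="")
write_gt_file(buffer, samples)
print(buffer.getvalue())
```

Writing to a real file only requires replacing the `StringIO` buffer with an opened file handle.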

harshalcse commented 5 years ago

@glefundes

I have some samples as follows: IMG_0540, IMG_0541, IMG_0542, IMG_0543. I tried to run the following command:

python3 chainer/train_text_recognition.py /tmp/new/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map datasets/textrec/ctc_char_map.json --gpu 0 --snapshot-interval 1000

Error Stack :

Traceback (most recent call last):
  File "chainer/train_text_recognition.py", line 252, in <module>
    test_image = validation_dataset.get_example(0)[0]
  File "/root/see-master/chainer/datasets/file_dataset.py", line 142, in get_example
    labels = self.get_labels(self.labels[i])
  File "/root/see-master/chainer/datasets/file_dataset.py", line 158, in get_labels
    labels = [int(self.reverse_char_map[ord(character)]) for character in word]
  File "/root/see-master/chainer/datasets/file_dataset.py", line 158, in <listcomp>
    labels = [int(self.reverse_char_map[ord(character)]) for character in word]
KeyError: 77

please help

thanks

glefundes commented 5 years ago

@harshalcse This has to do with the set of characters in your char map. If I'm not mistaken, it differs from what the code expects. I'm not 100% sure though. Please double-check the char map you're using.

Bartzi commented 5 years ago

@glefundes is right, your char_map is not correct. It does not know which class to map to the character with the ASCII code 77 which is chr(77) == 'M'. Please have a look at this explanation for more info about the char_map.
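One way to find such gaps before training is to compare every character in the annotations against the char_map. The sketch below assumes the char_map is a JSON object mapping class indices (as strings) to Unicode code points; the shortened map and the `missing_characters` helper are hypothetical.

```python
import json

# Hypothetical, shortened char_map: class index (as string) -> Unicode code point.
char_map_json = '{"0": 9250, "1": 48, "2": 49, "3": 65, "4": 66}'
char_map = json.loads(char_map_json)

# Invert it the way the traceback suggests: code point -> class index.
reverse_char_map = {code_point: int(label) for label, code_point in char_map.items()}


def missing_characters(words, reverse_map):
    """Return the characters in `words` that have no entry in the char_map."""
    return sorted({ch for word in words for ch in word if ord(ch) not in reverse_map})


print(missing_characters(["AB01", "MAB1"], reverse_char_map))  # ['M']
```

An empty list means every character in the ground truth can be mapped to a class.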

harshalcse commented 5 years ago

@glefundes I tried to train on 795 images with the following command:

python3 chainer/train_text_recognition.py /tmp/small_dataset/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map /tmp/small_dataset/ctc_char_map.json --gpu 0 --snapshot-interval 1000

but no .npz file is generated in the log directory.

Please help me generate the .npz file.

glefundes commented 5 years ago

@harshalcse the .npz should be generated automatically at intervals defined by the --snapshot-interval argument. I see you defined it as 1000. Please check to see if you are stopping your model earlier than that and try to let it run for longer or reduce the interval.

harshalcse commented 5 years ago

@glefundes What is the meaning of --snapshot-interval?

Bartzi commented 5 years ago

@harshalcse The flag --snapshot-interval gives the interval at which a snapshot is taken. So if you set it to 1000, a snapshot of the current model will be created after every 1000 training iterations.
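As a rough back-of-the-envelope check for the numbers mentioned earlier in this thread (795 images, batch size 16), the following sketch estimates how long training must run before the first snapshot appears. It assumes a single worker and that a final partial batch counts as one iteration; both helper functions are made up for illustration.

```python
import math


def iterations_per_epoch(num_images, batch_size):
    """Number of training iterations needed to see every image once."""
    return math.ceil(num_images / batch_size)


def epochs_until_first_snapshot(num_images, batch_size, snapshot_interval):
    """How many epochs pass before `snapshot_interval` iterations are reached."""
    return snapshot_interval / iterations_per_epoch(num_images, batch_size)


per_epoch = iterations_per_epoch(795, 16)
print(per_epoch)                                   # 50
print(epochs_until_first_snapshot(795, 16, 1000))  # 20.0
```

Under these assumptions the first snapshot would only show up after about 20 epochs, so a shorter run or a smaller interval may be needed to see an .npz file.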

harshalcse commented 5 years ago

Right now a trainer_snapshot file is created inside the log directory, but I still don't understand when the .npz file gets created.

Bartzi commented 5 years ago

All I can say is that after snapshot_interval iterations, you should get a snapshot of the model (see this line of code)

harshalcse commented 5 years ago

Training on the dataset now works, but I get an error when I use the following char_map.json, which I created because I want to train the model for alphanumeric characters only:

{
    "0": 9250,
    "1": 48,
    "2": 49,
    "3": 50,
    "4": 51,
    "5": 52,
    "6": 53,
    "7": 54,
    "8": 55,
    "9": 56,
    "10": 57,
    "11": 45,
    "12": 65,
    "13": 66,
    "14": 67,
    "15": 68,
    "16": 69,
    "17": 70,
    "18": 71,
    "19": 72,
    "20": 74,
    "21": 75,
    "22": 76,
    "23": 77,
    "24": 78,
    "25": 80,
    "26": 82,
    "27": 83,
    "28": 84,
    "29": 85,
    "30": 86,
    "31": 87,
    "32": 88,
    "33": 89,
    "34": 90
}

My gt_word.csv file looks like this:

17      1
/root/small_dataset_2/9999/0.JPG        MRHDG1840KP033812
/root/small_dataset_2/9999/1.JPG        MRHRW2840KP060067
/root/small_dataset_2/9999/2.JPG        MRHDG1847KP033824
/root/small_dataset_2/9999/3.JPG        MRHRW2850KP062158
/root/small_dataset_2/9999/5.JPG        MRHDG1840KP032255
/root/small_dataset_2/9999/6.JPG        MRHRW6830KP102532
/root/small_dataset_2/9999/7.JPG        MRHRU5870KP101363
/root/small_dataset_2/9999/9.JPG        MRHRU5850KP100742
/root/small_dataset_2/9999/10.JPG       MRHRW1850KP081060
/root/small_dataset_2/9999/11.JPG       MRHDG1845KP032378

but got following error

  format(optimizer.eps))
Exception in main training loop: '35'
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
    loss = _calc_loss(self._master, batch)
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
    return model(*in_arrays)
  File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 48, in __call__
    reported_accuracies = self.accfun(self.y, t)
  File "/root/see-master/chainer/metrics/textrec_metrics.py", line 47, in calc_accuracy
    word = "".join(map(self.label_to_char, word))
  File "/root/see-master/chainer/metrics/loss_metrics.py", line 181, in label_to_char
    return chr(self.char_map[str(label)])
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "chainer/train_text_recognition.py", line 299, in <module>
    trainer.run()
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
    loss = _calc_loss(self._master, batch)
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
    return model(*in_arrays)
  File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 48, in __call__
    reported_accuracies = self.accfun(self.y, t)
  File "/root/see-master/chainer/metrics/textrec_metrics.py", line 47, in calc_accuracy
    word = "".join(map(self.label_to_char, word))
  File "/root/see-master/chainer/metrics/loss_metrics.py", line 181, in label_to_char
    return chr(self.char_map[str(label)])
KeyError: '35'

Please help me out in that issue .

harshalcse commented 5 years ago

Even at 100 epochs I get the same issue: the message "Training curriculum has finished. Terminating the training process." appears.

55 5000 4.00388 0 9.96634e-09 3.96396 0 3.96532 0
total [##################################................] 69.23%
this epoch [###################...............................] 38.25%
5000 iter, 55 epoch / 80 epochs
0.097178 iters/sec. Estimated time to finish: 6:21:10.341421.
enlarging datasets
Training curriculum has finished. Terminating the training process.

Bartzi commented 5 years ago

You can disable the code that causes this and you won't have that problem anymore...

harshalcse commented 5 years ago

@Bartzi @glefundes The above issue was resolved by increasing the batch size to 128. But how can I identify that? I just want to train my model on alphanumeric characters, but I also want to train using ctc_char_map.json from textrec. My modified char_map gives the following error.

python3 chainer/train_text_recognition.py /root/small_dataset_51/curriculum.json log --blank-label 0 -b 128 --is-trainer-snapshot --char-map /root/small_dataset_51/ctc_char_map_new.json -g 0 -si 1000 -dr 0.2 -e 5 -lr 1e-8 --zoom 0.9 --area-factor 0.1

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 401, in fetch_batch
    batch_ret[0] = [self.dataset[idx] for idx in indices]
  File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 401, in <listcomp>
    batch_ret[0] = [self.dataset[idx] for idx in indices]
  File "/usr/lib/python3.5/site-packages/chainer/dataset/dataset_mixin.py", line 67, in __getitem__
    return self.get_example(index)
  File "/root/see-master/chainer/datasets/file_dataset.py", line 144, in get_example
    labels = self.get_labels(self.labels[i])
  File "/root/see-master/chainer/datasets/file_dataset.py", line 163, in get_labels
    labels = [int(self.reverse_char_map[ord(character)]) for character in word]
  File "/root/see-master/chainer/datasets/file_dataset.py", line 163, in <listcomp>
    labels = [int(self.reverse_char_map[ord(character)]) for character in word]
KeyError: 32

Exception in main training loop: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 232, in update_core
    batch = iterator.next()
  File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 148, in __next__
    self.dataset_timeout)
  File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 417, in measure
    self.mem_size = max(map(_measure, batch))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "chainer/train_text_recognition.py", line 299, in <module>
    trainer.run()
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 232, in update_core
    batch = iterator.next()
  File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 148, in __next__
    self.dataset_timeout)
  File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 417, in measure
    self.mem_size = max(map(_measure, batch))
TypeError: 'NoneType' object is not iterable

please help

harshsp31 commented 5 years ago

@harshalcse Your data points are very similar. Maybe try increasing the dropout ratio and using a batch size of 64. Also, if possible, convert your labels to lowercase for fine-tuning so that you can directly use the pre-trained model, and then convert them back to uppercase after the predictions.

Bartzi commented 5 years ago

@harshsp31 thanks for bumping this issue. I totally forgot to answer the last question :sweat_smile:

@harshalcse Please have a close look at the error you got. The first exception tells you that one of your words contains a character that converts to the ASCII code 32. If you have a look at a code table, you will see that 32 is the code for the space character. The space character is apparently not in your char_map, and that is why it does not work. You have two options:

  1. add the space character and any other missing character to your char_map, or
  2. delete this character from your annotations.
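Both options can be sketched in a few lines of Python. The helpers below are hypothetical and assume the char_map maps contiguous class indices (as strings, starting at "0") to code points. Note that adding classes presumably changes the number of outputs the recognition network needs, which may not match pre-trained weights.

```python
def add_missing_characters(char_map, words):
    """Option 1: append every unseen character as a new class at the end."""
    known = set(char_map.values())
    extended = dict(char_map)
    for ch in sorted({ch for word in words for ch in word}):
        if ord(ch) not in known:
            # Relies on keys being contiguous, so len() is the next free index.
            extended[str(len(extended))] = ord(ch)
    return extended


def strip_unknown_characters(char_map, word):
    """Option 2: drop characters from an annotation that the char_map lacks."""
    known = set(char_map.values())
    return "".join(ch for ch in word if ord(ch) in known)


char_map = {"0": 9250, "1": 65, "2": 66}  # hypothetical: blank, 'A', 'B'

print(add_missing_characters(char_map, ["A B"]))  # {'0': 9250, '1': 65, '2': 66, '3': 32}
print(strip_unknown_characters(char_map, "A B"))  # AB
```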

harshsp31 commented 5 years ago

@Bartzi I have a small doubt. I know this is not the right place to ask this, but I didn't want to create a separate issue for this. How does the code divide the validation set from the train set if I give the same data folder as train and validation set in curriculum.json?

Bartzi commented 5 years ago

Yes, you are right this might not be the right place to ask this :wink:, but the answer is: it doesn't.