Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

Failed on load weights + Zero Accuracy problems #44

Open rezha130 opened 6 years ago

rezha130 commented 6 years ago

Hi @Bartzi

I have already successfully trained on my custom data set (loss below 0.01), running until the last epoch with this command:

python train_text_recognition.py mytrain/curriculum.json log \
--blank-label 0 \
--batch-size 64 \
--is-trainer-snapshot \
--use-dropout \
--char-map mytrain/ctc_char_map.json \
--gpu 0 \
--snapshot-interval 1000 \
--dropout-ratio 0.2 \
--epoch 200 \
-lr 0.0001

Then I copied all result files from log to the mytrain folder.

But when I try a specific npz model file with this command:

python text_recognition_demo.py mytrain model_42000.npz mytrain/image/00001.jpg mytrain/ctc_char_map.json --gpu 0

the command fails while loading the weights file:

File "text_recognition_demo.py", line 158, in <module>
    chainer.serializers.NpzDeserializer(f).load(network)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
    obj.serialize(self)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/link.py", line 954, in serialize
    d[name].serialize(serializer[name])
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/link.py", line 954, in serialize
    d[name].serialize(serializer[name])
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/link.py", line 612, in serialize
    data = serializer(name, param.data)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 151, in __call__
    value.set(numpy.asarray(dataset, dtype=value.dtype))
  File "cupy/core/core.pyx", line 1696, in cupy.core.core.ndarray.set
  File "cupy/core/core.pyx", line 1712, in cupy.core.core.ndarray.set
ValueError: Shape mismatch. Old shape: (52,), new shape: (72,)

FYI, before training I upgraded to chainer 4.2.0 to enable cuDNN with cupy-cuda90 4.2.0. Is that a problem?

>>> import chainer
>>> chainer.cuda.available
True
>>> chainer.cuda.cudnn_enabled
True

Please help.

Bartzi commented 6 years ago

@rezha130 I think your problem is this line; you should replace 52 with 72. Your char_map is different from the one I've been using. This problem could be fixed in the same way as was done in PR #41.
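
A quick way to confirm this kind of mismatch is to compare the number of entries in the char_map with the array shapes stored in the snapshot. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the .npz file is a plain NumPy archive, which is what the NpzDeserializer call in the traceback reads.

```python
import json

import numpy as np


def count_classes(char_map_path):
    """Return the number of entries in a char_map JSON file."""
    with open(char_map_path) as fp:
        return len(json.load(fp))


def arrays_with_leading_size(npz_path, size):
    """Map array name -> shape for every stored array whose first axis equals `size`."""
    matches = {}
    with np.load(npz_path) as snapshot:
        for name in snapshot.files:
            shape = snapshot[name].shape
            if shape and shape[0] == size:
                matches[name] = shape
    return matches
```

For example, arrays_with_leading_size("model_42000.npz", 72) would list the parameters that were saved with 72 outputs. If count_classes on the char_map returns the same number, the hard-coded 52 is most likely the value that has to change.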

rezha130 commented 6 years ago

@Bartzi thanks for the quick reply, now I can test my model.

But another problem appears:

python text_recognition_demo.py mytrain model_42000.npz mytrain/image/00001.jpg mytrain/ctc_char_map.json --gpu 0

gives this result:

OrderedDict([('Numbers',
              [OrderedDict([('top_left', (10.153651237487793, 0.0)),
                            ('bottom_right', (188.42269897460938, 64.0))]),
               OrderedDict([('top_left', (9.257012367248535, 0.0)),
                            ('bottom_right', (188.95077514648438, 64.0))]),
               OrderedDict([('top_left', (9.751701354980469, 0.0)),
                            ('bottom_right', (189.06959533691406, 64.0))]),
               OrderedDict([('top_left', (16.02237892150879, 0.0)),
                            ('bottom_right', (188.70294189453125, 64.0))]),
               OrderedDict([('top_left', (23.43842315673828, 0.0)),
                            ('bottom_right', (188.17893981933594, 64.0))]),
               OrderedDict([('top_left', (30.188858032226562, 0.0)),
                            ('bottom_right', (187.6661376953125, 64.0))]),
               OrderedDict([('top_left', (35.84349822998047, 0.0)),
                            ('bottom_right', (187.2195281982422, 64.0))]),
               OrderedDict([('top_left', (40.32756805419922, 0.0)),
                            ('bottom_right', (186.85638427734375, 64.0))]),
               OrderedDict([('top_left', (43.758575439453125, 0.0)),
                            ('bottom_right', (186.5736083984375, 64.0))]),
               OrderedDict([('top_left', (46.3254280090332, 0.0)),
                            ('bottom_right', (186.35931396484375, 64.0))]),
               OrderedDict([('top_left', (48.2197265625, 0.0)),
                            ('bottom_right', (186.19967651367188, 64.0))]),
               OrderedDict([('top_left', (49.60652542114258, 0.0)),
                            ('bottom_right', (186.08200073242188, 64.0))]),
               OrderedDict([('top_left', (50.614906311035156, 0.0)),
                            ('bottom_right', (185.99632263183594, 64.0))]),
               OrderedDict([('top_left', (51.347171783447266, 0.0)),
                            ('bottom_right', (185.93399047851562, 64.0))]),
               OrderedDict([('top_left', (51.879066467285156, 0.0)),
                            ('bottom_right', (185.8885955810547, 64.0))])])])

I expected to get more words rather than only the first word.

Bartzi commented 6 years ago

Did you check those 2 lines? And adjust them to your case?

Bartzi commented 6 years ago

Your groundtruth is not necessary for using the demo script, but it looks okay to me.

Your problem is that you are using a script that is designed for printing only one word. I'm not 100% sure, but I think that this line could be the solution. Remove the [0].

rezha130 commented 6 years ago

@Bartzi I remove [0] and get this error:

Traceback (most recent call last):
  File "text_recognition_demo.py", line 181, in <module>
    word = "".join(map(lambda x: chr(char_map[str(x)]), word))
  File "text_recognition_ktp.py", line 181, in <lambda>
    word = "".join(map(lambda x: chr(char_map[str(x)]), word))
KeyError: '[33 28 30]'

mit456 commented 6 years ago

@Bartzi Can we create a word-based ground truth file, as @rezha130 has mentioned? Until now I have been following the csv data structure; isn't that the only way of showing ground truth to the network?

@rezha130 I am really confused. Can you tell me the steps you followed to build your custom dataset? I would be very grateful.

Bartzi commented 6 years ago

If you use train_text_recognition you can use a word-based ground truth file... oops, yeah, that is a little different from the other scripts... hmm, I'm sorry for that...

rezha130 commented 6 years ago

Compared to my previous training run, the differences (time_step = 15 and max_char = 16) are in these lines of my create_train.py script:

max_bound_box = "15"
max_chars = "16"
for row in result:
    image_name = row[0]
    label = row[1]
    file.write(os.path.join(train_dir,image_name)+"\t"+ label.replace(" ","\t") +"\n") 

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:()[];&+-/'.,0123456789"  # without white space

Bartzi commented 6 years ago

I remove [0] and get this error:

Traceback (most recent call last):
  File "text_recognition_demo.py", line 181, in <module>
    word = "".join(map(lambda x: chr(char_map[str(x)]), word))
  File "text_recognition_ktp.py", line 181, in <lambda>
    word = "".join(map(lambda x: chr(char_map[str(x)]), word))
KeyError: '[33 28 30]'

You did expect this, didn't you? Once you remove [0] you of course will get an iterable where there was none before... so you'll need to add a loop to the code.
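
A minimal sketch of such a loop, assuming the prediction is a sequence of per-word label sequences (as the '[33 28 30]' key in the error suggests) and that char_map maps label strings to Unicode code points, as in the demo script. The function name and the blank stripping are illustrative assumptions, not the repository's code.

```python
def decode_words(predicted_words, char_map, blank_label=0):
    """Convert a sequence of per-word class labels into a list of strings."""
    words = []
    for word in predicted_words:
        chars = []
        for label in word:
            label = int(label)
            if label == blank_label:
                continue  # assumption: blank labels carry no character
            chars.append(chr(char_map[str(label)]))
        words.append("".join(chars))
    return words


char_map = {"0": 9250, "1": 72, "2": 105}
print(decode_words([[1, 2, 0], [2, 0, 0]], char_map))  # ['Hi', 'i']
```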

rezha130 commented 6 years ago

Yes @Bartzi, you're right. But I'm pretty sure that the model only predicts the first word and neglects all following words in the sequence (word-based gt & tab delimited). This can be seen in the rendered bbox images in the log/boxes folder; the result is just the first word.

Since you said that the train_text_recognition script is designed to recognize only one word, I tried to adjust my custom ground truth files & training approach to that constraint. Now the rendered bbox images show that the model has learnt to recognize all characters defined in the char_map in the image (FYI, the first word is like a title for specific data values; it is repeated in every training image, so the model can predict it easily), but it looks like the training process needs more epochs to improve. I can wait for that, since the loss tends to decrease slowly.

Bartzi commented 6 years ago

The first thing I see is that the predicted bboxes don't look good at all. They should change positions after a while; see the text recognition video from this file.

Furthermore, did you have a close look at the implementation of the dataset loader (here)? Delimiting with tab does not make sense. Sorry, if I misunderstood one of your posts regarding the layout of your groundtruth file.

If you struggle with the groundtruth format, you can also create your own dataset loader! The only thing you need to make sure is that it returns the right data and is a subclass of the DatasetMixin. The expected return value is a tuple with the loaded image and the label converted from characters to classes, using the char_map. Remember to pad each word according to your maximum of characters per word.
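
The label conversion and padding described above could look roughly like this. It is a sketch under assumptions: the function names are made up, char_map is assumed to map class indices (as strings) to Unicode code points, and the blank label is used for padding. The get_example method of a custom loader would call something like this and return the result together with the loaded image.

```python
import numpy as np


def invert_char_map(char_map):
    """Turn {"13": 78, ...} into {"N": 13, ...}."""
    return {chr(code_point): int(label) for label, code_point in char_map.items()}


def encode_words(words, char_map, num_words, max_chars, blank_label=0):
    """Return an int32 array of shape (num_words, max_chars), padded with blank_label."""
    if len(words) > num_words:
        raise ValueError("more words than the maximum number of words")
    char_to_label = invert_char_map(char_map)
    labels = np.full((num_words, max_chars), blank_label, dtype=np.int32)
    for row, word in enumerate(words):
        if len(word) > max_chars:
            raise ValueError("word longer than max_chars: %r" % word)
        for col, char in enumerate(word):
            labels[row, col] = char_to_label[char]
    return labels
```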

rezha130 commented 6 years ago

OK @Bartzi. After checking TextRecFileDataset, I think my ground truth file is still not correct for a multi-word detector.

Can you please send an example of the ground truth files that you used for the videos, especially the gt_word.csv ground truth files (with an example of how to write num_timesteps, num_labels, file_name & labels) that you used in Text Recognition.mp4 (one word) and FSNS.mp4 (max two words & max three words) using TextRecFileDataset?

My case is basically the same as FSNS (detect 2 or 3 text regions, then recognize the characters in every detected bounding box).

Thank you

Bartzi commented 6 years ago

okay,

  1. TextRecFileDataset is not used for training FSNS data.
  2. here are the first two lines of the text recognition gt_file:
    23  1
    /data/text_recognition/samples/9999/9999026_]kinkiness_-5_DonegalOne-Regular.jpeg   ]kinkiness
  3. here is an FSNS example for training on 3 text regions
    3       21
    /mnt/ssd/christian/data/fsns/images/train/00000/0.png   67      12      11      1       5       26      20      21      23      0       0       0       0       0       0       0       0       0       0       0        0       23      5       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0              0       0       0       73      11      7       5       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

You should have a look at the FSNS examples, train_text_recognition does some things differently!
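
Reading the FSNS example above, the header line appears to give the number of text regions (timesteps) and the number of labels per region, and each following line holds an image path plus num_timesteps * num_labels class indices. A small consistency check under that reading (the function name is hypothetical, and whitespace-separated fields with no spaces in the path are assumptions):

```python
def check_groundtruth(lines):
    """Validate FSNS-style ground truth lines.

    Returns (num_timesteps, num_labels, samples), where samples is a list of
    (path, labels) tuples. Raises ValueError if a line has the wrong label count.
    """
    header = lines[0].split()
    num_timesteps, num_labels = int(header[0]), int(header[1])
    expected = num_timesteps * num_labels
    samples = []
    for line_number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # skip empty lines
        fields = line.split()
        path, labels = fields[0], [int(value) for value in fields[1:]]
        if len(labels) != expected:
            raise ValueError(
                "line %d: expected %d labels, found %d"
                % (line_number, expected, len(labels))
            )
        samples.append((path, labels))
    return num_timesteps, num_labels, samples
```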

rezha130 commented 6 years ago

Hi @Bartzi

Now I train with train_fsns.py, but I got this error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

What does it mean?

Bartzi commented 6 years ago

hmm, hard to say without the stack trace.

But it basically says that some of the arrays being concatenated do not have the correct shape. This could be because of your input data. Did you make sure to input images that have these dimensions: 600x150?

rezha130 commented 6 years ago

Hi @Bartzi

Input images size is not fixed in train data set. I am using the same image data set as when training with train_text_recognition, which did not produce this kind of error message and ran until the last epoch.

This is my ground truth file in FSNS style:

2   16
mytrain/images/0001.jpg 13  11  12  0   0   0   0   0   0   0   0   0   0   0   0   0   4   2   8   3   1   4   7   4   1   3   9   10  1   1   1   7

and char_map.json

{
    "0": 9250,
    "1": 48,
    "2": 49,
    "3": 50,
    "4": 51,
    "5": 52,
    "6": 53,
    "7": 54,
    "8": 55,
    "9": 56,
    "10": 57,
    "11": 73,
    "12": 75,
    "13": 78
}

I train with this command:

python train_fsns.py curriculum.json log \
--blank-label 0 \
--batch-size 32 \
--is-trainer-snapshot \
--use-dropout \
--char-map char_map.json \
--gpu 0 \
--snapshot-interval 1000 \
--dropout-ratio 0.2 \
--epoch 100 

Please check the full stack trace of the error below:

/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:150: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
Exception in main training loop: all the input array dimensions except for the concatenation axis must match exactly
Traceback (most recent call last):
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 229, in update_core
    batch = self.converter(batch, self._devices[0])
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/dataset/convert.py", line 133, in concat_examples
    [example[i] for example in batch], padding[i])))
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/dataset/convert.py", line 163, in _concat_arrays
    return xp.concatenate([array[None] for array in arrays])
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_fsns.py", line 292, in <module>
    trainer.run()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 229, in update_core
    batch = self.converter(batch, self._devices[0])
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/dataset/convert.py", line 133, in concat_examples
    [example[i] for example in batch], padding[i])))
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/dataset/convert.py", line 163, in _concat_arrays
    return xp.concatenate([array[None] for array in arrays])
ValueError: all the input array dimensions except for the concatenation axis must match exactly

Bartzi commented 6 years ago

Input images size is not fixed in train data set.

That does not work, because the network is not fully convolutional and because it is not possible to create a batch out of images with different sizes. It worked with train_text_recognition.py because there the input images are resized prior to being fed to the network.

The FSNS network expects the images to be of shape 600x150. If that is not the shape your data has, you have to adjust the data loading code (and also the network, as your data is likely to be very different from the original FSNS dataset)!
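
The batching failure can be reproduced without the training script: stacking arrays of different shapes is what xp.concatenate in the traceback rejects. The sketch below uses NumPy only; the helper name is hypothetical and the channel-first (3, 150, 600) layout is an assumption about how the images are stored.

```python
import numpy as np


def find_shape_mismatches(arrays, expected_shape):
    """Return the indices of arrays whose shape differs from expected_shape."""
    expected_shape = tuple(expected_shape)
    return [i for i, array in enumerate(arrays) if array.shape != expected_shape]


batch = [
    np.zeros((3, 150, 600), dtype=np.float32),
    np.zeros((3, 120, 480), dtype=np.float32),  # an image that was not resized
]
print(find_shape_mismatches(batch, (3, 150, 600)))  # [1]

try:
    # Same stacking pattern as in the traceback: add a batch axis, then concatenate.
    np.concatenate([array[None] for array in batch])
except ValueError as error:
    print("cannot build batch:", error)
```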

rezha130 commented 6 years ago

Ok @Bartzi you're right. I resized all my training images to 600x150 pixels.

But now I get IndexError: list index out of range in calc_loss:

/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:150: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
Exception in main training loop: list index out of range
Traceback (most recent call last):
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 231, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/rezha/dataloka/see/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
    self.loss = self.lossfun(self.y, t)
  File "/home/rezha/dataloka/see/chainer/metrics/loss_metrics.py", line 211, in calc_loss
    overall_loss_weight = loss_weights[i - 1]
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_fsns.py", line 292, in <module>
    trainer.run()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 231, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/rezha/dataloka/see/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
    self.loss = self.lossfun(self.y, t)
  File "/home/rezha/dataloka/see/chainer/metrics/loss_metrics.py", line 211, in calc_loss
    overall_loss_weight = loss_weights[i - 1]
IndexError: list index out of range

What happened?

mit456 commented 6 years ago

@rezha130 I think this is because you have more than 3 timesteps in your training set?

rezha130 commented 6 years ago

Hi @mit456, thanks for helping.

Yes, the previous IndexError: list index out of range in calc_loss happens when I am using this ground truth file in FSNS style:

6   22
mytrain/images/01179.jpg    18  31  43  31  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   23  25  22  13  29  5   18  24  19  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Is there any maximum limit on num_timesteps & num_labels in an FSNS-like experiment?

rezha130 commented 6 years ago

@Bartzi & @mit456

Can I use my custom my_char_map.json for FSNS-like training on my custom training data set? Or must I use the fsns_char_map.json that is already provided?

Bartzi commented 6 years ago

@rezha130 Before you resized your images to 600x150, did you check that they have the same semantics as the images of the FSNS dataset? This is important!!

Forget about the loss_weights in loss_metrics.py; they are not useful for your training. I just used them to make it possible to put some emphasis on certain timesteps of the optimization. Technically there is no limit for num_timesteps and num_labels. You can of course use your custom char_map, but you will need to adapt this line and change the number of classes you want to distinguish.

rezha130 commented 6 years ago

(attached image)

After I added label_size as a parameter in self.classifier = L.Linear(None, label_size), the model can be trained.

num_timesteps = 2
num_labels = 16

main/accuracy = 0.5 until last epoch (100 epochs)

@Bartzi something is strange in the bounding box result. What is happening?

rezha130 commented 6 years ago

Hi @Bartzi

As mentioned previously, I added label_size as a parameter in self.classifier = L.Linear(None, label_size), so the model can be trained. But only if num_timesteps is 2 or 3!

I'm using this script to get label_size:

with open(args.char_map, 'r') as fp:
    char_map = json.load(fp)
label_size = len(char_map)

But if I try to train with another custom training data set that has num_timesteps greater than 3, I still get this error:

/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:150: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
Exception in main training loop: list index out of range
Traceback (most recent call last):
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 231, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/rezha/dataloka/see/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
    self.loss = self.lossfun(self.y, t)
  File "/home/rezha/dataloka/see/chainer/metrics/loss_metrics.py", line 211, in calc_loss
    overall_loss_weight = loss_weights[i - 1]
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_fsns.py", line 292, in <module>
    trainer.run()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 231, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/rezha/miniconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/rezha/dataloka/see/chainer/utils/multi_accuracy_classifier.py", line 45, in __call__
    self.loss = self.lossfun(self.y, t)
  File "/home/rezha/dataloka/see/chainer/metrics/loss_metrics.py", line 211, in calc_loss
    overall_loss_weight = loss_weights[i - 1]
IndexError: list index out of range

@mit456 have you experienced the same error with num_timesteps greater than 3?

@Bartzi please help..

Bartzi commented 6 years ago

I told you to delete the loss_weights, this will fix your problem, but it still won't help you that much.
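
To see why the error appears only with more timesteps: the traceback shows loss_weights[i - 1] being indexed inside calc_loss, apparently once per timestep, so a list shorter than num_timesteps runs out. A minimal sketch of the two options, dropping the weights (every timestep counts equally) or passing a list that is long enough. The function name and the example values are illustrative and not taken from loss_metrics.py.

```python
def weighted_total(losses, loss_weights=None):
    """Sum per-timestep losses; without weights every timestep counts equally."""
    if loss_weights is None:
        return sum(losses)
    if len(loss_weights) < len(losses):
        raise ValueError(
            "need %d loss weights, got %d" % (len(losses), len(loss_weights))
        )
    return sum(weight * loss for weight, loss in zip(loss_weights, losses))


per_timestep_losses = [0.5, 0.25, 0.25, 1.0, 2.0]  # five timesteps
print(weighted_total(per_timestep_losses))  # 4.0
```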

I ask again, did you have a look at the FSNS dataset? Did you see that there are always 4 views of the same street name sign that are shown at the same time? Your data does not have this property, so you cannot use this script without modifications! The fact that your predictions already look quite good is, from my point of view, a hint that the network memorizes your data and that your data has few variations, making it easy for the recognition network to memorize (i.e. overfit).

rezha130 commented 6 years ago

Ok @Bartzi, would you please give me the list of py files that I need to modify? Then I can at least focus on debugging some of your scripts, not all of your py files.

Bartzi commented 6 years ago

First, you will need to change the network definition and the way predictions are made (1, 2). You will also need to change the way metrics/loss are calculated (1, 2). Furthermore, you will need to think about whether you want to use curriculum learning or not, and whether you want to plot the current state of the network for each iteration (if you want to do this you might need to make changes in the bbox plotter too, or check whether there is one that is already able to work with your way of making predictions and your way of training).

rezha130 commented 6 years ago

OK @Bartzi

So there are 4 py files that I will try to modify: 2 files for the network definition + 2 files for the loss/metric calculation. I also need to modify the train py script for that.

Now for the bbox plotter: which script in your code should I use if I just want a view of a SINGLE image with a maximum of 2, 3, or more than 4 words/timesteps? I would like to get a plot like this (screenshot from your video), NOT 4 views of the same street name sign shown at the same time (as in the FSNS images):

(attached image)

Bartzi commented 6 years ago

Sounds good so far :sweat_smile:. You could have a look at all the bbox plotters here, you will see that all special classes inherit from the class BBOXPlotter. A good example could be the SVHN BBOXPlotter.

rezha130 commented 6 years ago

Hi @Bartzi

If i try to set for calc_loss

loss_weights = [1, 1.25, 2, 1.25, 1, 1.25, 2, 1.25, 1, 1.25, 2, 1.25, 1, 1.25, 2, 1.25] 
# 16 initial loss weights for max 14 timesteps

just for the longest sequence in my custom training dataset, with max num_timesteps = 14, am I correct? How do you adjust the loss_weights values?

And one more thing, would you please explain the difference in purpose between image_size & target_shape in your train script? Why do the recognition network & BBox plotter use target_shape, while the loss metric calculation uses image_size? Is it ok if I set them to the same value? (I am also using image_size for resizing the image.)

image_size = Size(width=200, height=40)
target_shape = Size(width=200, height=40)

Btw, for the bbox plotter I'm just using the basic one: bbox_plotter_class = BBOXPlotter

Bartzi commented 6 years ago

I suggest that you just delete loss_weights from the code; they might come in handy if you need to get more accuracy out of the model.

The difference between image_size and target_shape is the following:

I hope this also answers your question, why at one place one value is used and somewhere else another value. So it is not advisable to set them to the same value.

rezha130 commented 6 years ago

@Bartzi now I can train with a variety of my custom data sets after some modifications in the train & inference scripts, with no errors. I started from the FSNS examples, but set args.is_original_fsns = False and deleted loss_weights.

However, the recognition result is still not as good as expected. As an example, in this bbox evaluation image from the last epoch, the recognition result looks good:

(attached image)

The BBoxes do not look so good (I used the standard bbox_plotter_class = BBOXPlotter), but the log looks impressive at the last epoch:

{
        "main/loss": 0.28168749809265137,
        "main/accuracy": 0.9703125,
        "validation/main/loss": 0.25905805826187134,
        "validation/main/accuracy": 0.9751243781094526,
        "lr": 9.999999999987483e-05,
        "epoch": 400,
        "iteration": 26700,
        "elapsed_time": 50062.12740638801
    }

But when I run inference with the model on the same image as above, I get the recognition result NIK 3175610990006, which has only 13 numeric chars and differs from the bbox text (16 numeric chars in total). I tried inference on different images and always get 13 numeric chars. I set num_labels = 16 during training. Please help me with this.

Also, how should I set target_shape for the recognition network input? As an example, I set these sizes for the image above. Is this correct? If not, what is the correct size for target_shape?

image_size = Size(width=200, height=40)
target_shape = Size(width=120, height=30)

I set timestep = 2 because I want 2 bboxes, one for the sentence on the left and one for the numeric sequence on the right. Please correct me if I'm wrong.

Bartzi commented 6 years ago

I think it's working very well for you because your dataset is too easy. You said that each transcription starts with the same characters NIK. It is very easy for the network to memorize this, hence it does not need to locate these characters in order to predict them correctly. The same could be true for your numbers. I think if you'd increase the number of training images and add more variety, the network won't be able to memorize the numbers just based on some easy features that have been extracted by the network. You could also try to decrease the capacity of your network (i.e. use a network with fewer parameters).

For your inference problem: did you check whether the network predicts the correct number of labels while the blank tokens are not yet stripped out?

Your target_shape looks good. The number of timesteps also seems to be reasonable.
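
The inference check suggested above can be done by comparing the raw per-region prediction with the version that has blank labels removed. A minimal sketch, assuming the prediction for one region is a flat sequence of class indices and that 0 is the blank label (as passed via --blank-label 0); the function name is hypothetical.

```python
def describe_prediction(labels, blank_label=0):
    """Return (raw_length, labels_without_blanks) for one predicted text region."""
    labels = [int(label) for label in labels]
    stripped = [label for label in labels if label != blank_label]
    return len(labels), stripped


raw_length, stripped = describe_prediction([4, 2, 8, 0, 3, 1, 0, 0])
print(raw_length, len(stripped))  # 8 5
```

If raw_length already differs from num_labels, the problem is probably in how the network output is sliced; if only the stripped length is too short, real characters may be getting dropped in the decoding step.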

rezha130 commented 6 years ago

Thanks @Bartzi. I checked my inference script again and found a mistake on my side. Now I get NIK plus 16 numeric characters as the inference result.

But I am still curious: how can I draw the BBox image correctly for evaluation purposes?

Bartzi commented 6 years ago

As I said, the problem is that the network is able to memorize all data based on easy-to-extract features. That's why your network does not learn to localize the characters: it is lazy and does not need to. Think of a human who can do a task very easily but does not do it the way you want. He still succeeds, so you'll likely have to make the task harder for him!

I think that this is the reason, why the BBoxes are not on the characters.

rezha130 commented 6 years ago

@Bartzi

Maybe it is relatively easy for the network to memorize the word NIK, but I don't think that holds for the following sequence of 16 numeric characters. The training data set has 1300 images (is that enough?), where every image has a unique sequence number. Those images show ID numbers, and there are no duplicate ID values among the records of the ground truth file.

FYI, I also tried another deep learning approach, CRNN (Convolutional Recurrent Neural Network), also with CTC loss but without an STN grid. The CRNN network does well at recognition when there are some identical ground truth values for different images. But CRNN was very hard to get to converge (high loss, near zero accuracy in a thousand epochs, even when I tried many optimizers & learning rates) when the ground truth value is unique in every record of the training data set. The SEE network is better than CRNN in this case.

PS: CRNN is easier to use, because we don't need to adjust additional hyperparameters like the input size of the recognition network after localization, the maximum number of localization bboxes, or the maximum number of characters per word (although CRNN's network capacity still cannot handle more than 26 characters).

Bartzi commented 6 years ago

If you take a closer look at the network architecture of SEE, you will see (no pun intended^^) that the network will achieve good results if and only if the localization network is able to provide enough information for the recognition network to succeed.

Now take a closer look at the image you provided some posts ago. We can see that the localization network did not localize NIK (the blue bbox), because everything starts with NIK, which is very easy for the network to learn. The second localization spans some of the numbers; it seems that this information is enough for the network to correctly identify the rest of the number sequence. This shows us that the task is too easy for the network, mostly because your training set is not large enough (did you try to generate similar looking images with unique sequences?) or because the network as such has too many parameters and hence is highly overfitting to your data.
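
As a starting point for generating more training data with unique sequences, the label side could be produced like this. It is only a sketch: the function name, the 16-digit length (taken from the ID numbers discussed above), and the fixed seed are assumptions, and rendering the sequences into similar-looking images is left out.

```python
import random


def unique_digit_sequences(count, length=16, seed=0):
    """Return `count` distinct random digit strings of the given length."""
    if count > 10 ** length:
        raise ValueError("not enough distinct sequences of this length")
    rng = random.Random(seed)  # fixed seed so the generated set is reproducible
    sequences = set()
    while len(sequences) < count:
        sequences.add("".join(rng.choice("0123456789") for _ in range(length)))
    return sorted(sequences)
```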