Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

curriculum.json problem #46

Closed jxlxt closed 5 years ago

jxlxt commented 6 years ago

Hi @Bartzi, recently I wanted to run train_text_recognition.py, and I am confused about the curriculum.json file. Your README.md says the template should be:

[ 
    {
        "train": "<path to train file>",
        "validation": "<path to validation file>"
    }
]

The question is: how do I specify the path to the train file? If I have the file tree below,

├── ctc_char_map.json
├── curriculum.json
├── train
│   ├── bg_deep_gray0_0.jpg
│   ├── bg_deep_gray0_1.jpg
│   ├── bg_deep_gray0_2.jpg
│   ├── bg_deep_gray0_3.jpg
│   ├── bg_deep_gray0_4.jpg
│   ├── bg_deep_gray0_5.jpg
│   └── bg_deep_gray0_6.jpg
└── validation
    ├── bg_deep_gray0_7.jpg
    ├── bg_deep_gray0_8.jpg
    └── bg_deep_gray0_9.jpg

should I set the path as

[ 
    {
        "train": "~/Documents/GitHub/see/datasets/textrec/train",
        "validation": "~/Documents/GitHub/see/datasets/textrec/validation"
    }
]

so that SEE will read all the images in the train/validation directories? I also tried download_fsns.py to download the FSNS dataset, and I found that the FSNS directory structure looks like the tree below. It really confuses me because there are so many subdirectories; how should I construct the curriculum.json for the FSNS dataset? 😂

Looking forward to your suggestions, thanks!

├── train
│   ├── 00000
│   ├── 00001
│   ├── 00002
│   ├── 00003
│   ├── 00004
│   ├── 00005
│   ├── 00006
│   ├── 00007
│   ├── 00008
│   ├── 00009
│   ├── 00010
│   ├── 00011
│   ├── 00012
│   ├── 00013
│   ├── 00014
│   ├── 00015
│   ├── 00016
│   ├── 00017
│   ├── 00018
│   ├── 00019
│   ├── 00020
│   ├── 00021
│   ├── 00022
│   ├── 00023
│   ├── 00024
│   ├── 00025
│   ├── 00026
│   ├── 00027
│   ├── 00028
│   ├── 00029
│   ├── 00030
│   ├── 00031
│   ├── 00032
│   ├── 00033
│   ├── 00034
│   ├── 00035
│   ├── 00036
│   ├── 00037
│   ├── 00038
│   ├── 00039
│   ├── 00040
│   ├── 00041
│   ├── 00042
│   ├── 00043
│   ├── 00044
│   └── 00045
├── train.csv
└── validation
    ├── 00046
    └── 00047
Bartzi commented 6 years ago

Hi,

With <path to train file> we mean the path to a .csv file that lists all images and their corresponding labels. If you follow all the steps described here, you should end up with a .csv file that you can put into curriculum.json. Once you have that, it should be easy to adjust it to your use case =)

jxlxt commented 6 years ago

Thanks for the quick answer!

jxlxt commented 6 years ago

@Bartzi after following the instructions I got a train.csv file, which I think is the path you meant to add to curriculum.json. Its format looks like this; I'm not sure whether it is correct 😂

image/train/00000/0.png 37  26  5   7   11  5   0   23  5   0   67  12  21  24  5   20  11  9   133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133
image/train/00000/1.png 37  1   1   3   5   0   23  5   8   0   43  20  11  4   8   0   23  5   0   32  5   21  21  5   8   133 133 133 133 133 133 133 133 133 133 133 133
image/train/00000/2.png 49  11  5   0   23  5   8   0   32  5   21  19  12  1   1   6   22  21  5   8   133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133
image/train/00000/3.png 32  65  11  20  21  5   0   23  5   0   30  38  100 4   6   1   1   12  7   133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133

Also, one more question: if we don't use the FSNS dataset and therefore cannot use your transform_gt.py, how do we create a correct train.csv for a new dataset?

Bartzi commented 6 years ago

So far so good! There is one thing I forgot to add: you have to add one line at the top of your ground truth file. This is described here (see bullet point 3). This line tells the code how many regions of text each image contains and the maximum number of characters per region.

If you want to train a model on the FSNS dataset, you would write it as 6 21 (tab separated): 6 stands for a maximum of 6 text regions, and 21 for the fact that each text region can contain at most 21 characters.

If you want to train a text recognition model with one region of text (because it is just one word, or just one line) and a maximum of 23 characters, you would write 23 1 (tab separated). This seems counterintuitive, but the reasoning is that you want to predict a bounding box for each character; that is why you have 23 regions and just 1 character per region.
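
To make this concrete, the first (tab-separated) line of the ground truth file would be

6   21

for the FSNS case, and

23  1

for the one-line text recognition case described above.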

jxlxt commented 6 years ago

Thanks @Bartzi, now I understand how to modify curriculum.json. But with the line added to the ground truth file I get an error. The command I entered is: python train_fsns.py ../curriculum.json log --char-map ../datasets/fsns/fsns_char_map.json --blank-label 0 -b 32 and I modified the contents of train.csv and validation.csv following your suggestion on bullet point 3:

6   21
image/train/00000/0.png 37  26  5   7   11  5   0   23  5   0   67  12  21  24  5   20  11  9   133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133 133
image/train/00000/1.png 37  1   1   3   5   0   23  5   8   0   43  20  11  4   8   0   23  5   0   32  5   21  21  5   8   133 133 133 133 133 133 133 133 133 133 133 133
.
.
.

then the result is

Traceback (most recent call last):
  File "train_fsns.py", line 83, in <module>
    train_dataset, validation_dataset = curriculum.load_dataset(0)
  File "/home/lxt/Github/see/chainer/utils/baby_step_curriculum.py", line 38, in load_dataset
    train_dataset = self.dataset_class(self.train_curriculum[level], **self.dataset_args)
  File "/home/lxt/Github/see/chainer/datasets/file_dataset.py", line 31, in __init__
    self.num_timesteps, self.num_labels = (float(i) for i in next(reader))
  File "/home/lxt/Github/see/chainer/datasets/file_dataset.py", line 31, in <genexpr>
    self.num_timesteps, self.num_labels = (float(i) for i in next(reader))
ValueError: could not convert string to float: '6\t21'

I could not find a good way to solve this, so I manually set self.num_timesteps and self.num_labels to 6 and 21; then the result is

Traceback (most recent call last):
  File "train_fsns.py", line 168, in <module>
    updater = MultiprocessParallelUpdater(train_iterators, optimizer, devices=args.gpus)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 140, in __init__
    assert len(iterators) == len(devices)
AssertionError

I guess this is because of the GPU device numbers and the multi-GPU setting. Can I change the code manually, or is there something wrong with my train.csv file?

Bartzi commented 6 years ago

Did you change any code in chainer/datasets/file_dataset.py? The code shown in your traceback does not look like the code in the repo. That would explain the ValueError you are encountering.

Regarding your groundtruth file:

You are getting your last error because you did not specify a GPU to train on; you should do so in order to avoid such errors.
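
For example (the GPU id 0 and the paths here are only illustrative), a call that trains on a single GPU could look like this:

python train_fsns.py curriculum.json log --char-map ../datasets/fsns/fsns_char_map.json -b 32 -g 0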

jxlxt commented 6 years ago

Unfortunately, @Bartzi, I am still stuck on this problem 😢. I removed my whole local copy and cloned your repo again, and I still get the problem:

Traceback (most recent call last):
  File "train_fsns.py", line 84, in <module>
    train_dataset, validation_dataset = curriculum.load_dataset(0)
  File "/home/lxt/Github/see/chainer/utils/baby_step_curriculum.py", line 38, in load_dataset
    train_dataset = self.dataset_class(self.train_curriculum[level], **self.dataset_args)
  File "/home/lxt/Github/see/chainer/datasets/file_dataset.py", line 31, in __init__
    self.num_timesteps, self.num_labels = (int(i) for i in next(reader))
  File "/home/lxt/Github/see/chainer/datasets/file_dataset.py", line 31, in <genexpr>
    self.num_timesteps, self.num_labels = (int(i) for i in next(reader))
ValueError: invalid literal for int() with base 10: '2\t21'

I think the problem happens when the code evaluates num_timesteps and num_labels. I am confused about (int(i) for i in next(reader)): why do you use int(i), and why can't the program separate '2\t21'? I don't know why 😂.

For the dataset, I use FSNS (the train set is too large, so I only use the test images, rearranged into train and validation directories; an example image was attached here). After following your instructions (download_fsns.py, then tfrecord_to_image.py, then python transform_gt.py <path to downloaded gt> fsns_char_map.json <path to 2 word gt> --max-words 2 --blank-label 0), the first line of the .csv file is 2 21 (tab separated), but when I count the characters, each line contains up to 42 of them:

2   21
tf_image/train/00000/0.png  37  26  5   7   11  5   0   23  5   0   67  12  21  24  5   20  11  9   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
tf_image/train/00000/2.png  49  11  5   0   23  5   8   0   32  5   21  19  12  1   1   6   22  21  5   8   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
tf_image/train/00000/3.png  32  65  11  20  21  5   0   23  5   0   30  38  100 4   6   1   1   12  7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

As for the blank-label value 0: I know I can use swap_classes.py to change 0 to 133. The reason I am trying to train on FSNS is that I eventually want to train on my own dataset, so I want to know how to correctly build the train.csv and validation.csv files. One more thing: if I run transform_gt.py with different maximum word values (like 2, 3, 4, 5, 6 words), how should I build curriculum.json, given that I then have several train.csv files?

Bartzi commented 6 years ago

Okay, now I see: you are trying to work with the FSNS dataset.

We use int(i) because we need to convert the string we read from the file to an integer. The csv reader should already have split the line into a list of two values, so what you are seeing is weird and should not happen.
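
As a minimal sketch of how the metadata line is expected to be parsed (simplified, assuming a tab-separated file; this is not the exact code from file_dataset.py):

import csv

# Simplified sketch, assuming the ground truth file is tab-separated.
with open("train.csv") as gt_file:
    reader = csv.reader(gt_file, delimiter='\t')
    # next(reader) yields the metadata line already split into fields,
    # e.g. ['6', '21'], so the two integers unpack cleanly.
    num_timesteps, num_labels = (int(i) for i in next(reader))

An unsplit value like '6\t21' hints that the file is no longer really tab-separated (for example, after being re-saved by another program).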

I think your .csv looks wrong because you are setting --blank-label to the wrong value. It should be 133 (the default).

If you have different train.csv files, you put them into the array in the json file like:

[
  {
    "train": "<path to first csv (usually the easiest)>",
    "validation": "<path to first csv>"
  },
  {
    "train": "<path to second csv (a lttile bit more difficult)>",
    "validation": "<path to second csv>"
  }
]
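
Filled in with hypothetical file names (one ground truth file per --max-words setting, as in your case), this could look like:

[
  {
    "train": "train_2_words.csv",
    "validation": "validation_2_words.csv"
  },
  {
    "train": "train_3_words.csv",
    "validation": "validation_3_words.csv"
  }
]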
jxlxt commented 6 years ago

Hi @Bartzi, fortunately I solved the file reading error: if I download the .csv file from my server and open it locally with MS Excel, the .csv format changes, so the code can no longer separate its contents. For training, I created train.csv and validation.csv with a maximum of 2 words, and I got a new error. The command I entered is: python train_fsns.py curriculum.json log --char-map ../datasets/fsns/fsns_char_map.json --blank-label 0 -b 32 -g 1 3 and the error I got is:

/home/lxt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/lxt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/lxt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/lxt/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:150: UserWarning: optimizer.eps is changed to 2e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
epoch       iteration   main/loss   main/accuracy  lr          fast_validation/main/loss  fast_validation/main/accuracy  validation/main/loss  validation/main/accuracy
Process _Worker-1:............................................]  0.00%
Traceback (most recent call last):............................]  0.02%
  File "cupy/cuda/memory.pyx", line 810, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc    inf iters/sec. Estimated time to finish: 0:00:00.
  File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.SingleDeviceMemoryPool._alloc
  File "cupy/cuda/memory.pyx", line 423, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 424, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 58, in cupy.cuda.memory.Memory.__init__
  File "cupy/cuda/runtime.pyx", line 212, in cupy.cuda.runtime.malloc
  File "cupy/cuda/runtime.pyx", line 135, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cupy/cuda/memory.pyx", line 816, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.SingleDeviceMemoryPool._alloc
  File "cupy/cuda/memory.pyx", line 423, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 424, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 58, in cupy.cuda.memory.Memory.__init__
  File "cupy/cuda/runtime.pyx", line 212, in cupy.cuda.runtime.malloc
  File "cupy/cuda/runtime.pyx", line 135, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cupy/cuda/memory.pyx", line 822, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.SingleDeviceMemoryPool._alloc
  File "cupy/cuda/memory.pyx", line 423, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 424, in cupy.cuda.memory._malloc
  File "cupy/cuda/memory.pyx", line 58, in cupy.cuda.memory.Memory.__init__
  File "cupy/cuda/runtime.pyx", line 212, in cupy.cuda.runtime.malloc
  File "cupy/cuda/runtime.pyx", line 135, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lxt/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 61, in run
    loss = _calc_loss(self.model, batch)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/lxt/Github/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/home/lxt/Github/see/chainer/models/fsns.py", line 525, in __call__
    return self.recognition_net(images, h)
  File "/home/lxt/Github/see/chainer/models/fsns_resnet.py", line 44, in __call__
    h = self.resnet(rois, layers=['res5', 'pool5'])
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/links/model/vision/resnet.py", line 199, in __call__
    h = func(h)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/links/model/vision/resnet.py", line 551, in __call__
    x = l(x)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/links/model/vision/resnet.py", line 596, in __call__
    h1 = self.bn3(self.conv3(h1))
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/links/normalization/batch_normalization.py", line 144, in __call__
    running_var=self.avg_var, decay=decay)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/functions/normalization/batch_normalization.py", line 718, in batch_normalization
    (x, gamma, beta))[0]
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/function_node.py", line 258, in apply
    outputs = self.forward(in_data)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/functions/normalization/batch_normalization.py", line 155, in forward
    y = cuda.cupy.empty_like(x)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/cupy/creation/basic.py", line 41, in empty_like
    return cupy.ndarray(a.shape, dtype=dtype)
  File "cupy/core/core.pyx", line 96, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 468, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 964, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 985, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 765, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 828, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 124518400 bytes (total 11891251200 bytes)

I think I am very close to running the FSNS training 😄. I have multiple GPUs with 11GB memory each, but I don't know how to use multiple GPUs from the command line. I also tried reducing the batch size to 8 with the command python train_fsns.py curriculum.json log --char-map ../datasets/svhn/svhn_char_map.json --blank-label 0 -b 8 -g 1, and then I got a new error:

/home/lxt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/lxt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/lxt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/lxt/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:150: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
  format(optimizer.eps))
Exception in main training loop: '82'
Traceback (most recent call last):
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 231, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/lxt/Github/see/chainer/utils/multi_accuracy_classifier.py", line 48, in __call__
    reported_accuracies = self.accfun(self.y, t)
  File "/home/lxt/Github/see/chainer/metrics/loss_metrics.py", line 254, in calc_accuracy
    word = "".join(map(self.label_to_char, word))
  File "/home/lxt/Github/see/chainer/metrics/loss_metrics.py", line 181, in label_to_char
    return chr(self.char_map[str(label)])
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_fsns.py", line 292, in <module>
    trainer.run()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 231, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 262, in _calc_loss
    return model(*in_arrays)
  File "/home/lxt/Github/see/chainer/utils/multi_accuracy_classifier.py", line 48, in __call__
    reported_accuracies = self.accfun(self.y, t)
  File "/home/lxt/Github/see/chainer/metrics/loss_metrics.py", line 254, in calc_accuracy
    word = "".join(map(self.label_to_char, word))
  File "/home/lxt/Github/see/chainer/metrics/loss_metrics.py", line 181, in label_to_char
    return chr(self.char_map[str(label)])
KeyError: '82'

Do you have any suggestions? And if I want to build train.csv and validation.csv for my own dataset, could you give me some guidance? Thank you so much for answering my questions with helpful suggestions these days! 👍

Bartzi commented 6 years ago

Yes, I do have suggestions ;)

First, using --gpu 1 3 is not a good idea, as you waste a lot of GPU bandwidth for nothing; it is better to use consecutive GPU ids, and you can also do everything with just one GPU at a time. Second, you already figured out that a batch size of 32 is too high ;) Third, you got your error because you used the wrong char_map file: you should use fsns_char_map.json. Char maps have to be created for each use case.

This brings us to your dataset question:

  1. What kind of data do you have? Are you using data similar to the FSNS dataset, or data closer to a text recognition dataset with already cropped words? Everything depends on that.
  2. Once we have figured that out, we can talk about creating train and validation files, because the approaches are a little different for each case.
jxlxt commented 6 years ago

Hi @Bartzi, I found that my second command used the SVHN char_map, which caused the KeyError, and I finally got the FSNS training running! Excited about it. Training is estimated to take 5 days, and I will check the results. As for my own dataset, some example images were attached here; this data is not like the FSNS dataset, which has four different views of one street sign. Each sample is just one image, like normal text recognition. So I have two questions:

  1. How do I build the char_map?
    • Maybe in the future I will train on a Chinese image dataset, where I cannot use a normal English char_map anymore, so I will have to build my own. I previously built a char_dict and ord_map for a CRNN text recognition model, but not for an end-to-end model like SEE. Part of the char_dict and ord_map looks like this:
      {
      "21834": "\u554a",
      "38463": "\u963f",
      "22467": "\u57c3",
      "25384": "\u6328",
      "21710": "\u54ce",
      "21769": "\u5509",
      "21696": "\u54c0",
      "30353": "\u7691",
      "30284": "\u764c",
      "34108": "\u853c",
      .
      .
      .
      }
      {
      "0": "21834", 
      "1": "38463", 
      "2": "22467", 
      "3": "25384", 
      "4": "21710", 
      "5": "21769", 
      "6": "21696", 
      .
      .
      .
      }

      So could I reuse the char_map from that other model for training SEE text recognition?

  2. How do I build the groundtruth file?
    • Another problem when training a new dataset is building the groundtruth file. I know it contains the path of each image and the image's label, but I don't know how to build the correct label format for training and validation.

Looking forward to your suggestions. Thank you very much! 👍

Bartzi commented 6 years ago

Alright,

good to know that it worked for you!

Your own datasets look promising, but I have to say that it might be difficult to make SEE work on these images, because they contain quite a lot of text regions. You'll need to carefully design a curriculum learning strategy in order to successfully train a model on them. This means you should start with images containing only one word, then move on to images containing more than one word, and so forth. You might even need to change some more parts of the approach (but that is where research kicks in).

How to build a char_map

You already have something very similar to the char_map we've used. I think you can reuse it; you will just need to change some portions of our code and you should be good to go. The most important thing is that each key in your char_map corresponds to a class the network has as output, but that should already be the case for your data, as far as I can see. If you want to create a new char_map, have a look at this comment, where I explain a char map in detail. Don't forget the blank char, which represents the prediction of no character.
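
Judging from how the code uses the map (chr(self.char_map[str(label)]) in the traceback above), a char_map is a JSON object mapping each class label to a Unicode code point. A minimal hypothetical sketch, where class 0 is the blank label (9250 is U+2422, the Unicode blank symbol; the actual choice is up to you) and classes 1 to 3 map to 'a', 'b' and 'c':

{
    "0": 9250,
    "1": 97,
    "2": 98,
    "3": 99
}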

Groundtruth

The groundtruth format you need is quite similar to the groundtruth format for the FSNS dataset. Each line in the groundtruth file contains the annotation for one image. The first column is the absolute path of the image on your machine. The remaining columns are the classes for each character of your prediction. Here you have to be careful: as we are predicting several words and each word has a different length, we have to pad the annotation for each word to the same length. That means: if the longest word in your dataset has 23 characters, each word in your groundtruth file needs to be padded to 23 characters (for a word with 10 characters, you write the classes of its 10 characters and then add 13 blank character classes). Once you are done with the first word, you append the second word to the line, and so on.
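
A small sketch of this padding scheme (a hypothetical helper, not code from the repo; char_classes is assumed to map characters to class labels, and blank is the class of the blank label):

def encode_line(image_path, words, char_classes, max_word_length, blank):
    # Build one tab-separated groundtruth line for a single image.
    columns = [image_path]
    for word in words:
        labels = [char_classes[c] for c in word]
        # pad every word to the same length with the blank class
        labels += [blank] * (max_word_length - len(labels))
        columns.extend(str(label) for label in labels)
    return "\t".join(columns)

For a 10-character word and max_word_length = 23, this writes the 10 character classes followed by 13 blank labels, exactly as described above.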

Metadata in the first line

You already know about the metadata line in the first row of the groundtruth file. This line is mainly important for training under a curriculum, as it tells the code how to change the padding of the groundtruth once the difficulty is increased.

I hope that makes sense to you so far!

jxlxt commented 6 years ago

Hi @Bartzi, first, thanks for your guidance; next I would like to ask some questions I ran into these days 😂. I successfully trained my own dataset using train_text_recognition.py, but I still have some problems with the groundtruth file. At first, I tried to build a CSV file like the FSNS dataset; part of it is shown below:

28  28
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_0.jpg    1757    327 1519    1360    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_1.jpg    772 556 887 1227    1791    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_2.jpg    1191    603 1599    25  1001    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_3.jpg    140 876 3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290    3290

Because my images contain many Chinese characters, the class numbers are large, and 3290 is the blank label, like 133 for the FSNS dataset. However, this kind of groundtruth file does not work and produces some bugs. After reading #18, I changed both num_timesteps and num_labels to 28.

num_labels probably means the maximum length of one word, but I am still confused about num_timesteps. In your paper's demo figure I see 4 bounding boxes, and your code also sets 4 iterations; does num_timesteps mean the maximum number of bounding boxes, the maximum number of words in the image, or something else?

I then went to your homepage, downloaded the example textrec dataset, and changed my groundtruth file to look like this:

28  28
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_0.jpg    携着樱花
/home/text/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_1.jpg   纯真的孩子
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_2.jpg    在高原之上
/home/lxt/Github/see/datasets/textrec/predata_crnn_notoken_train/crnn_pre_bg_img_notoken/bg_deep_green0_ch_3.jpg    泥坑

This means I set the metadata values to be equal and do not need to convert the labels into the corresponding numbers one by one. Luckily, this second version of the groundtruth file makes train_text_recognition.py work correctly. I think the reason lies in your original train_text_recognition.py: reading the code in see/chainer/datasets/file_dataset.py, the two classes FileBasedDataset and TextRecFileDataset differ, and you use TextRecFileDataset in this script. TextRecFileDataset (around line 122) and FileBasedDataset (around line 35) read the data differently: one reads only line[1]

for line in reader:
    file_name = line[0]
    labels = line[1]  # the label is kept as a single string
    self.file_names.append(file_name)
    self.labels.append(labels)

and the other reads line[1:]

for line in reader:
    file_name = line[0]
    labels = np.array(line[1:], dtype=np.int32)  # all remaining columns parsed as integer class labels
    self.file_names.append(file_name)
    self.labels.append(labels)

This is my first question: does this mean that for one-line recognition I use TextRecFileDataset, and for multi-line recognition I should use FileBasedDataset? The second question is about multi-word recognition. I can finally train my own dataset with SEE, but it only contains single, already cropped lines. I think the STN part of SEE contributes very little to recognition on this kind of image; in other words, it wastes SEE's talent for end-to-end recognition (example images were attached). So if an image contains multiple words to recognize, how do I build the second version of the groundtruth file? I notice the FSNS dataset uses a blank label to separate words, such as 'AVENUE', 'de' and 'BROCEAUX'. Should I do the same for my own dataset? I'm afraid I will run into errors, because it did not even work for one-line images 😂 (an example FSNS line is shown below).

tf_image/train/00000/0.png  37  26  5   7   11  5   0   23  5   0   67  12  21  24  5   20  11  9   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Thanks for your patience and suggestions! 👍 Best wishes

Bartzi commented 6 years ago

Oh, that is something I forgot to mention: I changed the way the groundtruth file should look for training a model on pure text recognition. But you already figured that out by yourself ;).

So setting both values to 28 does not make much sense for text recognition. You actually want num_timesteps to be 28 and num_labels to be 1, because you want a bounding box for each character, and each of these bounding boxes is supposed to contain at most one character.
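
Concretely, the metadata line for this setting would be (tab separated):

28  1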

And yes, the two dataset readers differ just because of the way you want to do the recognition. I found it easier to read this kind of dataset with that code. It should be possible to make some changes and prepare the groundtruth files for the other datasets in the same way. So if you do not want to change the code, you will need to use TextRecFileDataset for one-line recognition and FileBasedDataset for multi-line recognition.

Recognition on one line does not "waste the talent of SEE". SEE predicts a bounding box for each character in the crop and then recognizes each single character. Imagine a crop that is not perfectly aligned with the text, where the text is bent or deformed in other ways: using the spatial transformer, we are able to correct these errors and still get good recognition accuracy. That was basically the idea behind these experiments.

The original FSNS dataset supplies the annotation for each image as one line of text. During our experiments we saw that SEE cannot handle this type of annotation, which is why we split the annotation into single words. As we want to use batches during training, we have to make sure that the length of each word is the same, so we pad each word to the same length. As we also have images with only one word as well as images with more than one word, we further pad the number of words so that it is the same for each image, even though most of that annotation is just the blank label.
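
As a small hypothetical example: with at most 2 words of at most 5 characters each, and B standing for the blank class, a one-word image containing 'de' would be annotated as

image.png   <class of d>    <class of e>    B   B   B   B   B   B   B   B

i.e. the word itself is padded to 5 labels, and a second, entirely blank word is appended so that every image has the same number of label columns.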

Does this make any sense to you?

jxlxt commented 6 years ago

Hi @Bartzi, thanks for your explanation! I had misunderstood your meaning: for the demo figure in your paper, I guess you set num_timesteps to 4 and num_labels to 8 or 9, so the number of bounding boxes can actually be chosen freely rather than being fixed. (Because of the demo figure, I used to think SEE supported a maximum of 4 bounding boxes 😂.) Recently I have been trying SEE on FSNS and on the small English word dataset from your homepage (if I cannot reach results similar to your paper on the same datasets, I cannot transfer the model to other datasets). However, the accuracy on both datasets is 0. For the FSNS dataset, the output log file is

[
    {
        "main/loss": 38.32195281982422,
        "main/accuracy": 0.49,
        "lr": 0.0030856589376553494,
        "epoch": 0,
        "iteration": 100,
        "elapsed_time": 183.68570357607678,
        "log_dir": "log/2018-08-15T09:13:03.554987_training",
        "image_size": [
            150,
            150
        ],
        "target_size": [
            75,
            100
        ],
        "localization_net": [
            "FSNSSingleSTNLocalizationNet",
            "fsns.py"
        ],
        "recognition_net": [
            "FSNSRecognitionResnet",
            "fsns_resnet.py"
        ],
        "fusion_net": [
            "FSNSNet",
            "fsns.py"
        ],
        "area_factor": 0,
        "area_scale_factor": 2,
        "aspect_factor": 0,
        "batch_size": 8,
        "blank_label": 0,
        "char_map": "../datasets/fsns/fsns_char_map.json",
        "dataset_specification": "curriculum.json",
        "dropout_ratio": 0.5,
        "epochs": 10,
        "freeze_localization": false,
        "gpus": [
            3
        ],
        "is_original_fsns": true,
        "is_trainer_snapshot": false,
        "learning_rate": 0.01,
        "learning_rate_step_size": 0.1,
        "load_localization": false,
        "load_recognition": false,
        "log_interval": 100,
        "log_name": "training",
        "no_log": true,
        "optimize_all_interval": 5,
        "port": 1337,
        "resume": null,
        "send_bboxes": false,
        "snapshot_interval": 20000,
        "test_image": null,
        "test_interval": 1000,
        "test_iterations": 200,
        "timesteps": 3,
        "use_dropout": false,
        "zoom": 0.9
    },
    {
        "main/loss": 34.28860092163086,
        "main/accuracy": 0.475,
        "lr": 0.004258534616241294,
        "epoch": 0,
        "iteration": 200,
        "elapsed_time": 367.35264407796785
    },
    {
        "main/loss": 33.9091911315918,
        "main/accuracy": 0.45,
        "lr": 0.00509208177314456,
        "epoch": 0,
        "iteration": 300,
        "elapsed_time": 546.1001480510458
    },
    {
        "main/loss": 33.40351867675781,
        "main/accuracy": 0.47,
        "lr": 0.0057429443144893875,
        "epoch": 0,
        "iteration": 400,
        "elapsed_time": 724.4937521819957
    },
    {
        "main/loss": 33.40519714355469,
        "main/accuracy": 0.445,
        "lr": 0.006273922657626689,
        "epoch": 0,
        "iteration": 500,
        "elapsed_time": 901.4223207770847
    },

.
.
.
.
    {
        "main/loss": NaN,
        "main/accuracy": 0.0,
        "lr": 0.01,
        "epoch": 2,
        "iteration": 100600,
        "elapsed_time": 183295.8161171181
    },
    {
        "main/loss": NaN,
        "main/accuracy": 0.0,
        "lr": 0.01,
        "epoch": 2,
        "iteration": 100700,
        "elapsed_time": 183463.77702326095
    },
    {
        "main/loss": NaN,
        "main/accuracy": 0.0,
        "lr": 0.01,
        "epoch": 2,
        "iteration": 100800,
        "elapsed_time": 183632.05294044595
    },
    {
        "main/loss": NaN,
        "main/accuracy": 0.0,
        "lr": 0.01,
        "epoch": 2,
        "iteration": 100900,
        "elapsed_time": 183800.3940849309
    },
    {
        "main/loss": NaN,
        "main/accuracy": 0.0,
        "lr": 0.01,
        "fast_validation/main/loss": NaN,
        "fast_validation/main/accuracy": 0.0,
        "epoch": 2,
        "iteration": 101000,
        "elapsed_time": 184047.21759898728
    }
]

and for the small English word dataset, the output log file is

[
    {
        "main/loss": 1.5535175800323486,
        "main/accuracy": 0.0,
        "validation/main/loss": 1.4019248485565186,
        "validation/main/accuracy": 0.0,
        "lr": 0.0030856589376553494,
        "epoch": 7,
        "iteration": 100,
        "elapsed_time": 95.50005288701504,
        "log_dir": "log/small_en/2018-08-28T15:09:58.413071_training",
        "image_size": [
            64,
            200
        ],
        "target_size": [
            50,
            50
        ],
        "localization_net": [
            "InverseCompositionalLocalizationNet",
            "ic_stn.py"
        ],
        "recognition_net": [
            "TextRecognitionNet",
            "text_recognition.py"
        ],
        "fusion_net": [
            "TextRecNet",
            "text_recognition.py"
        ],
        "area_factor": 0,
        "area_scale_factor": 2,
        "aspect_factor": 0,
        "batch_size": 60,
        "blank_label": 0,
        "char_map": "../datasets/small_dataset/ctc_char_map.json",
        "dataset_specification": "../datasets/small_dataset/curriculum.json",
        "dropout_ratio": 0.5,
        "epochs": 20,
        "freeze_localization": false,
        "gpus": [
            0
        ],
        "is_trainer_snapshot": false,
        "learning_rate": 0.01,
        "learning_rate_step_size": 0.1,
        "load_localization": false,
        "load_recognition": false,
        "log_interval": 100,
        "log_name": "training",
        "no_log": true,
        "num_processes": null,
        "optimize_all_interval": 5,
        "port": 1337,
        "refinement": false,
        "refinement_steps": 1,
        "render_all_bboxes": false,
        "resume": null,
        "send_bboxes": false,
        "snapshot_interval": 20000,
        "test_image": null,
        "test_interval": 1000,
        "test_iterations": 200,
        "timesteps": 3,
        "use_dropout": false,
        "use_serial_iterator": false,
        "zoom": 0.9
    },
    {
        "main/loss": 1.3228240013122559,
        "main/accuracy": 0.0,
        "validation/main/loss": 0.9676698446273804,
        "validation/main/accuracy": 0.0,
        "lr": 0.004258534616241294,
        "epoch": 15,
        "iteration": 200,
        "elapsed_time": 190.4235925329849
    }
]

I didn't make any changes to your original code. For the small English words, I changed the learning rate (lr) and zoom as in issue #8, but it still does not work. Is there anything wrong with my model or my settings?

Bartzi commented 6 years ago

The small English words dataset is not intended to be used for training at all; it is just way too small. I only added it in order to provide an example groundtruth file, so I'm not surprised that it is not working for you. You can try the Synthetic Word dataset (http://www.robots.ox.ac.uk/~vgg/data/text/) provided by Max Jaderberg. You just need to adjust its groundtruth to the format supplied in the example dataset.

For FSNS: so far it looks quite okay. What does your curriculum look like? How did you perform the training? A learning rate of 0.01 is way too high! You should use a learning rate closer to 1e-4, 1e-5, or 1e-6; otherwise you will get NaN at some point because the network diverges. NaN can also occur if there is a division by zero somewhere, but if you are using the supplied code, that should not happen anymore...

jxlxt commented 6 years ago

Does that mean I cannot use SEE to train on small datasets that only contain hundreds or thousands of images? Or can I train an SEE model and then transfer it to a small dataset with fine-tuning?

Bartzi commented 6 years ago

It depends on how difficult the task is. Fine-tuning an existing model might work, but I know that SEE did not converge (when trained from scratch) on this very small example dataset with only a few hundred images.

That is not to say the approach as such does not work for small datasets; it was just not my focus while working on this, so it might still be possible.

jxlxt commented 5 years ago

Thanks for your help, @Bartzi! I don't think I will have further problems, so I will close this issue.