harshalcse opened this issue 5 years ago
It is used to build new models from scratch. No fine-tuning from the SVHN or FSNS models is necessary.
What needs to be taken care of while preprocessing the dataset?
Hmm, not much I suppose... Just make sure you have already cropped text lines... you don't really need to create a multi-step curriculum for text recognition.
Right now, training on the SVHN dataset works, but I ran into a problem when I used a custom char_map.json, because I want to train my custom model for alphanumeric characters only:
python3 chainer/train_text_recognition.py /root/small_dataset_2/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map /root/small_dataset_2/ctc_char_map.json --gpu 0 --snapshot-interval 1000 --dropout-ratio 0.2 --epoch 200 -lr 0.0001
{
"0": 9250,
"1": 48,
"2": 49,
"3": 50,
"4": 51,
"5": 52,
"6": 53,
"7": 54,
"8": 55,
"9": 56,
"10": 57,
"11": 45,
"12": 65,
"13": 66,
"14": 67,
"15": 68,
"16": 69,
"17": 70,
"18": 71,
"19": 72,
"20": 74,
"21": 75,
"22": 76,
"23": 77,
"24": 78,
"25": 80,
"26": 82,
"27": 83,
"28": 84,
"29": 85,
"30": 86,
"31": 87,
"32": 88,
"33": 89,
"34": 90
}
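For context, here is a minimal sketch (not the repo's actual code) of how a CTC char_map like the one above is consumed: the traceback further down shows the repo doing chr(self.char_map[str(label)]), so each predicted class index is looked up by its string key and turned into a character. labels_to_text is a hypothetical helper name.

```python
# Sketch: decode predicted class indices with a char_map like the one
# above. Each entry maps a class index (as a string key) to a Unicode
# codepoint; class 0 is the blank label here.
char_map = {"0": 9250, "1": 48, "12": 65, "13": 66, "14": 67}  # excerpt of the map above

def labels_to_text(labels, char_map, blank_label=0):
    """Drop CTC blanks and turn the remaining class indices into characters."""
    chars = []
    for label in labels:
        if label == blank_label:
            continue  # the blank label carries no character
        chars.append(chr(char_map[str(label)]))
    return "".join(chars)

print(labels_to_text([12, 13, 14, 0, 1], char_map))  # -> ABC0
```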
My gt_word.csv file looks like this:
17 1
/root/small_dataset_2/9999/0.JPG MRHDG1840KP033812
/root/small_dataset_2/9999/1.JPG MRHRW2840KP060067
/root/small_dataset_2/9999/2.JPG MRHDG1847KP033824
/root/small_dataset_2/9999/3.JPG MRHRW2850KP062158
/root/small_dataset_2/9999/5.JPG MRHDG1840KP032255
/root/small_dataset_2/9999/6.JPG MRHRW6830KP102532
/root/small_dataset_2/9999/7.JPG MRHRU5870KP101363
/root/small_dataset_2/9999/9.JPG MRHRU5850KP100742
/root/small_dataset_2/9999/10.JPG MRHRW1850KP081060
/root/small_dataset_2/9999/11.JPG MRHDG1845KP032378
but I got the following error:
format(optimizer.eps))
Exception in main training loop: '35'
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
loss = _calc_loss(self._master, batch)
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
return model(*in_arrays)
File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 48, in __call__
reported_accuracies = self.accfun(self.y, t)
File "/root/see-master/chainer/metrics/textrec_metrics.py", line 47, in calc_accuracy
word = "".join(map(self.label_to_char, word))
File "/root/see-master/chainer/metrics/loss_metrics.py", line 181, in label_to_char
return chr(self.char_map[str(label)])
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_text_recognition.py", line 299, in <module>
trainer.run()
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3.5/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in update_core
loss = _calc_loss(self._master, batch)
File "/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 269, in _calc_loss
return model(*in_arrays)
File "/root/see-master/chainer/utils/multi_accuracy_classifier.py", line 48, in __call__
reported_accuracies = self.accfun(self.y, t)
File "/root/see-master/chainer/metrics/textrec_metrics.py", line 47, in calc_accuracy
word = "".join(map(self.label_to_char, word))
File "/root/see-master/chainer/metrics/loss_metrics.py", line 181, in label_to_char
return chr(self.char_map[str(label)])
KeyError: '35'
Please help me out with this issue.
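A minimal, self-contained reproduction of the failure mode (assumed cause, not verified against the repo): the char_map above defines classes "0" through "34", so any class index of 35 or more coming out of the network has no entry, and the chr(char_map[str(label)]) lookup raises exactly the KeyError: '35' seen above.

```python
# Minimal reproduction (assumed cause): the char_map defines classes
# "0".."34", but a network whose classification layer was sized for a
# larger alphabet can still emit class 35, and the lookup then fails.
char_map = {str(i): 65 + i for i in range(35)}  # hypothetical 35-class map

def label_to_char(label, char_map):
    # mirrors the failing line in loss_metrics.py: chr(self.char_map[str(label)])
    return chr(char_map[str(label)])

print(label_to_char(0, char_map))   # -> A
try:
    label_to_char(35, char_map)
except KeyError as missing_class:
    print("no entry for class", missing_class)  # -> no entry for class '35'
```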
Remember: the char_map is only used as a mapping from a predicted class to a character. In order to make the code work with another char_map that has fewer classes, you'll need to also adjust the output of the classification layer of the recognition network.
I used a different character map file to predict classes, but how do I adjust the output of the classification layer of the recognition network? Please guide me.
Please find the command below:
python3 chainer/train_text_recognition.py /root/small_dataset_2/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map /root/small_dataset_2/ctc_char_map.json --gpu 0 --snapshot-interval 1000 --dropout-ratio 0.2 --epoch 200 -lr 0.0001
/usr/lib/python3.5/site-packages/chainer/backends/cuda.py:98: UserWarning: cuDNN is not enabled.
Please reinstall CuPy after you install cudnn
(see https://docs-cupy.chainer.org/en/stable/install.html#install-cudnn).
'cuDNN is not enabled.\n'
/usr/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
/usr/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:151: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
epoch iteration main/loss main/accuracy lr fast_validation/main/loss fast_validation/main/accuracy validation/main/loss validation/main/accuracy
Exception in thread prefetch_loop:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 552, in _fetch_run
data = _fetch_dataset[index]
File "/usr/lib/python3.5/site-packages/chainer/dataset/dataset_mixin.py", line 67, in __getitem__
return self.get_example(index)
File "/root/see-master/chainer/datasets/file_dataset.py", line 144, in get_example
labels = self.get_labels(self.labels[i])
File "/root/see-master/chainer/datasets/file_dataset.py", line 163, in get_labels
labels = [int(self.reverse_char_map[ord(character)]) for character in word]
File "/root/see-master/chainer/datasets/file_dataset.py", line 163, in <listcomp>
labels = [int(self.reverse_char_map[ord(character)]) for character in word]
KeyError: 79
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 453, in _run
alive = self._task()
File "/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 475, in _task
data_all = future.get(_response_time)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
KeyError: 79
/usr/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py:31: TimeoutWarning: Stalled dataset is detected. See the documentation of MultiprocessIterator for common causes and workarounds:
https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html
MultiprocessIterator.TimeoutWarning)
Please help.
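One way to narrow down a KeyError like this: 79 is the codepoint of 'O', which indeed has no entry in the char map above, so some ground-truth word must contain an 'O'. The sketch below scans ground-truth rows for characters whose codepoints are missing from the map; find_unmapped_chars is a hypothetical helper, not part of the repo, and it assumes the header line (e.g. 17 1) has already been skipped.

```python
# Sketch: report every ground-truth character whose codepoint is missing
# from the char_map. KeyError: 79 above means some word contains
# chr(79) == 'O', for which the map above has no entry.
import json

def find_unmapped_chars(gt_lines, char_map):
    """gt_lines: '<image path> <label>' rows (header line already skipped)."""
    known_codepoints = set(char_map.values())
    missing = set()
    for line in gt_lines:
        _, word = line.rsplit(maxsplit=1)
        for character in word:
            if ord(character) not in known_codepoints:
                missing.add(character)
    return missing

# In practice the map would come from the file passed via --char-map, e.g.:
# char_map = json.load(open("/root/small_dataset_2/ctc_char_map.json"))
demo_map = {"0": 9250, "1": 65, "2": 66}  # tiny excerpt-style map
print(find_unmapped_chars(["img.jpg ABO"], demo_map))  # -> {'O'}
```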
Hmm, okay, I had a look at the code again. It turns out the number of classes is already automatically adjusted based on the number of entries in the char_map. So your last problem is caused by having characters in your ground truth that are not in the char_map. As for the error before, I don't really know why it happens; it shouldn't.
When that error comes, training curriculum has finished. terminating the training process. appears.
When the error KeyError: '35' comes, do you also get training curriculum has finished. terminating the training process.?
When training of the dataset starts, the following error comes:
5000 2.07316 0 9.96634e-05 2.04356 0 2.05553 0
total [###...............................................] 7.04%
this epoch [###...............................................] 7.46%
5000 iter, 14 epoch / 200 epochs
0.12627 iters/sec. Estimated time to finish: 6 days, 1:17:57.009388.
enlarging datasets
Training curriculum has finished. Terminating the training process.
Then training of the dataset halted. Please help.
It seems that the system thinks your training converged enough that it can use the second level in the curriculum, which does not exist ([look here](https://github.com/Bartzi/see/blob/master/chainer/utils/baby_step_curriculum.py#L82))... you could set the parameter min_delta of the curriculum to a very small value (like 1e-8).
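Roughly, the plateau test behind enlarging datasets can be sketched like this (assumed behaviour based on the linked baby_step_curriculum.py, not its exact code): the curriculum advances to the next level once the loss stops improving by at least min_delta, which is why a large min_delta makes it advance almost immediately.

```python
# Rough sketch (assumed behaviour, not the repo's exact code) of the
# plateau test behind "enlarging datasets": if the loss did not improve
# by at least min_delta, the curriculum moves to the next level.
def should_enlarge(previous_loss, current_loss, min_delta):
    return (previous_loss - current_loss) < min_delta

# With min_delta=1.0 almost any training step counts as a plateau, so the
# curriculum advances (and terminates if there is no next level). With
# 1e-8 it only advances once the loss is genuinely flat.
print(should_enlarge(2.11, 2.10, min_delta=1.0))   # -> True
print(should_enlarge(2.11, 2.10, min_delta=1e-8))  # -> False
```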
I am using min_delta = 1.0
I train the model using the following command:
python3 chainer/train_text_recognition.py /root/small_dataset_4/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map /root/small_dataset_4/ctc_char_map.json --gpu 0 --snapshot-interval 20000 --dropout-ratio 0.2 --epoch 200 -lr 0.0001
My curriculum JSON is as follows:
[
{
"train": "/root/small_dataset_4/gt_word.csv",
"validation": "/root/small_dataset_4/gt_word.csv"
}
]
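Since this curriculum has only one level, there is no second level to enlarge into, so the first enlarging datasets step necessarily ends with Training curriculum has finished. A quick check of that (remaining_levels is a hypothetical helper, not part of the repo):

```python
# Sketch: count how many curriculum levels remain after the current one.
# A one-level curriculum (like the JSON above) has nothing to enlarge
# into, so the first "enlarging datasets" step ends the training.
import json

curriculum_json = """
[
    {"train": "/root/small_dataset_4/gt_word.csv",
     "validation": "/root/small_dataset_4/gt_word.csv"}
]
"""

def remaining_levels(curriculum, current_level):
    return len(curriculum) - (current_level + 1)

curriculum = json.loads(curriculum_json)
print(remaining_levels(curriculum, 0))  # -> 0: no next level to move to
```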
Epochs, iterations and batches are as follows
epoch iteration main/loss main/accuracy lr fast_validation/main/loss fast_validation/main/accuracy validation/main/loss validation/main/accuracy
0 100 2.77838 0 3.08566e-05
0 200 2.3851 0 4.25853e-05
0 300 2.33045 0 5.09208e-05
0 400 2.28616 0 5.74294e-05
1 500 2.25634 0 6.27392e-05 2.26484 0
1 600 2.26055 0 6.71828e-05
1 700 2.2739 0 7.0964e-05
1 800 2.23831 0 7.42193e-05
2 900 2.27829 0 7.70463e-05 2.24612 0
2 1000 2.23279 0 7.95176e-05 2.24668 0
2 1100 2.2495 0 8.16892e-05
2 1200 2.24512 0 8.36054e-05
3 1300 2.34768 0 8.53021e-05 2.36744 0
3 1400 2.23765 0 8.68087e-05
3 1500 2.22382 0 8.81497e-05
3 1600 2.21226 0 8.93457e-05
4 1700 2.23709 0 9.04141e-05 2.50032 0
4 1800 2.20482 0 9.13701e-05
4 1900 2.23361 0 9.22265e-05
4 2000 2.18711 0 9.29946e-05 2.17601 0
5 2100 2.1789 0 9.36842e-05 2.17091 0
5 2200 2.16862 0 9.43037e-05
5 2300 2.16191 0 9.48608e-05
5 2400 2.16445 0 9.5362e-05
6 2500 2.15081 0 9.58132e-05 2.14614 0
6 2600 2.14921 0 9.62197e-05
6 2700 2.1292 0 9.6586e-05
6 2800 2.12386 0 9.69162e-05
7 2900 2.12777 0 9.7214e-05 2.14358 0
7 3000 2.12236 0 9.74827e-05 2.09999 0
7 3100 2.11735 0 9.77252e-05
7 3200 2.13403 0 9.7944e-05
8 3300 2.10966 0 9.81416e-05 2.09331 0
8 3400 2.10381 0 9.83201e-05
8 3500 2.14036 0 9.84812e-05
8 3600 2.11325 0 9.86268e-05
8 3700 2.10877 0 9.87584e-05
9 3800 2.1011 0 9.88773e-05 2.11267 0
9 3900 2.09817 0 9.89847e-05
9 4000 2.10695 0 9.90818e-05 2.07527 0
9 4100 2.09598 0 9.91696e-05
10 4200 2.09201 0 9.9249e-05 2.09834 0
10 4300 2.08747 0 9.93207e-05
10 4400 2.09943 0 9.93856e-05
10 4500 2.11838 0 9.94443e-05
11 4600 2.09862 0 9.94973e-05 2.10762 0
11 4700 2.11332 0 9.95453e-05
11 4800 2.10901 0 9.95887e-05
11 4900 2.10108 0 9.96279e-05
12 5000 2.1099 0 9.96634e-05 2.09164 0 2.0996 0
enlarging datasets
Training curriculum has finished. Terminating the training process.
5000 iter, 12 epoch / 200 epochs
Please help me solve this.
I still get the same error when I run the following command:
python3 chainer/train_text_recognition.py /root/small_dataset_4/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map /root/small_dataset_4/ctc_char_map.json --gpu 0 --snapshot-interval 20000 --dropout-ratio 0.2 --epoch 200 -lr 0.0001
total [##############################....................] 60.78%
this epoch [#######...........................................] 15.62%
5000 iter, 12 epoch / 20 epochs
0.114 iters/sec. Estimated time to finish: 7:51:40.866016.
enlarging datasets
Training curriculum has finished. Terminating the training process.
Please try to set min_delta to 1e-8 and try again.
@Bartzi I set min_delta = 1e-8 in https://github.com/Bartzi/see/blob/2014359a1489edbbb78f24ddce89383e0078545f/chainer/train_text_recognition.py#L82, but the same issue came up.
The bboxes look as follows: the model is unable to localize alphanumeric characters properly, so how can I achieve more accuracy?
python3 chainer/train_text_recognition.py /root/small_dataset_4/curriculum.json log --blank-label 0 --batch-size 16 --is-trainer-snapshot --use-dropout --char-map /root/small_dataset_4/ctc_char_map.json --gpu 0 --snapshot-interval 20000 --dropout-ratio 0.2 --epoch 200 -lr 0.0001
How many epochs did you train? Did it run for 200 epochs? Can you see any loss improvement?
How large is your dataset?
You could try to increase the batch size, leave out --is-trainer-snapshot (you only need this if you want to load a previously trained model), and not use --use-dropout.
I ran it for 10 epochs and I see loss improvement, but not a very large one. My dataset contains 4603 images. Yes, I tried it with a batch size of 64.
It takes some time until things start to get better. Training a model using our approach does not work like training a model on ImageNet or something. The loss takes some time to decrease because, first, one model needs to improve its predictions and the other has to keep up with that.
Let it train until nothing happens anymore. Once you did that, you should throw away the model you got for the recognition part and restart the training, initializing the localization network with the saved params and randomly initializing the recognition model. You can do this over and over again, until even that does not help anymore.
It might also be that the size of your train dataset is not large enough, but I'm not too sure about this.
So approximately what dataset size, number of epochs, and batch size are required to achieve higher accuracy?
Also, is it okay to duplicate samples to enlarge the dataset for higher accuracy? Please help.
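On duplication: exact copies add no new information, so light augmentation of the cropped text lines is usually a better way to grow a small dataset. A numpy-only sketch (augment is a hypothetical helper, not part of the repo):

```python
# Sketch: grow a small dataset with simple photometric/geometric jitter
# instead of duplicating images verbatim.
import numpy as np

def augment(image, rng):
    """Return a randomly brightened/darkened, slightly shifted copy."""
    out = image.astype(np.float32) * rng.uniform(0.7, 1.3)  # brightness jitter
    shift = rng.integers(-2, 3)                             # small horizontal shift
    out = np.roll(out, shift, axis=1)
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((48, 160), 128, dtype=np.uint8)  # dummy grey text-line crop
copies = [augment(image, rng) for _ in range(4)]  # four distinct variants
```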
At 100 epochs the same issue, Training curriculum has finished. Terminating the training process, is still coming:
55 5000 4.00388 0 9.96634e-09 3.96396 0 3.96532 0
total [##################################................] 69.23%
this epoch [###################...............................] 38.25%
5000 iter, 55 epoch / 80 epochs
0.097178 iters/sec. Estimated time to finish: 6:21:10.341421.
enlarging datasets
Training curriculum has finished. Terminating the training process.
Hi, I still have not achieved accuracy. The training iterations are as follows:
python3 chainer/train_text_recognition.py /data/small_dataset_3/curriculum.json log --blank-label 0 -b 256 --is-trainer-snapshot --char-map /data/small_dataset_3/ctc_char_map.json -g 0 -si 1000 -dr 0.2 -e 200 -lr 1e-8 --zoom 0.9 --area-factor 0.0 --area-scale-factor 2 --load-localization
/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
/home/qgate/.local/lib/python3.6/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:151: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
epoch iteration main/loss main/accuracy lr fast_validation/main/loss fast_validation/main/accuracy validation/main/loss validation/main/accuracy
3 100 3.97704 0 3.08566e-09 3.98018 0
7 200 3.97642 0 4.25853e-09 3.97597 0
11 300 3.97557 0 5.09208e-09 3.97559 0
15 400 3.97502 0 5.74294e-09 3.97463 0
19 500 3.97436 0 6.27392e-09 3.9742 0
23 600 3.97374 0 6.71828e-09 3.97332 0
27 700 3.97294 0 7.0964e-09 3.97258 0
31 800 3.97227 0 7.42193e-09 3.97214 0
35 900 3.97166 0 7.70463e-09 3.97143 0
38 1000 3.9709 0 7.95176e-09 3.97111 0 3.97076 0
42 1100 3.97021 0 8.16892e-09 3.97022 0
46 1200 3.96958 0 8.36054e-09 3.96959 0
50 1300 3.96892 0 8.53021e-09 3.96887 0
54 1400 3.96808 0 8.68087e-09 3.96827 0
58 1500 3.96746 0 8.81497e-09 3.96742 0
62 1600 3.9669 0 8.93457e-09 3.96659 0
66 1700 3.96627 0 9.04141e-09 3.96625 0
70 1800 3.96535 0 9.13701e-09 3.96522 0
73 1900 3.96485 0 9.22265e-09 3.9648 0
77 2000 3.96412 0 9.29946e-09 3.96353 0 3.96405 0
81 2100 3.96332 0 9.36842e-09 3.96351 0
85 2200 3.96274 0 9.43037e-09 3.96245 0
89 2300 3.96205 0 9.48608e-09 3.96194 0
93 2400 3.9614 0 9.5362e-09 3.96123 0
97 2500 3.96079 0 9.58132e-09 3.96064 0
101 2600 3.96003 0 9.62197e-09 3.95995 0
105 2700 3.95936 0 9.6586e-09 3.95916 0
108 2800 3.95863 0 9.69162e-09 3.95848 0
112 2900 3.95802 0 9.7214e-09 3.95783 0
116 3000 3.95718 0 9.74827e-09 3.9568 0 3.95723 0
120 3100 3.95666 0 9.77252e-09 3.95674 0
124 3200 3.956 0 9.7944e-09 3.95567 0
128 3300 3.95525 0 9.81416e-09 3.95523 0
132 3400 3.95461 0 9.83201e-09 3.95433 0
136 3500 3.95395 0 9.84812e-09 3.95381 0
140 3600 3.95334 0 9.86268e-09 3.95297 0
143 3700 3.95254 0 9.87584e-09 3.95269 0
147 3800 3.95202 0 9.88773e-09 3.95186 0
151 3900 3.95123 0 9.89847e-09 3.95105 0
155 4000 3.95051 0 9.90818e-09 3.95014 0 3.95054 0
159 4100 3.95005 0 9.91696e-09 3.95003 0
163 4200 3.94908 0 9.9249e-09 3.94928 0
167 4300 3.94875 0 9.93207e-09 3.94827 0
171 4400 3.94788 0 9.93856e-09 3.94763 0
175 4500 3.94719 0 9.94443e-09 3.94691 0
178 4600 3.9466 0 9.94973e-09 3.94633 0
182 4700 3.94588 0 9.95453e-09 3.94563 0
186 4800 3.94515 0 9.95887e-09 3.94513 0
total [##############################################....] 93.50%
this epoch [#################################################.] 99.16%
4807 iter, 186 epoch / 200 epochs
Did you have a look at the predictions of the model on a sample image (the images in the bboxes folder in the log dir)? What happens there over the course of the training?
I also think that your learning rate is way too low. You should use values like 1e-4 and 1e-5.
@Bartzi I already tried with 1e-4 and 1e-5, but still did not achieve good accuracy.
Did you look at the predictions (my first point in the last answer)? Those images are meant as a help to determine what the network does over time. This really helps to debug problems. You can also create an animation out of those image files with the create_video.py script in the utils folder.
But still, 1e-4 and 1e-5 are the learning rates to use! (Maybe also 1e-6.)
@harshalcse, did you have any success extracting text from images with things like black-on-black text?
Is train_text_recognition.py used only for building a completely new model from scratch, or also for building an extended model from the existing SVHN-trained or FSNS-pretrained models?