janzd opened this issue 5 years ago
Does it mean that when creating the gt file with --max-words 3, you should also use the --min-words 3 and create a gt file with only images containing exactly 3 words?
Yes that is what I meant =)
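So, using the same command format as in your first post, the 3-word file could be created with something along these lines (the output file name is just an example):

```
python transform_gt.py data/train_word-sep_swap.csv fsns_char_map.json data/train_word-sep_swap_max3.csv --max-words 3 --min-words 3 --blank-label 0
```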
P.S. When the blank label is defined as 0, are spaces dividing words supposed to be 133 or 0 as well?
There should not be any spaces dividing single words anymore (if you are referring to space as a character), as the transform_gt.py script converts the per-line ground truth to a word-based ground truth. If you are referring to the blank label used as padding, then I think it should be zero if you set the blank-label to 0.
Thanks! I understand it now.
~~There seems to be an issue that confused me about the blank label and space characters. When I transform the original ground truth using the transform_gt.py script and the provided fsns_char_map.json, the whitespace characters in the original ground truth file get mapped to the character ␢, so the text line doesn't get split into words by the split() function and the whole text line is treated as one word where spaces are replaced by ␢, for example Chemin␢de␢la␢Planho. I wonder if it's caused by character encoding differences between environments. I use Ubuntu 16.04 with the default language set to English. When I change "0": 9250 in the character map file to "0": 32, it gets correctly mapped to a whitespace character. According to the ASCII tables, 9250 is ␢ and 32 is a space (http://ascii-table.com/info.php?u=x2422, http://ascii-table.com/info.php?u=x20). Does your environment really map 9250 in the char map file to a whitespace?~~

Okay, I get it now. In the "preparing the dataset" part of the readme file there are steps 1 to 4, so I did them in that order, but you're supposed to do step 4 (swapping class 0 and 133 in the original ground truth file) before you do step 3. After swapping the classes, you use transform_gt.py to transform the new ground truth file into a word-based ground truth file where each word has 21 labels and 0 is used as padding. You also have to specify that the blank label is 0, because the default value in transform_gt.py is 133. After that you get, for example, this result:

Image ground truth mapped to a text line:
Chemin de la Planho␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢

Text line split into words:
['Chemin', 'de', 'la', 'Planho']

Words converted into character labels and padded with 0:
[['30', '38', '5', '28', '6', '7', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['23', '5', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['1', '20', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['47', '1', '20', '7', '38', '12', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']]
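Just to spell out what that transformation does conceptually, here is a rough sketch (this is not the actual transform_gt.py code; the function name and details are mine):

```python
import json

def words_to_padded_labels(text_line, char_map_path, max_chars=21, blank_label=0):
    """Rough sketch: split a line into words, map each character to its class id
    via the char map (which maps str(class id) -> unicode codepoint), and pad
    every word with the blank label up to max_chars labels."""
    with open(char_map_path) as f:
        char_map = json.load(f)                      # {"<class id>": <codepoint>, ...}
    codepoint_to_class = {v: int(k) for k, v in char_map.items()}

    padded_words = []
    for word in text_line.split():                   # whitespace separates words
        labels = [codepoint_to_class[ord(c)] for c in word]
        labels += [blank_label] * (max_chars - len(labels))
        padded_words.append(labels)
    return padded_words
```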
I wrote this after figuring out why it didn't work, so that anyone who faces the same problem can refer to this issue if they find it. It might help to swap points 3 and 4 in the corresponding readme part to avoid confusion.
One more question. Does the argument --timesteps, which defaults to 3, do anything when I use a curriculum where I have, let's say, paths to ground truth files containing 2 to 5 words? After reading the code, it seemed to me that the argument is unused and that the current number of timesteps is updated based on the number in the header of the ground truth file. Is that correct?
It might help to swap points 3 and 4 in the corresponding readme part to avoid confusion.
Thanks for the hint! I changed it in the README.
Yes, I think the argument --timesteps is not used anymore. Maybe I should remove this, too...
Thanks for updating the README.
I've found another likely issue in the code. In baby_step_curriculum.py, you use a deque with the maximum length set to 5 to store the validation losses, so the logical behavior would be to store the 5 most recent losses to examine whether the loss has become stagnant. The deque automatically removes the leftmost element (that is, the oldest one if you use append()) once it reaches the maximum length; however, you use pop() to get the loss value to compare with the rest of the values, and pop() removes the rightmost element, that is, the newest element in the queue. That means that only the rightmost element ever changes and you compare it with the first four loss values that were stored in the deque. And that means that the condition (loss has stagnated) will never be true, because you compare the most recent validation loss with the validation losses from the beginning of the training; the calculated deltas will get bigger and bigger as the training proceeds.
You should probably use something like:
reference_value = self.queue[self.maxlen-1]
instead of https://github.com/Bartzi/see/blob/master/chainer/utils/baby_step_curriculum.py#L45
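To illustrate the difference with a simplified sketch (the threshold logic here is only an assumption, not the exact check from baby_step_curriculum.py):

```python
from collections import deque

queue = deque(maxlen=5)  # the five most recent validation losses

def loss_stagnated(queue, threshold=0.01):
    # Peek at the newest loss instead of calling queue.pop(), so the value stays
    # in the deque and the window keeps sliding as new losses are appended.
    if len(queue) < queue.maxlen:
        return False
    reference_value = queue[-1]  # equivalent to queue[maxlen - 1] when full
    deltas = [abs(reference_value - older) for older in list(queue)[:-1]]
    return all(delta < threshold for delta in deltas)
```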
So the code as it is can never get to the point where the dataset gets enlarged, which brings me to another point. After I fixed the above issue and the dataset got enlarged, I ran into an error, and because the error message didn't get displayed correctly on the system I used, it took me a while to find out that a KeyError gets raised when you try to enlarge the dataset to 5-word labels, because there is an array of four manually set loss weights in loss_metrics.py that depends on the number of detected words. I get that you only used instances with 4 words at most, as there are not many 5- and 6-word instances, but maybe you could mention in the README that the code as it is cannot be used with more than 4-word labels and people have to add additional loss weights if they want to do so.
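Purely as an illustration (the real structure and values in loss_metrics.py are different), the situation is roughly:

```python
# Made-up values: loss weights indexed by the number of words in the label.
loss_weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
# Without an additional entry, enlarging the curriculum to 5-word labels fails:
# loss_weights[5]  -> KeyError: 5
loss_weights[5] = 1.0  # a fifth weight has to be added first
```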
In addition, when the training process uses up the files in the curriculum and there isn't any file one level higher, a KeyError exception is raised, which ends the training. It's probably just a matter of preference, but I think it's better to end the training with an if-condition rather than throwing an error to end it. When the exception gets raised, it looks as if something has gone wrong, even though it just means that the training has finished.
Thanks for your feedback!
I've found another likely issue in the code.
Yes, you are right. This definitely is an issue in the code (it also explains why it did not work for me as expected)! I was thinking pop would give me the left element (as it happens with a normal list)... You said you fixed this, would you like to contribute your fix? Btw. I either used the increasedifficuly command (you can just enter this while the training is running), or I first trained only one step and then used the trained model as a base for the second step.
but maybe you could mention in the README that the code as it is cannot be used with more than 4-word labels and people have to add additional loss weights if they want to do so.
Good point! I totally forgot about that! I will add this to the README.
when the training process uses up the files in the curriculum and there isn't any file one level higher, a KeyError exception is raised, which ends the training
Interesting, that should not happen. I thought that this line prevents such things from happening. Where exactly is this error thrown?
Yes, you are right. This definitely is an issue in the code (it also explains why it did not work for me as expected)! I was thinking pop would give me the left element (as it happens with a normal list)... You said you fixed this, would you like to contribute your fix? Btw. I either used the increasedifficuly command (you can just enter this while the training is running), or I first trained only one step and then used the trained model as a base for the second step.
Actually, reusing the model and training it step by step, or maybe using a for-loop to run the trainer several times, always with a newly loaded dataset, sounds like an easier approach to me than switching the iterators on the fly while training is running.
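Roughly something like this (just a sketch; the names here are placeholders for the actual setup code):

```python
# Run the trainer once per curriculum level and keep training the same model,
# instead of swapping iterators inside a single running trainer.
for level in range(number_of_levels):                              # placeholder
    train_data, validation_data = load_dataset(level)              # placeholder
    trainer = build_trainer(model, train_data, validation_data)    # placeholder
    trainer.run()
```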
If I'm not mistaken, it was really just that one line I mentioned, but I'll send a PR.
Interesting, that should not happen. I thought that this line prevents such things from happening. Where exactly is this error thrown?
It threw this when trying to load a new file:
Exception in main training loop:
Traceback (most recent call last):
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
entry.extension(self)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 86, in __call__
self.enlarge_dataset(trainer)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 102, in enlarge_dataset
raise StopIteration
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 99, in enlarge_dataset
train_dataset, validation_dataset = self.load_dataset(self.current_level)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 41, in load_dataset
train_dataset = self.dataset_class(self.train_curriculum[level], **self.dataset_args)
KeyError: 5
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_fsns.py", line 306, in <module>
trainer.run()
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
six.reraise(*sys.exc_info())
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
entry.extension(self)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 86, in __call__
self.enlarge_dataset(trainer)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 102, in enlarge_dataset
raise StopIteration
StopIteration
I'll see if I can figure out a better way to stop it.
Btw. could you tell me what batch size (per GPU or total) and learning rate you used to train your model on the FSNS dataset?
I'll see if I can figure out a better way to stop it.
Yeah, the code raises StopIteration to signal the end. You could just add a try/except block around the call that starts the training; this should help you make a clean exit.
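For example, something like this in the training script (just a sketch):

```python
# Catch the StopIteration that the curriculum raises once there is no higher
# level left and treat it as a normal end of training.
try:
    trainer.run()
except StopIteration:
    print('Curriculum exhausted, training finished.')
```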
Btw. could you tell me what batch size (per GPU or total) and learning rate you used to train your model on the FSNS dataset?
Sure, I used a batch size of 20 per GPU and a start learning rate of 1e-6. You can also see those values if you have a look at the log file. The first element of the json file contains all configuration parameters.
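E.g., roughly like this (the log file lives in the training output directory; the exact key names depend on the arguments used):

```python
import json

# The log written during training is a JSON list; its first element holds the
# configuration parameters of the run.
with open('log') as handle:
    config = json.load(handle)[0]
print(config)  # batch size, learning rate, etc.
```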
Sure, I used a batch size of 20 per GPU and a start learning rate of 1e-6. You can also see those values if you have a look at the log file. The first element of the json file contains all configuration parameters.
Thanks! I didn't realize I can find it in the log file of the pretrained model.
Before I submit the PR, I want to ask about this method in baby_step_curriculum.py.
```python
@staticmethod
def split_dataset(dataset):
    gpu_datasets = split_dataset_random(dataset, len(dataset) // 2)
    #gpu_datasets = split_dataset_n_random(dataset, len(self.gpus))
    if not len(gpu_datasets[0]) == len(gpu_datasets[1]):
        adapted_second_split = split_dataset(gpu_datasets[1], len(gpu_datasets[0]))[0]
        gpu_datasets = (gpu_datasets[0], adapted_second_split)
    return gpu_datasets
```
What is the reason for the method to be static? I think that when you split the dataset here, you should use split_dataset_n_random, because if you use, let's say, 4 GPUs, then the dataset gets split into four in train_fsns.py and four iterators are created, but if you split the new dataset in two, then only two iterators are created. Besides that, the data from the other two iterators from the previous step gets lost. I'd use the commented-out line instead, but I'd need self, so I'd have to remove the static method decorator.
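Roughly what I have in mind (only a sketch, not the final PR; BabyStepCurriculum stands for whatever the curriculum class is actually called):

```python
from chainer.datasets import split_dataset, split_dataset_n_random

class BabyStepCurriculum:
    # ... rest of the class unchanged ...

    def split_dataset(self, dataset):
        # One split per GPU; trim the last split so all splits have the same
        # length, mirroring the splitting done in train_fsns.py.
        gpu_datasets = split_dataset_n_random(dataset, len(self.gpus))
        if not len(gpu_datasets[0]) == len(gpu_datasets[-1]):
            adapted_last_split = split_dataset(gpu_datasets[-1], len(gpu_datasets[0]))[0]
            gpu_datasets = list(gpu_datasets[:-1]) + [adapted_last_split]
        return gpu_datasets
```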
Wow, good catch. You are totally right. This way of splitting is definitely not a good idea. I think the code is still like this because I never really used the curriculum with more than two GPUs at the same time, or I just did not notice such a problem. In the train_fsns script, the splitting is done correctly (see here).

In the code snippet you show here, you'll also need to be careful to check whether the last split has the same length as the first split, as the second split is not necessarily the last split; it might even happen that there is only one split, if split with len(self.gpus).

The method is static because it does not need to access self (that is the only reason). If you add your line, you can remove the static method decorator; it should not break anything.
In the code snippet you show here, you'll also need to be careful to check whether the last split has the same length as the first split, as the second split is not necessarily the last split; it might even happen that there is only one split, if split with len(self.gpus).
I can check the length of the splits the same way as you do in train_fsns.py, right?

There should never be only one split, because split_dataset() is only called if the condition len(train_iterators) > 1 is met, and train_iterators matches the number of GPUs in train_fsns.py.
The method is static because it does not need to access self (that is the only reason). If you add your line, you can remove the static method decorator; it should not break anything.
Okay. I'll remove the decorator.
I'll check whether I haven't broken anything and then submit the PR.
I can check the length of the splits the same way as you do in train_fsns.py, right?
Exactly :smile:
Thanks for having a look at the code and playing around with it! Highly appreciate that.
Thanks for having a look at the code and playing around with it! Highly appreciate that.
You're welcome. I might use it for my own research, so I'm just doing it in my own interest :)
Btw. I wanted to check if the other train files besides train_fsns.py work too after my update, but I ran into an error when running SVHN or textrec. I run into the error even when I use the original code without my changes.
Process _Worker-1:
Traceback (most recent call last):
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 65, in run
gg = gather_grads(self.model)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 322, in gather_grads
return _gather(link, "grad")
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 289, in _gather
size, num = size_num_grads(link)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 251, in size_num_grads
if param.size == 0:
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/variable.py", line 666, in size
return self.data.size
AttributeError: 'NoneType' object has no attribute 'size'
Traceback (most recent call last):
File "train_text_recognition.py", line 297, in <module>
trainer.run()
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
six.reraise(*sys.exc_info())
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 206, in update_core
loss = _calc_loss(self._master, batch)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in _calc_loss
return model(*in_arrays)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see_temp2/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
self.y = self.predictor(*x)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see_temp2/see/chainer/models/text_recognition.py", line 75, in __call__
h = self.localization_net(images)
File "/home/kurapan/Code/3rdparty/repos/scene_text/see_temp2/see/chainer/models/ic_stn.py", line 80, in __call__
h = self.bn0(self.conv0(images))
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/links/connection/convolution_2d.py", line 156, in __call__
x, self.W, self.b, self.stride, self.pad)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/functions/connection/convolution_2d.py", line 467, in convolution_2d
y, = fnode.apply(args)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
outputs = self.forward(in_data)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/function_node.py", line 337, in forward
return self.forward_gpu(inputs)
File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/functions/connection/convolution_2d.py", line 158, in forward_gpu
handle = cudnn.get_handle()
File "cupy/cudnn.pyx", line 24, in cupy.cudnn.get_handle
File "cupy/cudnn.pyx", line 32, in cupy.cudnn.get_handle
File "cupy/cuda/cudnn.pyx", line 463, in cupy.cuda.cudnn.create
File "cupy/cuda/cudnn.pyx", line 444, in cupy.cuda.cudnn.check_status
cupy.cuda.cudnn.CuDNNError: CUDNN_STATUS_INTERNAL_ERROR: b'CUDNN_STATUS_INTERNAL_ERROR'
I don't really know how to solve it because I have no clue what the cause might be. I downloaded the corresponding data, created a curriculum file, and tried to run it. As for train_fsns.py, it runs without a problem. Would it be possible for you to check whether the other files run too if I send a PR?
Hmm, interesting error. I can think of two different causes:

- the conv0 layer is not correct (this is actually the more reasonable explanation)

But I can also check this once you open the PR =)
Hi! I have a question. Did you actually use the full validation when training the model? I experience a weird bug caused by full validation. When I set the conditions so that the dataset gets enlarged before the first epoch ends, the training progresses without any issue, but when I try to enlarge the dataset in epoch 2 or later, the training process freezes after the dataset gets enlarged. I checked that the code responsible for enlarging gets fully executed, so the issue must occur somewhere in Chainer trainer code. And as it just freezes, it doesn't produce any error traceback that could be used to debug it. Have you experienced something like that?
Honestly, I can't remember; that is too long ago. I think I always used the fast validator during training because the other one just needed too much time. It could be that the Chainer validator that runs the whole validation set gets stuck in an infinite loop, because the iterator is set to repeat, which it shouldn't be (see this line).
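For reference, an evaluation iterator in Chainer is usually set up so it makes a single pass (dataset and batch size here are placeholders):

```python
from chainer import iterators

# repeat=False makes the iterator stop after one pass over the validation set;
# with repeat=True an Evaluator would loop forever.
validation_iterator = iterators.SerialIterator(
    validation_dataset, batch_size, repeat=False, shuffle=False)
```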
Other than that, I don't know what the problem might be. Do you see any GPU usage when it is 'stuck'?
I see. Thanks. I guess I'll drop the full validation then. I've tried playing a bit with the iterator but it always gets stuck somewhere. There seemed to be some GPU usage but I haven't checked that thoroughly. Maybe I'll check it again later.
Hi, I have a question about the ground truth files for FSNS curriculum learning. I've created the ground truth file with images containing up to 2 words using this command: python transform_gt.py data/train_word-sep_swap.csv fsns_char_map.json data/train_word-sep_swap_max2.csv --max-words 2 --blank-label 0. I'm not sure if I understand the instructions for the other ground truth files with more words correctly. Specifically this part:

Repeat this step with 3 and 4 words (you can also take 5 and 6, too), but make sure to only include images with the corresponding amount of words (--min-words is the flag to use)

Does it mean that when creating the gt file with --max-words 3, you should also use --min-words 3 and create a gt file with only images containing exactly 3 words? Will that then train the network only on the 3-word images in that curriculum stage? Thanks in advance.
P.S. When the blank label is defined as 0, are spaces dividing words supposed to be 133 or 0 as well?