Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

FSNS curriculum learning gt files #57

Open janzd opened 5 years ago

janzd commented 5 years ago

Hi, I have a question about the ground truth files for FSNS curriculum learning. I've created the ground truth file with images containing up to 2 words using this command: python transform_gt.py data/train_word-sep_swap.csv fsns_char_map.json data/train_word-sep_swap_max2.csv --max-words 2 --blank-label 0. I'm not sure I correctly understand the instructions for the other ground truth files with more words, specifically this part:

Repeat this step with 3 and 4 words (you can also take 5 and 6, too), but make sure to only include images with the corresponding amount of words (--min-words is the flag to use)

Does it mean that when creating the gt file with --max-words 3, you should also use the --min-words 3 and create a gt file with only images containing exactly 3 words? Will the network then be trained only on the 3-word images in that curriculum stage?

Thanks in advance

P.S. When the blank label is defined as 0, are spaces dividing words supposed to be 133 or 0 as well?

Bartzi commented 5 years ago

Does it mean that when creating the gt file with --max-words 3, you should also use the --min-words 3 and create a gt file with only images containing exactly 3 words?

Yes that is what I meant =)
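
For example, for the 3-word file the call would look like python transform_gt.py data/train_word-sep_swap.csv fsns_char_map.json data/train_word-sep_swap_max3.csv --min-words 3 --max-words 3 --blank-label 0 (the output file name is just a suggestion), and analogously with 4 for the 4-word file.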

P.S. When the blank label is defined as 0, are spaces dividing words supposed to be 133 or 0 as well?

There should not be any spaces dividing single words anymore (if you are referring to space as a character), as the transform_gt.py script converts the per-line ground truth to a word-based ground truth. If you are referring to the blank label used as padding, then I think it should be zero if you set the blank-label to 0.

janzd commented 5 years ago

Thanks! I understand it now.

~~There seems to be an issue that confused me about the blank label and space characters. When I transform the original ground truth using the transform_gt.py script and the provided fsns_char_map.json, the whitespace characters in the original ground truth file get mapped to the character ␢, so the text line doesn't get split into words by the split() function and the whole text line is treated as one word where spaces are replaced by ␢, for example Chemin␢de␢la␢Planho. I wonder if it's caused by some character encoding differences between environments. I use Ubuntu 16.04 with the default language set to English. When I change "0": 9250 in the character map file to "0": 32, it gets correctly mapped to a whitespace character. Looking up the character codes, 9250 is ␢ (U+2422) and 32 is the ordinary space character. http://ascii-table.com/info.php?u=x2422 http://ascii-table.com/info.php?u=x20 Does your environment really map 9250 in the char map file to a whitespace?~~

Okay, I get it now. In the "preparing the dataset" part of the readme file, there are steps 1 to 4, so I did them in that order, but you're supposed to do step 4 (swapping class 0 and 133 in the original ground truth file) before you do step 3. After swapping the classes, you use transform_gt.py to transform the new ground truth file into a word-based ground truth file where each word has 21 labels and 0 is used as padding. You also have to specify that the blank label is 0, because the default value in transform_gt.py is 133. After that you get, for example, this result:

Image ground truth mapped to a text line: Chemin de la Planho␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢␢

Text line split into words: ['Chemin', 'de', 'la', 'Planho']

Words converted into character labels and padded with 0: [['30', '38', '5', '28', '6', '7', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['23', '5', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['1', '20', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], ['47', '1', '20', '7', '38', '12', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']]

I wrote this up after figuring out why it didn't work, so that anyone who faces the same problem can refer to this issue. It might help to swap points 3 and 4 in the corresponding readme section to avoid confusion.

janzd commented 5 years ago

One more question. Does the argument --timesteps, which defaults to 3, do anything when I use a curriculum where I have, let's say, paths to ground truth files containing 2 to 5 words? After reading the code, it seemed to me that the argument is unused and that the current number of timesteps is updated based on the number in the header of the ground truth file. Is that correct?

Bartzi commented 5 years ago

It might help to swap points 3 and 4 in the corresponding readme section to avoid confusion.

Thanks for the hint! I changed it in the README.

Yes, I think the argument --timesteps is not used anymore. Maybe I should remove this, too...

janzd commented 5 years ago

Thanks for updating the README.

I've found another likely issue in the code. In baby_step_curriculum.py, you use a deque with a maximum length of 5 to store the validation losses, so the logical behavior would be to keep the 5 most recent losses and examine whether the loss has become stagnant. A deque automatically removes the leftmost element (the oldest one, if you use append()) once it reaches its maximum length; however, you use pop() to get the loss value that is compared with the rest of the values, and pop() removes the rightmost element, that is, the newest element in the queue. That means that only the rightmost element ever changes and you compare it with the first four loss values that were stored in the deque. As a consequence, the condition (loss has stagnated) will never be true, because you compare the most recent validation loss with the validation losses from the beginning of the training; the calculated deltas just get bigger and bigger as the training proceeds. You should probably use something like reference_value = self.queue[self.maxlen-1] instead of https://github.com/Bartzi/see/blob/master/chainer/utils/baby_step_curriculum.py#L45

So the code as it is can never get to the point where the dataset gets enlarged, which brings me to another point. After I fixed the above issue and the dataset got enlarged, I ran into an error, and because the error message didn't get displayed correctly on the system I used, it took me a while to find out that a KeyError gets raised when you try to enlarge the dataset to 5-word labels, because there is an array of four manually set loss weights that depends on the number of detected words in loss_metrics.py. I understand that you only used instances with 4 words at most, as there are not many 5- and 6-word instances, but maybe you could mention in the README that the code as it is cannot be used with more than 4-word labels and people have to add additional loss weights if they want to do so.

In addition, when the training process uses up the files in the curriculum and there isn't any file one level higher, a KeyError exception is raised, which ends the training. It's probably just a matter of preference, but I guess it's better to end the training with an if-condition rather than throwing an error to end it. It looks as if something has gone wrong when the exception gets raised, even though it just means that the training has finished.
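
A tiny self-contained demonstration of the behaviour (plain collections.deque, nothing repo-specific):

```python
from collections import deque

losses = deque(maxlen=5)
for value in [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]:
    losses.append(value)   # once full, append() drops the leftmost (oldest) value

print(losses)              # deque([0.9, 0.8, 0.7, 0.6, 0.5], maxlen=5)
newest = losses[-1]        # peek at the most recent loss without removing it
print(newest)              # 0.5
print(losses.pop())        # 0.5 as well, but pop() also *removes* the newest element
print(losses)              # deque([0.9, 0.8, 0.7, 0.6], maxlen=5)
```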

Bartzi commented 5 years ago

Thanks for your feedback!

I've found another likely issue in the code.

Yes, you are right. This definitely is an issue in the code (it also explains why it did not work for me as expected)! I was thinking pop would give me the left element (as it happens with a normal list)... You said you fixed this, would you like to contribute your fix? Btw. I either used the increasedifficuly command (you can just enter this while the training is running), or I first trained only one step and then used the trained model as a base for the second step.

but maybe you could mention in the README that the code as it is cannot be used with more than 4-word labels and people have to add additional loss weights if they want to do so.

Good point! I totally forgot about that! I will add this to the README.

when the training process uses up the files in the curriculum and there isn't any file one level higher, a KeyError exception is raised, which ends the training

Interesting, that should not happen. I thought that this line prevents such things from happening. Where exactly is this error thrown?

janzd commented 5 years ago

Yes, you are right. This definitely is an issue in the code (it also explains why it did not work for me as expected)! I was thinking pop would give me the left element (as it happens with a normal list)... You said you fixed this, would you like to contribute your fix? Btw. I either used the increasedifficuly command (you can just enter this while the training is running), or I first trained only one step and then used the trained model as a base for the second step.

Actually, reusing the model and training it step by step, or maybe using a for-loop to run the trainer several times, always with a newly loaded dataset, sounds like an easier approach to me than switching the iterators on the fly while training is running. If I'm not mistaken, it was really just that one line I mentioned, but I'll send a PR.
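
Just to make that idea concrete, I mean roughly something like this (load_dataset, build_trainer, the file names, and the snapshot handling are placeholders for illustration, not functions from this repo):

```python
from chainer import serializers

# hypothetical curriculum files, one per stage
gt_files = ['train_word-sep_swap_max2.csv', 'train_word-sep_swap_max3.csv', 'train_word-sep_swap_max4.csv']
previous_snapshot = None

for stage, gt_file in enumerate(gt_files):
    train_dataset, validation_dataset = load_dataset(gt_file)          # placeholder
    model, trainer = build_trainer(train_dataset, validation_dataset)  # placeholder
    if previous_snapshot is not None:
        serializers.load_npz(previous_snapshot, model)  # warm-start from the previous stage
    trainer.run()
    previous_snapshot = 'model_stage_{}.npz'.format(stage)
    serializers.save_npz(previous_snapshot, model)
```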

Interesting, that should not happen. I thought that this line prevents such things from happening. Where exactly is this error thrown?

It threw this when trying to load a new file:

Exception in main training loop: 
Traceback (most recent call last):
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 86, in __call__
    self.enlarge_dataset(trainer)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 102, in enlarge_dataset
    raise StopIteration
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 99, in enlarge_dataset
    train_dataset, validation_dataset = self.load_dataset(self.current_level)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 41, in load_dataset
    train_dataset = self.dataset_class(self.train_curriculum[level], **self.dataset_args)
KeyError: 5

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_fsns.py", line 306, in <module>
    trainer.run()
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 86, in __call__
    self.enlarge_dataset(trainer)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see/chainer/utils/baby_step_curriculum.py", line 102, in enlarge_dataset
    raise StopIteration
StopIteration

I'll see if I can figure out a better way to stop it.

Btw. could you tell me what batch size (per GPU or total) and learning rate you used to train your model on the FSNS dataset?

Bartzi commented 5 years ago

I'll see if I can figure out a better way to stop it.

Yeah, the code raises StopIteration to signal the end. You could just add a try/except block around the call that starts the training; this should help you make a clean exit.
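
Something along these lines around the trainer.run() call in train_fsns.py should do (just a sketch):

```python
try:
    trainer.run()
except StopIteration:
    # raised by the curriculum when there is no higher level left,
    # so this simply means the training is finished
    print('curriculum exhausted, training finished')
```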

Btw. could you tell me what batch size (per GPU or total) and learning rate you used to train your model on the FSNS dataset?

Sure, I used a batch size of 20 per GPU and a start learning rate of 1e-6. You can also see those values if you have a look at the log file; the first element of the JSON file contains all configuration parameters.
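
If you want to grab them programmatically, something like this works, assuming the log is a JSON array as described above (the path is a placeholder for wherever your log file lives):

```python
import json

with open('path/to/log') as log_file:
    config = json.load(log_file)[0]  # the first entry holds the configuration parameters
print(config)
```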

janzd commented 5 years ago

Sure, I used a batch size of 20 per GPU and a start learning rate of 1e-6. You can also see those values if you have a look at the log file; the first element of the JSON file contains all configuration parameters.

Thanks! I didn't realize I could find it in the log file of the pretrained model.

Before I submit the PR, I want to ask about this method in baby_step_curriculum.py.

```python
@staticmethod
def split_dataset(dataset):
    gpu_datasets = split_dataset_random(dataset, len(dataset) // 2)
    # gpu_datasets = split_dataset_n_random(dataset, len(self.gpus))
    if not len(gpu_datasets[0]) == len(gpu_datasets[1]):
        adapted_second_split = split_dataset(gpu_datasets[1], len(gpu_datasets[0]))[0]
        gpu_datasets = (gpu_datasets[0], adapted_second_split)
    return gpu_datasets
```

What is the reason for the method to be static? I think that when you split the dataset here, you should use split_dataset_n_random, because if you use, let's say, 4 GPUs, the dataset gets split into four parts in train_fsns.py and four iterators are created, but if you split the new dataset in two, only two iterators are created. Besides that, the data from the other two iterators from the previous step gets lost. I'd use the commented-out line instead, but I'd need self, so I'd have to remove the static method decorator.

Bartzi commented 5 years ago

Wow, good catch. You are totally right. This way of splitting is definitely not a good idea. I think the code is still like this because I never really used the curriculum with more than two GPUs at the same time, or I just did not notice such a problem. In the train_fsns script, the splitting is done correctly (see here).

In the code snippet you show here, you'll also need to be careful to check whether the last split has the same length as the first split, as the second split is not necessarily the last split; it might even happen that there is only one split if you split with len(self.gpus).

The method is static because it does not need to access self (the only reason). If you add your line, you can remove the static method decorator; it should not break anything.

janzd commented 5 years ago

In the code snippet you show here, you'll also need to be careful to check whether the last split has the same length as the first split, as the second split is not necessarily the last split; it might even happen that there is only one split if you split with len(self.gpus).

I can check the length of the splits the same way as you do in train_fsns.py, right? There should never be only one split, because split_dataset() is only called if the condition if len(train_iterators) > 1 is met, and train_iterators matches the number of GPUs in train_fsns.py.

The method is static because it does not need to access self (the only reason). If you add your line, you can remove the static method decorator; it should not break anything.

Okay. I'll remove the decorator.

I'll check whether I haven't broken anything and then submit the PR.
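
For reference, the change I have in mind looks roughly like this (the class stub is only there to make the snippet self-contained; apart from self.gpus, the attribute names may differ from the repo):

```python
import chainer.datasets


class BabyStepCurriculum(object):  # illustrative stub, only the relevant part is shown

    def __init__(self, gpus):
        self.gpus = gpus

    def split_dataset(self, dataset):
        # one random split per GPU instead of always splitting in two
        gpu_datasets = chainer.datasets.split_dataset_n_random(dataset, len(self.gpus))
        # all splits have to be of equal length (same check as in train_fsns.py),
        # so trim any longer split down to the length of the first one
        target_length = len(gpu_datasets[0])
        gpu_datasets = tuple(
            split if len(split) == target_length
            else chainer.datasets.split_dataset(split, target_length)[0]
            for split in gpu_datasets
        )
        return gpu_datasets
```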

Bartzi commented 5 years ago

I can check the length of the splits the same way as you do in train_fsns.py, right?

Exactly :smile:

Thanks for having a look at the code and playing around with it! Highly appreciate that.

janzd commented 5 years ago

Thanks for having a look at the code and playing around with it! Highly appreciate that.

You're welcome. I might use it for my own research so I just do it in my own interest :)

Btw. I wanted to check if other train files besides train_fsns.py work too after my update, but I ran into an error when running SVHN or textrec. I run into the error even when I use the original code without my changes.

Process _Worker-1:
Traceback (most recent call last):
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 65, in run
    gg = gather_grads(self.model)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 322, in gather_grads
    return _gather(link, "grad")
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 289, in _gather
    size, num = size_num_grads(link)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 251, in size_num_grads
    if param.size == 0:
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/variable.py", line 666, in size
    return self.data.size
AttributeError: 'NoneType' object has no attribute 'size'
Traceback (most recent call last):
  File "train_text_recognition.py", line 297, in <module>
    trainer.run()
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
    update()
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
    self.update_core()
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 206, in update_core
    loss = _calc_loss(self._master, batch)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in _calc_loss
    return model(*in_arrays)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see_temp2/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see_temp2/see/chainer/models/text_recognition.py", line 75, in __call__
    h = self.localization_net(images)
  File "/home/kurapan/Code/3rdparty/repos/scene_text/see_temp2/see/chainer/models/ic_stn.py", line 80, in __call__
    h = self.bn0(self.conv0(images))
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/links/connection/convolution_2d.py", line 156, in __call__
    x, self.W, self.b, self.stride, self.pad)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/functions/connection/convolution_2d.py", line 467, in convolution_2d
    y, = fnode.apply(args)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/function_node.py", line 245, in apply
    outputs = self.forward(in_data)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/function_node.py", line 337, in forward
    return self.forward_gpu(inputs)
  File "/home/kurapan/miniconda3/envs/chainer32/lib/python3.5/site-packages/chainer/functions/connection/convolution_2d.py", line 158, in forward_gpu
    handle = cudnn.get_handle()
  File "cupy/cudnn.pyx", line 24, in cupy.cudnn.get_handle
  File "cupy/cudnn.pyx", line 32, in cupy.cudnn.get_handle
  File "cupy/cuda/cudnn.pyx", line 463, in cupy.cuda.cudnn.create
  File "cupy/cuda/cudnn.pyx", line 444, in cupy.cuda.cudnn.check_status
cupy.cuda.cudnn.CuDNNError: CUDNN_STATUS_INTERNAL_ERROR: b'CUDNN_STATUS_INTERNAL_ERROR'

I don't really know how to solve it because I have no clue what might be the cause. I downloaded the corresponding data, created a curriculum file, and tried to run it. As for train_fsns.py, it runs without a problem. Would it be possible for you to check if the other files run too if I send a PR?

Bartzi commented 5 years ago

Hmm, interesting error. I can think of two different causes:

  1. BatchNorm sometimes has problems when run with CuDNN, but I haven't had this problem in quite a while. It might help to turn off CuDNN for this BatchNorm layer (see the sketch below).
  2. It might be that the number of input channels to the conv0 layer is not correct (this is actually the more reasonable explanation).
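
To quickly rule out the first cause, you could disable cuDNN, either globally or just around the forward pass; a minimal sketch, assuming the Chainer v2+ config API:

```python
import chainer

# globally, e.g. near the top of train_text_recognition.py
chainer.global_config.use_cudnn = 'never'

# or only around a specific forward pass
with chainer.using_config('use_cudnn', 'never'):
    loss = model(images, labels)  # placeholder call, not the actual training code
```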

But I can also check this once you have opened the PR =)

janzd commented 5 years ago

Hi! I have a question. Did you actually use the full validation when training the model? I'm experiencing a weird bug caused by the full validation. When I set the conditions so that the dataset gets enlarged before the first epoch ends, the training progresses without any issue, but when I try to enlarge the dataset in epoch 2 or later, the training process freezes after the dataset gets enlarged. I checked that the code responsible for enlarging the dataset gets fully executed, so the issue must occur somewhere in the Chainer trainer code. And as it just freezes, it doesn't produce any error traceback that could be used to debug it. Have you experienced something like that?

Bartzi commented 5 years ago

Honestly, I can't remember; that is too long ago. I think I always used the fast validator during training because the other one just needed too much time. It could be that the Chainer validator that runs over the whole validation set gets stuck in an infinite loop, because the iterator is set to repeat, which it shouldn't be (see this line).
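
For reference, a non-repeating validation setup in Chainer looks roughly like this (validation_dataset, batch_size, trainer, model, and gpu_id are placeholders, not the repo's actual variable names):

```python
from chainer import iterators
from chainer.training import extensions

validation_iterator = iterators.MultiprocessIterator(
    validation_dataset, batch_size, repeat=False, shuffle=False)
# the Evaluator expects a finite (repeat=False) iterator, otherwise it never terminates
trainer.extend(extensions.Evaluator(validation_iterator, model, device=gpu_id))
```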

Other than that, I don't know what the problem might be. Do you see any GPU usage when it is 'stuck'?

janzd commented 5 years ago

I see. Thanks. I guess I'll drop the full validation then. I've tried playing a bit with the iterator but it always gets stuck somewhere. There seemed to be some GPU usage but I haven't checked that thoroughly. Maybe I'll check it again later.