about trainging details?

backtime92 commented 5 years ago

1、what is the size of input image? 2、Did you adjust the lr during the training? 3、What data augment methods are used in the data preprocessing process? 4、when I train SynthText dataset that it's very slow because of the huge number of images, are there some advice can give me about how to accelerate the training process. 5 、How long have you trained the model and how many gpus did you use?

Godricly commented 5 years ago

Also where does the OHEM takes place?

YoungminBaek commented 5 years ago

@backtime92 1、what is the size of input image?

All images for training were resized to 768x768. An augmentation that enlarges the image using random crop is applied.

2、Did you adjust the lr during the training?

Yes, we did. In our experience, lr decay has a huge impact on learning.

3、What data augment methods are used in the data preprocessing process?

Nothing special, Basic image augmentations are used such as random crop, color variation, and image rotation. Note that random cropping can cause some characters in the text instance to go out of the screen, but it does not affect performance much, so you do not have to check the boundary condition.

4、when I train SynthText dataset that it's very slow because of the huge number of images, are there some advice can give me about how to accelerate the training process.

Actually, we think training with SynthText is not that slow since we do not need to make pseudo ground truth. When it comes to the real image, it becomes very slow if you sequentially generate pseudo-GT and training with them. So we configured a dedicated GPU to obtain pseudoGT in parallel.

5 、How long have you trained the model and how many gpus did you use?

I used 4 GPUs, and one of them is in charge of making pseudo-GTs. In our case, it takes 2~3 days for 30k iterations.

@Godricly

Also where does the OHEM takes place?

After the loss value for each pixel is calculated, OHEM selects the pixel with high loss in the negative pixels (region where the region score is smaller than 0.1).

YanShuang17 commented 5 years ago

@YoungminBaek Could you share the lr details? e.g. base_lr and lr_decay_scheduler ? I set the base lr is 1e-3 and multiply 0.1 every epoch in SynthText，But the performance is not pretty good...Thanks!

Godricly commented 5 years ago

@YoungminBaek Thank you for your reply. May I ask you about the input size of word patch? What is the height of each patch? Did you set any limit of patch width (Either the max length for long textline or the min length for short words which may suffer from label error)?

backtime92 commented 5 years ago

@YoungminBaek Thank you for your information. Could you share your lr change details and batch size?

backtime92 commented 5 years ago

@YoungminBaek How many high loss pixel in the negative pixels are selected? Or selects all high loss pixel in the negative pixels (region where the region score is smaller than 0.1) and set ratio is 1:3?

backtime92 commented 5 years ago

@YoungminBaek I select all positive pixels and triple negative high loss pixels to train the model, but I found that loss are very high about 4000 when I train for 8000 iteartions. does it normal? Could you share you lr details, initial lr and which epoch should change lr?

YoungminBaek commented 5 years ago

@YanShuang17 In my experiments, the initial lr is 1e-4, and multiply 0.8 for every 10k iterations. However, in my opinion, the lr parameter was not sensitive when training. Rather, please make sure that the character bounding boxes are well found when weak-supervision.

FYI, the H-mean of IC13 was up to 0.75 after training on SynthText.

@Godricly The width of a word patch size is varied, and the height is fixed to 64. And, I did not set the limit of patch width since there was no problem with long-length word patch.

@backtime92 The batch size depends on the number of GPUS that you are using. I fed in 8 images to 1 GPU. As you mentioned, the positive pixels are regions where the region score GT is larger than 0.1, and pos-neg ratio is 1:3. And the loss is summed and divided by the number of pos, neg pixels each. I think the omitting of a division operation is the reason for the high loss in your experiment.

Godricly commented 5 years ago

@YoungminBaek Thank you for the reply. Maybe I didn't make it clean. The long word case is an efficiency concern while the short word case is a performance concern. Have you ever try to replace the word-level forwarding with image-level forwarding to boost the speed? did you ignore short words with 1 or 2 chars or the size close 2 64X64 when cropped out? since the labelling is buggy in some ICDAR datasets.

YanShuang17 commented 5 years ago

Thanks for your reply so much @YoungminBaek

sonack commented 5 years ago

@YanShuang17 In my experiments, the initial lr is 1e-4, and multiply 0.8 for every 10k iterations. However, in my opinion, the lr parameter was not sensitive when training. Rather, please make sure that the character bounding boxes are well found when weak-supervision.

FYI, the H-mean of IC13 was up to 0.75 after training on SynthText.

@Godricly The width of a word patch size is varied, and the height is fixed to 64. And, I did not set the limit of patch width since there was no problem with long-length word patch.

@backtime92 The batch size depends on the number of GPUS that you are using. I fed in 8 images to 1 GPU. As you mentioned, the positive pixels are regions where the region score GT is larger than 0.1, and pos-neg ratio is 1:3. And the loss is summed and divided by the number of pos, neg pixels each. I think the omitting of a division operation is the reason for the high loss in your experiment. @YoungminBaek Have you used the synced-BN for training?

sonack commented 5 years ago

@YanShuang17 In my experiments, the initial lr is 1e-4, and multiply 0.8 for every 10k iterations. However, in my opinion, the lr parameter was not sensitive when training. Rather, please make sure that the character bounding boxes are well found when weak-supervision.

FYI, the H-mean of IC13 was up to 0.75 after training on SynthText.

@Godricly The width of a word patch size is varied, and the height is fixed to 64. And, I did not set the limit of patch width since there was no problem with long-length word patch.

@backtime92 The batch size depends on the number of GPUS that you are using. I fed in 8 images to 1 GPU. As you mentioned, the positive pixels are regions where the region score GT is larger than 0.1, and pos-neg ratio is 1:3. And the loss is summed and divided by the number of pos, neg pixels each. I think the omitting of a division operation is the reason for the high loss in your experiment.

@YoungminBaek What kind of loss is used in OHEM judgement? Only region score loss? Or the full loss as below?

brooklyn1900 commented 5 years ago

Could you please share your training code with me ? @backtime92 I really need it , Thanks a lot!

YoungminBaek commented 5 years ago

@Godricly I have tried image-level forwarding too. Interesting you came up with the same idea as mine! I found that it will be a good option to use both of word- and image- forwarding and choose the character bounding box with low length error. Definitely, this is not for boosting the speed, but for performance. To boost up the learning time, I applied multi-processing techniques and set a single GPU is dedicated to generating the psuedo-character boxes.

I did not ignore short words. But as you mentioned, ICDAR datasets are buggy and have no rules for annotating vertical texts. So I consider the text is vertically written if the height of the word bounding box is longer than twice of its width in MLT dataset.

YoungminBaek commented 5 years ago

@sonack I did not use synchronized-bn for this work since the operator was not supported in pytorch at that time. Recently pytorch provide the official functionality of the SyncBatchNorm, and this is definitely helpful for training. https://pytorch.org/docs/master/nn.html#torch.nn.SyncBatchNorm

For OHEM judgment, MSE loss was used as well. Since the number of positive samples in region and affinity scores were different, the OHEM was applied separately to region and affinity scores.

backtime92 commented 5 years ago

@brooklyn1900 I will release the syndata training code as soon as possible.

brooklyn1900 commented 5 years ago

@backtime92 Thanks a lot!

Godricly commented 5 years ago

@YoungminBaek How is the performance gain for that? Can you provide some hints about the training data flow? How do you share models between different GPUs? If you used GPU 0 to generate bounding box, how did you share them with other devices? My guess is DDP module in pytorch, but I'm familiar with that. With random crop for each image, things are getting even working when handling word patched cut by the cropping window.

brooklyn1900 commented 5 years ago

I got region scores and affinity scores， so what is the loss function ? thanks !

backtime92 commented 5 years ago

@brooklyn1900 MSE loss， I am not clearly what you want to ask.

ThisIsIsaac commented 5 years ago

@backtime92 would love to hear your update on the training code!

ThisIsIsaac commented 5 years ago

@backtime92 do we add loss from region score and affinity score unweighted?

SealQ commented 4 years ago

@backtime92 @ThisIsIsaac Hello, can you send a copy of the training code? I want to try model training with my own data set, thanks!

richaagrawa commented 4 years ago

Hello, @YoungminBaek @ClovaAIAdmin @backtime92 @ThisIsIsaac, Can you please send a copy of the training code? Model is working good on my dataset but its missing few numbers so I want to try training the model with my own dataset, Thank you

backtime92 commented 4 years ago

@richaagrawa @SealQ https://github.com/backtime92/CRAFT-Reimplementation

clovaai / CRAFT-pytorch

about trainging details? #18