MichalBusta / E2E-MLT

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
MIT License
291 stars 84 forks

Questions on training the model from scratch #10

Closed alwc closed 5 years ago

alwc commented 5 years ago

First of all, congrats on getting the best paper award at the 3rd International Workshop on Robust Reading and releasing the v2 of the paper on arXiv!

A few days ago I started trying to train the model from scratch, and I've come across some questions:

1/ Why do you use instance norm instead of other normalization methods such as batch norm, group norm, etc.? Have you compared the results of using different normalization methods?

1.b/ How come the InstanceNorm2d layer in conv_dw_in doesn't set affine=True (by default affine=False, which means the layer has no learnable parameters), while the rest of your InstanceNorm2d layers have affine=True?

2/ How long does it take to train the e2e-mlt.h5 model? Was it trained on multiple GPUs?

3/ Have you tried using transfer learning like how argman/EAST uses pretrained resnet-50?

Thanks!

MichalBusta commented 5 years ago

Hi Alex,

On 10/12/2018 04:49, Alex Lee wrote:

> First of all, congrats on getting the best paper award at the 3rd International Workshop on Robust Reading and releasing the v2 of the paper on arXiv!
>
> A few days ago I started trying to train the model from scratch, and I've come across some questions:
>
> 1/ Why do you use instance norm instead of other normalization methods such as batch norm, group norm, etc.? Have you compared the results of using different normalization methods?

Two main reasons:

 - design choice: on the first layer, instance normalization gives (a sort of) color and illumination invariance, which is desirable for text (see the sketch below)

 - hardware: for most experiments I'm using an old gaming machine (~4 GB of GPU memory), so I cannot form a reasonable batch size for training the detector.

I cannot provide an ablation study :( (no free resources for computation).
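
A quick way to see the first point (just an illustration with torch, not code from this repo): instance norm standardizes each image's channels independently, so a global brightness/contrast change is almost entirely removed before the rest of the network sees it.

```python
import torch
import torch.nn as nn

# InstanceNorm2d standardizes each (sample, channel) plane on its own,
# so a per-image illumination change barely affects the output.
norm = nn.InstanceNorm2d(3, affine=False)

x = torch.rand(1, 3, 64, 64)        # an "image"
x_bright = 1.5 * x + 0.3            # same image under different illumination

print(torch.allclose(norm(x), norm(x_bright), atol=1e-3))  # True
```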

> 1.b/ How come the `InstanceNorm2d` layer in [`conv_dw_in`](https://github.com/MichalBusta/E2E-MLT/blob/master/models.py#L71) doesn't set `affine=True` (by default `affine=False`, which means the layer has no learnable parameters), while the rest of your `InstanceNorm2d` layers have `affine=True`?

Just a typo, no reason.
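
For reference, the intended pattern is roughly the sketch below (a hypothetical depthwise-separable block; see models.py for the real definition). With affine=True both norm layers get a learnable scale and shift.

```python
import torch.nn as nn

def conv_dw_in(in_ch, out_ch, stride=1):
    """Hypothetical depthwise-separable conv block with instance norm.

    Only a sketch of the pattern discussed above; the actual block lives
    in models.py. affine=True gives the norm layers learnable parameters.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),  # depthwise
        nn.InstanceNorm2d(in_ch, affine=True),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, 1, 0, bias=False),                    # pointwise
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(inplace=True),
    )
```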

> 2/ How long does it take to train the `e2e-mlt.h5` model? Was it trained on multiple GPUs?

From scratch, ~3 days on a low-end gaming machine.

> 3/ Have you tried using transfer learning like how [argman/EAST](https://github.com/argman/EAST) uses a pretrained resnet-50?

No, we have only done the baseline experiment; we hope that better models will come from the community. (There is a proposal for an end-to-end MLT competition at the ICDAR conference, and this project can provide the baseline method.)
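
For anyone who wants to try that route, a minimal sketch of pulling an ImageNet-pretrained ResNet-50 from torchvision as a feature backbone (not part of this repo; wiring its feature maps into the E2E-MLT detection/OCR heads is left out and would need matching channel counts and strides):

```python
import torch.nn as nn
import torchvision

def resnet50_backbone():
    """Hypothetical backbone: ImageNet-pretrained ResNet-50 without the
    classification head (avgpool/fc dropped), returning conv feature maps."""
    # torchvision >= 0.13 API; older versions use resnet50(pretrained=True)
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    return nn.Sequential(*list(resnet.children())[:-2])  # N x 2048 x H/32 x W/32
```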


mattroos commented 5 years ago

I'm trying to replicate @MichalBusta's results with regard to the detection part of the model. As in the arXiv paper, I'm using ICDAR-2017-MLT only, a batch size of 16, learning_rate=0.0001, and the Adam optimizer, i.e. basically the defaults in train.py (except the batch size). The figure shows bbox_loss (on the training batches) for the first 2000+ batches (ignore the change in color of the markers at 1000). @MichalBusta, does this seem to match expectations? When I initialize from your trained model, the average bbox_loss is about 2.0. Is that also in line with expectations?

[image: training_loss_bbox (bbox loss vs. training batch)]
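
For reference, the invocation is roughly the following (the data path is a placeholder, and I'm assuming the flag names match train.py's argument parser; the learning rate and Adam optimizer are already the defaults, so only the batch size is overridden):

```bash
python train.py \
  -train_list=/path/to/ICDAR2017_MLT/e2e_train_list.txt \
  -batch_size=16
```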
alwc commented 5 years ago

@MichalBusta Ok, I'll let you know if I discover any ways to improve the model.

By the way, I saw that your text detection model has an added attention mechanism, but it was not mentioned in the paper. Can you point me to any papers that your implementation is based on?

MichalBusta commented 5 years ago

Yes, that looks ok.

MichalBusta commented 5 years ago

@alwc

> By the way, I saw that your text detection model has an added attention mechanism, but it was not mentioned in the paper. Can you point me to any papers that your implementation is based on?

No paper; we just ran several models and this one got better numbers on the validation set (but it is kind of a play on the ROC curve: better recall, lower precision...).

mohammedayub44 commented 3 years ago

@alwc @mattroos I know this is an old thread. I'm trying to do something similar with synthetically generated text (just English for now). Could you share your thoughts on your results and any optimizations you made to the network, etc.?

Thanks !

mattroos commented 3 years ago

@mohammedayub44 sorry, but my final results didn't differ notably, and I have been using the network as-is, without further optimizations, although I'd like to try reducing computational costs at some point (e.g., network pruning).
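
As a starting point for that, PyTorch's built-in torch.nn.utils.prune can zero out low-magnitude conv weights (a sketch only; unstructured sparsity by itself does not speed up dense inference, and the model would likely need fine-tuning afterwards):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_convs(model, amount=0.3):
    """L1-unstructured pruning sketch: zero the smallest `amount` fraction
    of weights in every Conv2d layer of `model`."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the mask into the weights
    return model
```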

mohammedayub44 commented 3 years ago

@mattroos Thanks for verifying. How long and roughly how many steps did you have to train to get good results (the bbox loss you attached above)? I have trained from scratch for about 20,000 steps (1.5 days) on synthetic data to get the loss close to what you have. The bbox results are okay-ish but the OCR results are very bad. Any idea why this could happen?

I'm using the following parameters:

python train.py -train_list=/home/ubuntu/mayub/text_detection/SynthText/data/e2e_trainMLT.txt \
-batch_size=8 -num_readers=5 -debug=0 -input_size=512 -ocr_batch_size=256 \
-ocr_feed_list=/home/ubuntu/mayub/text_detection/SynthText/data/crops_icdar/train.txt

I'm testing on a sample of ICDAR13 images; the results look like the one below:

[image: detection and OCR output on an ICDAR13 sample]

Looks like something is off. Any thoughts appreciated.

mattroos commented 3 years ago

Sorry, @mohammedayub44, I'm too far removed from that moment in time to recall much beyond what you see in my figure and post. A bit over 2000 batches of ICDAR-2017-MLT got to that bbox loss of ~0.4. I don't recall whether that figure is of training data or test data. I think I was only assessing the bbox branch, not the OCR branch.

mohammedayub44 commented 3 years ago

@mattroos That's fine. Thanks for getting back though.