Your EAST implementation vs. argman/EAST

alwc commented 5 years ago

I notice that your EAST implementation performs much better than argman/EAST when there are large fonts.

Here is the prediction from your model:

And here is the prediction from argman/EAST

Do you mind sharing what's the key components of improving such detection? Thanks!

MichalBusta commented 5 years ago

Hi Alex, I would not claim any superiority - one needs quite big dataset to compare 2 methods.

The main diff. (as always in deep learning) is the data - ICDAR MLT is much larger dataset than any previous, we also do not do fine tuning for specific dataset, we try to fit them all.

incremental updates are:

learning angle in sin and cos representation (much more stable angle predictions), the method also predict reading direction
Feature Pyramid Network (we do the multi-scale learning, so this can help for big text)
final, but not used in demo - fine bounding box guided by OCR (for evaluation, we split the bounding boxes by spaces, punctuation chars into the words)

alwc commented 5 years ago

Thanks for your prompt reply @MichalBusta ! I just started learning about OCR by myself not too long ago so your insightful replies are very valuable to me. So far I've learned a lot from your repo!

I have a few follow up questions:

In resize_image, your max_size is 1585152. How did you decide to pick that number?
In your read me, it listed the following dataset:

ICDAR MLT Dataset

ICDAR 2015 Dataset

RCTW-17

Synthetic MLT Data (Arabic, Bangla, Chinese, Japanese, Korean, Latin )

Did you use all those dataset to train e2e-mlt.h5?

Suppose all my images are in Chinese and English. Do you think I'll get better performance if I train E2E-MLT with just English and Chinese?

MichalBusta commented 5 years ago

In resize_image, your max_size is 1585152. How did you decide to pick that number?

1548*1024 ~ something that work (with reserve) for my 2GB notebook GPU

In your read me, it listed the following dataset:

ICDAR MLT Dataset

ICDAR 2015 Dataset

RCTW-17

Synthetic MLT Data (Arabic, Bangla, Chinese, Japanese, Korean, Latin )

Did you use all those dataset to train e2e-mlt.h5?

ICDAR MLT, ICDAR 2015, ICDAR 2013 + Synth for pre-training

Suppose all my images are in Chinese and English. Do you think I'll get better performance if I train E2E-MLT with just English and Chinese?

yes, but not significantly (we did the experiment where we took softmax just over the "latin filters" and the result was just 1~2% better)

but of course it depends - there is high class in-balance (Latin and Chinese are most frequent), so if you will do Latin + Korean there will probably more significant boost (there is confusion table in paper which somehow try to show, that the network have the language model somewhere inside)

alwc commented 5 years ago

Thanks again @MichalBusta , I'm looking forward to read the updated paper!

alwc commented 5 years ago

Hi @MichalBusta

I've been revisiting E2E-MLT and EAST lately and I recalled you said:

learning angle in sin and cos representation (much more stable angle predictions), the method also predict reading direction

I've been reading the source code but I'm not sure which part contributes to "predict reading direction", do you mind shedding some light? Thanks!

MichalBusta commented 5 years ago

Hi @MichalBusta

I've been revisiting E2E-MLT and EAST lately and I recalled you said:

learning angle in sin and cos representation (much more stable angle predictions), the method also predict reading direction

I've been reading the source code but I'm not sure which part contributes to "predict reading direction", do you mind shedding some light? Thanks!

in loss function: https://github.com/MichalBusta/E2E-MLT/blob/be8f074da7a60ff9f88bc2ded39c00940ec4ba26/models.py#L452 the read-out is: https://github.com/MichalBusta/E2E-MLT/blob/be8f074da7a60ff9f88bc2ded39c00940ec4ba26/nms/adaptor.cpp#L88

hope it helps, Michal

alwc commented 5 years ago

I walked through the code and I have some following up questions:

1/ "Predicting reading direction" is basically another way of saying the model can detect bounding boxes that rotate 0-360 degree right?

2/ For most of the input images, the shapes of the bounding boxes are approximately rectangular. However, occasionally I'm getting bounding boxes that are really off (i.e. not really a rectangle) since for those predictions, atheir angle_sin ** 2 + angle_cos ** 2 are much smaller than 1 . Do you have any idea to constraint angle_sin ** 2 + angle_cos ** 2 = 1? One idea comes to mind is to add a new loss function: angle_sin ** 2 + angle_cos ** 2 - 1, but this still doesn't guarantee the result will be equal to 1 when predicting.

3/ In nms/adaptor.cpp Line 93-99

3a/ What do ph and phx represent? Why both of them equal to 9?

3b/ What do p_left, p_top, p_right, and p_bt represent? Why do you need to use exponential?

4/ Your "weighted merge" in nms/nms.h Line 58-68 seems to be quite different compared to EAST's locality-aware NMS. What's the logic behind?

I’m sorry for asking so many questions. It surprises me for some scenario, your implementation performs much better than argman/EAST and I'm curious to know why!

MichalBusta commented 5 years ago

I walked through the code and I have some following up questions:

1/ "Predicting reading direction" is basically another way of saying the model can detect bounding boxes that rotate 0-360 degree right?

right. (east is just detector, so they do not need to predict reading direction - IoU metric will give same number)

2/ For most of the input images, the shapes of the bounding boxes are approximately rectangular. However, occasionally I'm getting bounding boxes that are really off (i.e. not really a rectangle) since for those predictions, atheir angle_sin ** 2 + angle_cos ** 2 are much smaller than 1 . Do you have any idea to constraint angle_sin ** 2 + angle_cos ** 2 = 1? One idea comes to mind is to add a new loss function: angle_sin ** 2 + angle_cos ** 2 - 1, but this still doesn't guarantee the result will be equal to 1 when predicting.

sure, pretty basic trick with reparametrization: cos(a) = c / |c + s| and sin(a) = s / |c + s|

3/ In nms/adaptor.cpp Line 93-99

3a/ What do ph and phx represent? Why both of them equal to 9?

3b/ What do p_left, p_top, p_right, and p_bt represent? Why do you need to use exponential?

4/ Your "weighted merge" in nms/nms.h Line 58-68 seems to be quite different compared to EAST's locality-aware NMS. What's the logic behind?

this if for longer discussion:

both versions of NMS are wrong (but the network is so good so we do not care too much :( )
the version in our repo uses assumption that longer prediction leads to bigger error (you can measure this and make approximation - small improv. over paper version +2% recall)
proper way can be:
- do the clustering right
- learn the weights (works better, but not significantly)

I’m sorry for asking so many questions. It surprises me for some scenario, your implementation performs much better than argman/EAST and I'm curious to know why!

we have used much more data and more data augumentation during the training

alwc commented 5 years ago

For 2/, I don’t think cos(a) = c / |c + s| and sin(a) = s / |c + s| will constrain sin(a) ** 2 + cos(a) ** 2 == 1, maybe cos(a) = c**2 / |c**2 + s**2| and sin(a) = s**2 / |c**2 + s**2| will work but we loses the +/- sign.

What do you think of cos(a) = cos(atan2(c/s)), sin(a) = sin(atan2(c/s))? It seems this could constrain sin(a) ** 2 + cos(a) ** 2 == 1.

For 3-4/,

the version in our repo uses assumption that longer prediction leads to bigger error (you can measure this and make approximation - small improv. over paper version +2% recall)

Do you mind locating where the repo "uses assumption that longer prediction leads to bigger error"?

MichalBusta commented 5 years ago

For 2/, I don’t think cos(a) = c / |c + s| and sin(a) = s / |c + s| will constrain sin(a) ** 2 + cos(a) ** 2 == 1,

cos(a) = c / sqrt(c^2 + s^2) -> cos(a)^2 = c^2 / (c^2 + s^2 ) sin(a) = s / sqrt(c^2 + s^2) -> sin(a)^2 = s^2 / (c^2 + s^2 ) -> 1 = (c^2 + s^2) / (c^2 + s^2) ?

What do you think of cos(a) = cos(atan2(c/s)), sin(a) = sin(atan2(c/s))? It seems this could constrain sin(a) ** 2 + cos(a) ** 2 == 1.

For 3-4/,

the version in our repo uses assumption that longer prediction leads to bigger error (you can measure this and make approximation - small improv. over paper version +2% recall)

Do you mind locating where the repo "uses assumption that longer prediction leads to bigger error"?

bigger distance = lower weight for merge: https://github.com/MichalBusta/E2E-MLT/blob/4d6c92a5ee4f66a0a6d39757f2e8331893ddce04/nms/adaptor.cpp#L109

and merge: https://github.com/MichalBusta/E2E-MLT/blob/4d6c92a5ee4f66a0a6d39757f2e8331893ddce04/nms/nms.h#L58

alwc commented 5 years ago

For 2/, I don’t think cos(a) = c / |c + s| and sin(a) = s / |c + s| will constrain sin(a) ** 2 + cos(a) ** 2 == 1,

cos(a) = c / sqrt(c^2 + s^2) -> cos(a)^2 = c^2 / (c^2 + s^2 ) sin(a) = s / sqrt(c^2 + s^2) -> sin(a)^2 = s^2 / (c^2 + s^2 ) -> 1 = (c^2 + s^2) / (c^2 + s^2) ?

My bad! I misread your | . | sign as abs instead of l2-norm. Let me see if I could train the model with your suggested loss.

alwc commented 5 years ago

Just FYI, your suggested constraint for the loss function seems to work from my preliminary study!

MichalBusta / E2E-MLT

Your EAST implementation vs. argman/EAST #5