Closed alwc closed 5 years ago
Hi Alex, I would not claim any superiority - one needs quite a big dataset to compare the 2 methods.
The main difference (as always in deep learning) is the data - ICDAR MLT is a much larger dataset than any previous one; we also do not do fine-tuning for a specific dataset, we try to fit them all.
incremental updates are:
Thanks for your prompt reply @MichalBusta! I started learning about OCR on my own not too long ago, so your insightful replies are very valuable to me. So far I've learned a lot from your repo!
I have a few follow up questions:
In `resize_image`, your `max_size` is `1585152`. How did you decide to pick that number?
In your README, you listed the following datasets:
- ICDAR MLT Dataset
- ICDAR 2015 Dataset
- RCTW-17
- Synthetic MLT Data (Arabic, Bangla, Chinese, Japanese, Korean, Latin )
Did you use all those datasets to train `e2e-mlt.h5`?
> In `resize_image`, your `max_size` is `1585152`. How did you decide to pick that number?

1548*1024 ~ something that works (with reserve) for my 2GB notebook GPU
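For illustration, capping the resized image area at `max_size` could look like the sketch below. This is only an assumption about how the limit might be applied, not the repo's actual `resize_image`; the helper name is made up:

```python
import math

# ~1548 * 1024 pixels, reportedly chosen to fit a 2 GB notebook GPU
MAX_SIZE = 1585152

def cap_area(width, height, max_size=MAX_SIZE):
    """Scale (width, height) down so that width * height <= max_size,
    preserving the aspect ratio; images already under the cap pass
    through unchanged."""
    area = width * height
    if area <= max_size:
        return width, height
    scale = math.sqrt(max_size / area)
    return int(width * scale), int(height * scale)
```

The point of capping the *area* rather than one side is that memory use of the feature maps grows with width times height, so a single area budget adapts to any aspect ratio.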
> In your README, you listed the following datasets:
> - ICDAR MLT Dataset
> - ICDAR 2015 Dataset
> - RCTW-17
> - Synthetic MLT Data (Arabic, Bangla, Chinese, Japanese, Korean, Latin)
>
> Did you use all those datasets to train `e2e-mlt.h5`?

ICDAR MLT, ICDAR 2015, ICDAR 2013 + Synth for pre-training
- Suppose all my images are in Chinese and English. Do you think I'll get better performance if I train E2E-MLT with just English and Chinese?
yes, but not significantly (we did the experiment where we took the softmax just over the "latin filters" and the result was just 1~2% better)
but of course it depends - there is a high class imbalance (Latin and Chinese are most frequent), so if you do Latin + Korean there will probably be a more significant boost (there is a confusion table in the paper which somehow tries to show that the network has a language model somewhere inside)
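The "softmax just over the latin filters" experiment can be sketched as masking out the logits of all non-Latin character classes before normalizing. This is a hypothetical illustration; the function name and class indices are made up, not taken from the repo:

```python
import numpy as np

def masked_softmax(logits, keep_indices):
    """Softmax restricted to keep_indices; all other classes get
    probability exactly zero (their logits are set to -inf)."""
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[keep_indices] = logits[keep_indices]
    e = np.exp(masked - masked[keep_indices].max())
    return e / e.sum()
```

With this, a model trained on all scripts can still be evaluated as if it only knew Latin characters, which is how one can measure the 1~2% gap without retraining.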
Thanks again @MichalBusta, I'm looking forward to reading the updated paper!
Hi @MichalBusta
I've been revisiting E2E-MLT and EAST lately and I recalled you said:
> learning angle in sin and cos representation (much more stable angle predictions), the method also predict reading direction
I've been reading the source code but I'm not sure which part contributes to "predict reading direction", do you mind shedding some light? Thanks!
in loss function: https://github.com/MichalBusta/E2E-MLT/blob/be8f074da7a60ff9f88bc2ded39c00940ec4ba26/models.py#L452 the read-out is: https://github.com/MichalBusta/E2E-MLT/blob/be8f074da7a60ff9f88bc2ded39c00940ec4ba26/nms/adaptor.cpp#L88
hope it helps, Michal
I walked through the code and I have some follow-up questions:
1/ "Predicting reading direction" is basically another way of saying the model can detect bounding boxes that are rotated 0-360 degrees, right?
2/ For most of the input images, the shapes of the bounding boxes are approximately rectangular. However, occasionally I'm getting bounding boxes that are really off (i.e. not really a rectangle), since for those predictions their `angle_sin ** 2 + angle_cos ** 2` is much smaller than 1. Do you have any idea how to constrain `angle_sin ** 2 + angle_cos ** 2 = 1`? One idea that comes to mind is to add a new loss term, `angle_sin ** 2 + angle_cos ** 2 - 1`, but this still doesn't guarantee the result will be equal to 1 when predicting.
3/ In `nms/adaptor.cpp`, Lines 93-99:
3a/ What do `ph` and `phx` represent? Why do both of them equal `9`?
3b/ What do `p_left`, `p_top`, `p_right`, and `p_bt` represent? Why do you need to use the exponential?
4/ Your "weighted merge" in `nms/nms.h`, Lines 58-68, seems to be quite different from EAST's locality-aware NMS. What's the logic behind it?
I'm sorry for asking so many questions. It surprises me that for some scenarios your implementation performs much better than argman/EAST, and I'm curious to know why!
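The extra loss term proposed in question 2/ above could be sketched like this (a hypothetical illustration of the idea, not code from the repo); as noted, it only encourages, and does not guarantee, unit-norm predictions at inference time:

```python
def unit_circle_penalty(sin_pred, cos_pred):
    """Extra loss term that pushes (sin, cos) predictions toward the
    unit circle; squared so the penalty is zero exactly when
    sin^2 + cos^2 == 1 and positive otherwise."""
    return (sin_pred ** 2 + cos_pred ** 2 - 1.0) ** 2
```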
> 1/ "Predicting reading direction" is basically another way of saying the model can detect bounding boxes that are rotated 0-360 degrees, right?
right. (EAST is just a detector, so they do not need to predict reading direction - the IoU metric will give the same number)
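To see why a (sin, cos) output carries reading direction: `atan2` recovers an angle over the full (-180°, 180°] range, so a box and its 180°-flipped version (i.e. the opposite reading direction) get distinct angles. A minimal sketch, not code from the repo:

```python
import math

def angle_deg(sin_a, cos_a):
    """Recover the full-range rotation angle from (sin, cos) predictions.
    atan2 distinguishes a box from its 180-degree flip, which a single
    angle restricted to a 90-degree range cannot."""
    return math.degrees(math.atan2(sin_a, cos_a))
```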
> 2/ ... Do you have any idea how to constrain `angle_sin ** 2 + angle_cos ** 2 = 1`?
sure, pretty basic trick with reparametrization: cos(a) = c / |c + s| and sin(a) = s / |c + s|
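As clarified further down the thread, `|c + s|` here denotes the l2-norm of the (c, s) pair, so the trick is simply to normalize the two raw outputs onto the unit circle (a minimal sketch):

```python
import math

def reparametrize(c, s, eps=1e-8):
    """Normalize raw network outputs (c, s) by their l2-norm so that the
    returned (cos, sin) pair satisfies cos^2 + sin^2 == 1 by construction
    (eps guards against division by zero)."""
    norm = math.sqrt(c * c + s * s) + eps
    return c / norm, s / norm
```

Unlike an extra penalty term, this enforces the constraint at inference time as well, because every prediction passes through the normalization.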
> 3/ In `nms/adaptor.cpp`, Lines 93-99:
> 3a/ What do `ph` and `phx` represent? Why do both of them equal `9`?
> 3b/ What do `p_left`, `p_top`, `p_right`, and `p_bt` represent? Why do you need to use the exponential?
> 4/ Your "weighted merge" in `nms/nms.h`, Lines 58-68, seems to be quite different from EAST's locality-aware NMS. What's the logic behind it?
this is for a longer discussion:
> I'm sorry for asking so many questions. It surprises me that for some scenarios your implementation performs much better than argman/EAST, and I'm curious to know why!
- we have used much more data and more data augmentation during the training
For 2/, I don't think `cos(a) = c / |c + s|` and `sin(a) = s / |c + s|` will constrain `sin(a) ** 2 + cos(a) ** 2 == 1`; maybe `cos(a) = c**2 / |c**2 + s**2|` and `sin(a) = s**2 / |c**2 + s**2|` would work, but then we lose the +/- sign.
What do you think of `cos(a) = cos(atan2(s, c))`, `sin(a) = sin(atan2(s, c))`? It seems this could constrain `sin(a) ** 2 + cos(a) ** 2 == 1`.
For 3-4/,

> the version in our repo uses assumption that longer prediction leads to bigger error (you can measure this and make approximation - small improv. over paper version +2% recall)

Do you mind locating where the repo "uses assumption that longer prediction leads to bigger error"?
> For 2/, I don't think `cos(a) = c / |c + s|` and `sin(a) = s / |c + s|` will constrain `sin(a) ** 2 + cos(a) ** 2 == 1` ...

cos(a) = c / sqrt(c^2 + s^2) -> cos(a)^2 = c^2 / (c^2 + s^2)
sin(a) = s / sqrt(c^2 + s^2) -> sin(a)^2 = s^2 / (c^2 + s^2)
-> 1 = (c^2 + s^2) / (c^2 + s^2) ?

> Do you mind locating where the repo "uses assumption that longer prediction leads to bigger error"?
bigger distance = lower weight for merge: https://github.com/MichalBusta/E2E-MLT/blob/4d6c92a5ee4f66a0a6d39757f2e8331893ddce04/nms/adaptor.cpp#L109
and merge: https://github.com/MichalBusta/E2E-MLT/blob/4d6c92a5ee4f66a0a6d39757f2e8331893ddce04/nms/nms.h#L58
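A hypothetical sketch of what "bigger distance = lower weight for merge" could mean in a weighted NMS merge. The exponential decay, parameter names, and corner-averaging scheme below are assumptions for illustration, not the repo's exact code:

```python
import math

def merge_weight(score, dist, tau=10.0):
    """Confidence weight for a prediction, decayed by the distance from
    the predicting pixel to the box edge it predicts: longer predictions
    are assumed noisier, so they contribute less to the merge."""
    return score * math.exp(-dist / tau)

def weighted_merge(box_a, w_a, box_b, w_b):
    """Merge two quadrilaterals (lists of (x, y) corners) by averaging
    corresponding corners with the given weights."""
    total = w_a + w_b
    return [((xa * w_a + xb * w_b) / total, (ya * w_a + yb * w_b) / total)
            for (xa, ya), (xb, yb) in zip(box_a, box_b)]
```

The key difference from a plain locality-aware NMS merge is that the weight is not the raw score alone but the score discounted by prediction distance, which matches the "longer prediction leads to bigger error" assumption.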
> cos(a) = c / sqrt(c^2 + s^2) -> cos(a)^2 = c^2 / (c^2 + s^2); sin(a) = s / sqrt(c^2 + s^2) -> sin(a)^2 = s^2 / (c^2 + s^2) -> 1 = (c^2 + s^2) / (c^2 + s^2) ?

My bad! I misread your `| . |` sign as `abs` instead of the l2-norm. Let me see if I can train the model with your suggested loss.
Just FYI, your suggested constraint for the loss function seems to work from my preliminary study!
I notice that your EAST implementation performs much better than argman/EAST when there are large fonts.
Here is the prediction from your model:
And here is the prediction from argman/EAST:
Do you mind sharing the key components behind this improved detection? Thanks!