quite some character level bboxes are of zero width or zero height

mmmmmore commented 4 years ago

Thanks again for open-sourcing your dataset and generation code. While training a model that requires character-level labels, I found that many of the character-level bboxes have zero width or zero height. Is there a recommended way to clean the data?

Jyouhou commented 4 years ago

Can you provide some examples that have zero width or height? There should not be any such data.

FYI, when I ran the experiments with detection models, I filtered out bboxes smaller than 10 px on either side. That should be enough to help.
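A minimal sketch of a size filter like the one described above, not the exact code used in those experiments. It assumes each bbox is a flat list of eight numbers laid out as `[x1, y1, x2, y2, x3, y3, x4, y4]` (the layout seen in the examples later in this thread) and measures the axis-aligned extent, which is only an approximation for rotated text. The names `bbox_extent` and `filter_small` are hypothetical.

```python
from typing import List, Sequence, Tuple


def bbox_extent(quad: Sequence[float]) -> Tuple[float, float]:
    """Return (width, height) of the axis-aligned extent of a flat quad.

    Assumption: coordinates are ordered [x1, y1, x2, y2, x3, y3, x4, y4].
    """
    xs, ys = quad[0::2], quad[1::2]
    return max(xs) - min(xs), max(ys) - min(ys)


def filter_small(quads: List[Sequence[float]], min_side: float = 10) -> List[Sequence[float]]:
    """Keep only quads whose width AND height are at least min_side pixels."""
    kept = []
    for quad in quads:
        width, height = bbox_extent(quad)
        if width >= min_side and height >= min_side:
            kept.append(quad)
    return kept
```

With `min_side=10`, a bbox is dropped as soon as either side falls below 10 px, so zero-width and zero-height boxes are removed as a side effect.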

Jyouhou commented 4 years ago

In the current setting, some of the text is very small. This may boost detection performance but harm recognition performance. I would recommend filtering out characters/words that are too small.

mmmmmore commented 4 years ago

file english/sub_112/labels/1674.json
[239, 682, 239, 682, 239, 685, 239, 685]
[761, 230, 761, 230, 761, 240, 761, 240]

file english/sub_112/labels/7559.json
[625, 300, 625, 300, 625, 305, 625, 305]

file english/sub_112/labels/4852.json
[574, 509, 575, 509, 575, 509, 574, 509]

I loaded 200 images and filtered out all character bboxes whose corresponding word bbox is less than 10 pixels high. There are still 17 character-level bboxes with either zero width or zero height.
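A minimal sketch of such a zero-width/zero-height check, run on the four bboxes listed above. It assumes the eight numbers are `[x1, y1, x2, y2, x3, y3, x4, y4]`; the helper name `is_degenerate` is hypothetical, and the dictionary simply hard-codes the values from this comment rather than parsing the label files, whose JSON schema is not shown here.

```python
from typing import Sequence


def is_degenerate(quad: Sequence[float]) -> bool:
    """True if the quad's axis-aligned extent has zero width or zero height.

    Assumption: coordinates are ordered [x1, y1, x2, y2, x3, y3, x4, y4].
    """
    xs, ys = quad[0::2], quad[1::2]
    return max(xs) == min(xs) or max(ys) == min(ys)


# Character bboxes copied from the examples in this thread.
examples = {
    "english/sub_112/labels/1674.json": [
        [239, 682, 239, 682, 239, 685, 239, 685],
        [761, 230, 761, 230, 761, 240, 761, 240],
    ],
    "english/sub_112/labels/7559.json": [
        [625, 300, 625, 300, 625, 305, 625, 305],
    ],
    "english/sub_112/labels/4852.json": [
        [574, 509, 575, 509, 575, 509, 574, 509],
    ],
}

# Count how many of the listed bboxes are degenerate.
degenerate_count = sum(
    is_degenerate(quad) for quads in examples.values() for quad in quads
)
print(degenerate_count)
```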

> I filtered out bbox that are smaller than 10 px in either side.

I am not sure whether this filtering is appropriate, since some characters or symbols are thin (like I) or short (like "). Filtering makes sense for word bboxes, but is it also valid for character-level bboxes?

Jyouhou commented 4 years ago

In my experiments, small words/characters can be beneficial for detection but may harm recognition. It makes more sense to train only on characters that are not too small.
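One possible way to apply this at character level without discarding legitimately thin or short glyphs, offered as a sketch under assumptions rather than the method used for the dataset: judge "too small" by the height of the parent word bbox, and separately reject any character bbox with zero width or zero height. The function names, the 10 px default, and the `[x1, y1, ..., x4, y4]` layout are all assumptions, and the axis-aligned extent is only approximate for rotated text.

```python
from typing import Sequence, Tuple


def extent(quad: Sequence[float]) -> Tuple[float, float]:
    """Return (width, height) of the axis-aligned extent of a flat quad."""
    xs, ys = quad[0::2], quad[1::2]
    return max(xs) - min(xs), max(ys) - min(ys)


def keep_char(
    char_quad: Sequence[float],
    word_quad: Sequence[float],
    min_word_height: float = 10,
) -> bool:
    """Keep a character if its own box is non-degenerate and its word is tall enough.

    The size threshold is applied to the word, not the character, so thin
    characters such as "I" survive as long as their box has positive area.
    """
    char_w, char_h = extent(char_quad)
    _, word_h = extent(word_quad)
    return char_w > 0 and char_h > 0 and word_h >= min_word_height
```

For example, a 2 px wide "I" inside a 24 px tall word is kept, while a zero-width character box is dropped regardless of the word's size.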

Jyouhou / UnrealText

quite some character level bboxes are of zero width or zero height #12