Closed alcinos closed 2 years ago
Hi,
Thanks for your interest in our work.
This is a known bug, and as noted in the other thread, the detector performance is still much better than other methods, and the error rate is not too high (if you count #errors / #words). This bug will be fixed later.
As written in the paper, I used the same corpus as SynthText for the English-only dataset. I'll do more inspection to see how this happened.
Hello,
First off, thanks for releasing this dataset, I believe it is a very useful contribution for the OCR community.
I have a question regarding the splits. The paper indicates that the English dataset is generated using English words. As a result, I expected to find mostly (only?) Latin characters. However, from what I can tell, there are quite a few exotic UTF-8 characters, which on first inspection seem to be mostly Chinese/Arabic characters.
Is this expected? As noted in #11, this is problematic since the Chinese characters often seem to be mis-rendered.
Here is an example, taken from sub_2/imgs/4482.jpg
Using the following snippet, run on sub_0 through sub_5, I estimate that such characters appear in ~15% of the images of the dataset, which I wouldn't call negligible:
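(The original snippet did not survive here; below is a minimal sketch of the kind of check described, not the author's exact code. The directory layout and the assumption of one UTF-8 ground-truth `.txt` file per image are hypothetical, so adapt the globbing to however the annotations are actually stored.)

```python
import re
from pathlib import Path

# Flag any character outside Basic Latin through Latin Extended-B
# (U+0000-U+024F) and general punctuation (U+2000-U+206F) as "non-Latin".
# This range is a rough heuristic, not a precise Unicode-script test.
NON_LATIN = re.compile(r"[^\u0000-\u024F\u2000-\u206F]")

def has_non_latin(text: str) -> bool:
    """Return True if the string contains characters outside the Latin range."""
    return NON_LATIN.search(text) is not None

def non_latin_fraction(gt_dir: str) -> float:
    """Fraction of ground-truth files containing at least one non-Latin character.

    Assumes one UTF-8 text file per image under gt_dir (hypothetical layout).
    """
    files = list(Path(gt_dir).glob("**/*.txt"))
    if not files:
        return 0.0
    hits = sum(has_non_latin(f.read_text(encoding="utf-8")) for f in files)
    return hits / len(files)
```

Running `non_latin_fraction` over each `sub_*` directory would give the per-split estimate; accented Latin characters (e.g. `é`) are deliberately not flagged, while CJK and Arabic characters are.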
Did I miss something? Could you perhaps clarify how the English words were mined in the first place? Thanks in advance!