Jyouhou / UnrealText

Synthetic Scene Text from 3D Engines

Chinese in the English dataset? #25

Closed by alcinos 2 years ago

alcinos commented 3 years ago

Hello,

First off, thanks for releasing this dataset; I believe it is a very useful contribution to the OCR community.

I have a question regarding the splits. The paper indicates that the English dataset is generated using English words, so I expected to find mostly (only?) Latin characters. However, from what I can tell, there are quite a few exotic UTF-8 characters, which on first inspection seem to be mostly Chinese/Arabic.

Is this expected? As noted in #11, this is problematic since the Chinese characters often seem to be mis-rendered.

Here is an example taken from sub_2/imgs/4482.jpg.

Using the following snippet, run on sub_0 through sub_5, I estimate that such characters appear in ~15% of the images in the dataset, which I wouldn't qualify as negligible:

import glob
import json

import tqdm

# Count images containing at least one text instance with a codepoint
# above 500 (a crude heuristic for non-Latin characters).
all_files = glob.glob("unrealtext/sub_*/labels/*.json")
count = 0
for fname in tqdm.tqdm(all_files):
    with open(fname, "r") as f:
        data = json.load(f)
    for t in data["text"]:
        # Guard against empty strings, which would make max() raise.
        if t and max(ord(c) for c in t) > 500:
            count += 1
            break
print("problematic proportion", count / len(all_files))

Did I miss something? Could you clarify how the English words were mined in the first place? Thanks in advance.

Jyouhou commented 3 years ago

Hi,

Thanks for your interest in our work.

This is a known bug. As noted in the other thread, the detector performance is still much better than that of other methods, and the contamination rate is not too high if you count #errors / #words rather than affected images. The bug will be fixed later.
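
For reference, a word-level count with the same codepoint > 500 heuristic as the snippet above would look roughly like this (a sketch, reusing the directory layout from the original report):

import glob
import json

total_words = 0
flagged_words = 0
for fname in glob.glob("unrealtext/sub_*/labels/*.json"):
    with open(fname, "r") as f:
        data = json.load(f)
    for t in data["text"]:
        total_words += 1
        # Flag a word if any of its characters exceeds the threshold.
        if t and max(ord(c) for c in t) > 500:
            flagged_words += 1
print("word-level proportion", flagged_words / max(total_words, 1))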

As written in the paper, I used the same corpus as SynthText for the English-only dataset. I'll inspect it further to see how this happened.
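
Purely as an illustration of the kind of fix this implies (a minimal sketch, assuming the corpus is a plain list of words fed to the renderer; is_latin_word is a hypothetical helper, not part of the released code):

corpus = ["hello", "world", "你好", "مرحبا"]  # toy stand-in for the real corpus

def is_latin_word(word, max_codepoint=0x024F):
    # Keep words whose characters all fall within Basic Latin through
    # Latin Extended-B; drop everything else before rendering.
    return all(ord(c) <= max_codepoint for c in word)

filtered_corpus = [w for w in corpus if is_latin_word(w)]
print(filtered_corpus)  # ['hello', 'world']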