Jyouhou / UnrealText

Synthetic Scene Text from 3D Engines

Chinese in the English dataset? #25

Closed by alcinos 2 years ago

alcinos commented 3 years ago

Hello,

First off, thanks for releasing this dataset; I believe it is a very useful contribution to the OCR community.

I have a question regarding the splits. The paper indicates that the English dataset is generated using English words, so I expected to find mostly (only?) Latin characters. However, from what I can tell, there are quite a few exotic UTF-8 characters, which on first inspection seem to be mostly Chinese/Arabic.

Is this expected? As noted in #11, this is problematic since the Chinese characters often seem to be mis-rendered.

Here is an example taken from sub_2/imgs/4482.jpg.

Using the following snippet, run on sub_0 through sub_5, I estimate that such characters appear in ~15% of the images in the dataset, which I wouldn't qualify as negligible:

import glob
import json

import tqdm

# Count images containing at least one text instance with a codepoint
# above 500 (a crude heuristic for non-Latin characters).
all_files = glob.glob("unrealtext/sub_*/labels/*.json")
count = 0
for fname in tqdm.tqdm(all_files):
    with open(fname, "r") as f:
        data = json.load(f)
    for t in data["text"]:
        # Guard against empty strings, which would make max() raise.
        if t and max(ord(c) for c in t) > 500:
            count += 1
            break
print("problematic proportion", count / len(all_files))

Did I miss something? Could you clarify how the English words were mined in the first place? Thanks in advance.

Jyouhou commented 3 years ago

Hi,

Thanks for your interest in our work.

This is a known bug. As noted in the other thread, the detector performance is still much better than that of other methods, and the contamination rate is not too high if you count #errors / #words rather than affected images. The bug will be fixed later.
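
For reference, a word-level count with the same codepoint > 500 heuristic as the snippet above would look roughly like this (a sketch, reusing the directory layout from the original report):

import glob
import json

total_words = 0
flagged_words = 0
for fname in glob.glob("unrealtext/sub_*/labels/*.json"):
    with open(fname, "r") as f:
        data = json.load(f)
    for t in data["text"]:
        total_words += 1
        # Flag a word if any of its characters exceeds the threshold.
        if t and max(ord(c) for c in t) > 500:
            flagged_words += 1
print("word-level proportion", flagged_words / max(total_words, 1))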

As written in the paper, I used the same corpus as SynthText for the English-only dataset. I'll inspect it further to see how this happened.
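
Purely as an illustration of the kind of fix this implies (a minimal sketch, assuming the corpus is a plain list of words fed to the renderer; is_latin_word is a hypothetical helper, not part of the released code):

corpus = ["hello", "world", "你好", "مرحبا"]  # toy stand-in for the real corpus

def is_latin_word(word, max_codepoint=0x024F):
    # Keep words whose characters all fall within Basic Latin through
    # Latin Extended-B; drop everything else before rendering.
    return all(ord(c) <= max_codepoint for c in word)

filtered_corpus = [w for w in corpus if is_latin_word(w)]
print(filtered_corpus)  # ['hello', 'world']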