FudanVI / benchmarking-chinese-text-recognition

This repository contains datasets and baselines for benchmarking Chinese text recognition.
MIT License

The HWDB and ICDAR2013 #2

Open yusirhhh opened 2 years ago

yusirhhh commented 2 years ago

Thank you very much for your work. Could you please supplement the experimental results on HWDB and ICDAR2013? These two datasets are very important in Chinese handwriting recognition and have a relatively large body of prior work, which would make it easier to compare the performance differences between methods.

JingyeChen commented 2 years ago

Thanks for your attention to our work!

In fact, we initially collected HWDB2.0-2.2, ICDAR2013, and SCUT for the experiments. However, we observed that there exist some domain gaps in image style among these three datasets (e.g., HWDB2.0-2.2 and ICDAR2013 have clean backgrounds, while SCUT has more complex backgrounds suffering from uneven illumination, grids, etc.), so it is inefficient to combine them for training.

Additionally, we observed that HWDB2.0-2.2 and ICDAR2013 have fewer samples (52,220 and 3,432, respectively) compared with SCUT (116,643). The community mainly utilizes HWDB1.0-1.2 (single-character datasets) to synthesize text-line datasets for training, which is a little inconvenient. So we only constructed the handwriting dataset based on SCUT.

Thanks for your advice. In any case, we will upload the lmdb-format HWDB2.0-2.2 and ICDAR2013 datasets for further research.

yusirhhh commented 2 years ago

Hello, thank you very much for your reply! I downloaded the HWDB dataset (lmdb format) that you released, but I have a small doubt: I found that the labels of the training and validation sets are in half-width format, while those of the test set are in full-width format.

JingyeChen commented 2 years ago

Hello! These datasets were collected from the official websites. You can manually convert them to half-width format for training.
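For reference, such a conversion can be done with the standard library alone. This is a minimal sketch (the helper name `to_halfwidth` is illustrative, not part of this repository): full-width forms occupy U+FF01–U+FF5E and map to ASCII by subtracting 0xFEE0, and the ideographic space U+3000 maps to an ASCII space.

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width characters to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:    # full-width forms -> ASCII
            code -= 0xFEE0
        out.append(chr(code))
    return ''.join(out)

print(to_halfwidth('ＡＢＣ１２３，'))  # -> ABC123,
```

Characters outside these ranges (e.g. the ideographic full stop 。) have no half-width ASCII counterpart and are left unchanged.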

yusirhhh commented 2 years ago

Could you provide the code for parsing the DGRL-format data in HWDB, or the HWDB dataset in png/jpg format? I have run into a problem at this parsing step and hope to get your suggestions.

JingyeChen commented 2 years ago
```python
import struct
import os
import cv2 as cv
import numpy as np
from PIL import Image

dgrl = '/home/dataset/benchmark/temp/offline_handwriting/HWDB2.0Test/006-P16.dgrl'

def read_from_dgrl(dgrl, file):
    if not os.path.exists(dgrl):
        print('DGRL does not exist!')
        return

    dir_name, base_name = os.path.split(dgrl)
    label_dir = dir_name + '_label'
    image_dir = dir_name + '_images'
    if not os.path.exists(label_dir):
        os.makedirs(label_dir)
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)

    with open(dgrl, 'rb') as f:
        # Read the header size (4 bytes, little-endian)
        header_size = np.fromfile(f, dtype='uint8', count=4)
        header_size = sum([j << (i * 8) for i, j in enumerate(header_size)])

        # Read the rest of the header and extract code_length
        header = np.fromfile(f, dtype='uint8', count=header_size - 4)
        code_length = sum([j << (i * 8) for i, j in enumerate(header[-4:-2])])

        # Read the page size info and the number of text lines
        image_record = np.fromfile(f, dtype='uint8', count=12)
        height = sum([j << (i * 8) for i, j in enumerate(image_record[:4])])
        width = sum([j << (i * 8) for i, j in enumerate(image_record[4:8])])
        line_num = sum([j << (i * 8) for i, j in enumerate(image_record[8:])])
        print('Page size:')
        print(height, width, line_num)

        # Parse each text line
        for k in range(line_num):
            print(k + 1)

            # Number of characters in this line
            char_num = np.fromfile(f, dtype='uint8', count=4)
            char_num = sum([j << (i * 8) for i, j in enumerate(char_num)])
            print('char count:', char_num)

            # Read the GBK-encoded label of this line
            label = np.fromfile(f, dtype='uint8', count=code_length * char_num)
            label = [label[i] << (8 * (i % code_length)) for i in range(code_length * char_num)]
            label = [sum(label[i * code_length:(i + 1) * code_length]) for i in range(char_num)]
            label = [struct.pack('I', i).decode('gbk', 'ignore')[0] for i in label]
            print('before merge:', label)
            label = ''.join(label)
            # Strip the invisible \x00 characters; otherwise they end up in the saved labels
            label = label.replace('\x00', '')
            print('after merge:', label)

            # Position and size of this line
            pos_size = np.fromfile(f, dtype='uint8', count=16)
            y = sum([j << (i * 8) for i, j in enumerate(pos_size[:4])])
            x = sum([j << (i * 8) for i, j in enumerate(pos_size[4:8])])
            h = sum([j << (i * 8) for i, j in enumerate(pos_size[8:12])])
            w = sum([j << (i * 8) for i, j in enumerate(pos_size[12:])])

            # Read the line image bitmap
            bitmap = np.fromfile(f, dtype='uint8', count=h * w)
            bitmap = bitmap.reshape(h, w)

            # Save the label and the line image
            label_file = os.path.join(label_dir, base_name.replace('.dgrl', '_' + str(k) + '.txt'))
            with open(label_file, 'w') as f1:
                f1.write(label)
            bitmap_file = os.path.join(image_dir, base_name.replace('.dgrl', '_' + str(k) + '.jpg'))
            print(bitmap_file)
            cv.imwrite(bitmap_file, bitmap)

            pil_img = Image.fromarray(bitmap.astype('uint8')).convert('RGB')
            # display(pil_img)  # for notebook preview

            # Append "image_path label" to the index file
            file.write('{} {}\n'.format(bitmap_file, label.replace(' ', '')))
```
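The repeated `sum([j<<(i*8) for i,j in enumerate(...)])` idiom in the script above decodes little-endian unsigned integers byte by byte. For reference, the standard library expresses the same operation directly (the helper name `le_uint` is illustrative):

```python
def le_uint(data) -> int:
    # Decode an unsigned little-endian integer of arbitrary byte length,
    # equivalent to sum(b << (8 * i) for i, b in enumerate(data)).
    return int.from_bytes(bytes(data), byteorder='little', signed=False)

print(le_uint([0x34, 0x12]))  # -> 4660, i.e. 0x1234
```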
yusirhhh commented 2 years ago

Hello, I would like to add experiments on HWDB and ICDAR2013. While inspecting the dataset, I noticed that the invisible character \0xff is simply 'ignore'd in your parsing code: label = [struct.pack('I', i).decode('gbk', 'ignore')[0] for i in label]. In the original images these usually correspond to abnormal cases (partially visible or crossed-out characters). When handling such data, should these samples be dropped directly, or should a new (abnormal) class be assigned to these characters? I also found that ICDAR2013 contains characters that do not appear in HWDB, and some special symbols cannot be converted to half-width. In addition, I am unsure how many classes the character table for HWDB should contain. I hope there is a unified standard for handling these issues, and I would appreciate your advice.

JingyeChen commented 2 years ago

"In the original images these usually correspond to abnormal cases (partially visible or crossed-out characters). Should such samples be dropped directly, or should a new (abnormal) class be assigned to these characters?" Such data can simply be dropped.

"I also found that ICDAR2013 contains characters that do not appear in HWDB, and some special symbols cannot be converted to half-width. In addition, I am unsure how many classes the character table for HWDB should contain. I hope there is a unified standard for handling these issues." If a character cannot be converted to half-width, leave it unchanged. The character table should be the union of the characters appearing in the training, validation, and test sets. When building the benchmark we used a single alphabet for all four datasets purely for convenience; as long as the alphabet covers every character that appears, the performance of character-based methods will not differ.
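Building the alphabet as that union can be sketched as follows, assuming "image_path label" index files like the ones written by the parsing script earlier in this thread (file names and layout are assumptions):

```python
def build_alphabet(label_files):
    """Collect the union of characters appearing in the given index files."""
    chars = set()
    for path in label_files:
        with open(path, encoding='utf-8') as f:
            for line in f:
                parts = line.rstrip('\n').split(' ', 1)  # "image_path label"
                if len(parts) == 2:
                    chars.update(parts[1])
    return ''.join(sorted(chars))  # one character class per symbol
```

The number of classes is then simply the length of the returned string, plus whatever special tokens (e.g. blank or EOS) the decoder requires.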

I hope these answers resolve your questions.