fh2019ustc / DocTr

The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM 2021 (Oral).

About the OCR engine you use: three questions need your help #9

Open hanquansanren opened 2 years ago

hanquansanren commented 2 years ago

[image] Q1: Hello, in Section 5.1 of your paper, I notice you used Pytesseract v3.02.02, as shown in the picture above. But on the homepage of pytesseract I can only find versions 0.3.x or 0.2.x; could you please tell me the exact version you used? By the way, the DewarpNet paper specifies pytesseract version 0.2.9. Are there big differences caused by the version of the OCR engine?

Q2: Calculating the CER metric requires the ground truth for each character in the images. I notice your repository provides an index of 60 images for the OCR metric test, while DewarpNet provides an index of 25 images along with the ground truth in JSON form. Can you tell me how you annotated the ground truth? And if possible, could you share your ground-truth file?

In addition, I noticed that the 25 ground truths in DewarpNet contain several label errors, so I guess they were also produced with an OCR engine. If you also used an OCR engine to label the ground truth, could you share more details about how you annotate?

Q3: In fact, I also tried to test the OCR performance on your model's output. However, neither Pytesseract 0.3.x nor 0.2.x reproduces the result in the paper. Here is my OCR test code:

from PIL import Image
import pytesseract

import json
import os
from os.path import join as pjoin
from pathlib import Path
import numpy as np

def edit_distance(str1, str2):
    """Compute the edit distance between two strings.
    Args:
        str1: the first string.
        str2: the second string.
    Returns:
        dist: the edit distance.
    """
    matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                d = 0
            else:
                d = 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + d)
    dist = matrix[len(str1)][len(str2)]
    return dist

def get_cer(src, trg):
    """Character error rate of transforming the source string src into the target string trg.
    Args:
        src: the source string.
        trg: the target string.
    Returns:
        cer: the character error rate.
    """
    dist = edit_distance(src, trg)
    cer = dist / len(trg)
    return cer

if __name__ == "__main__":
    reference_list=[]
    reference_index=[] 
    img_dirList=[] 
    cer_list=[]  
    r_path = pjoin('./doctr/')
    reslut_file = open('result1.log', 'w')
    print(pytesseract.get_languages(config=''))
    with open('ocr_files.txt','r') as fr:   
        for l,line in enumerate(fr):
            reference_list.append(line)
            reference_index.append(l)
            print(len(line),line)
            print(len(line),line,file=reslut_file)
            h1str="./doctr/"+line[7:-1]+"_1 copy.png"
            h2str="./doctr/"+line[7:-1]+"_2 copy.png"
            print(h1str,h2str)
            h1=pytesseract.image_to_string(Image.open(h1str),lang='eng')
            h2=pytesseract.image_to_string(Image.open(h2str),lang='eng')

            with open('tess_gt.json','r') as file:
                str = file.read()
                r = json.loads(str).get(line[:-1])
            cer_value1=get_cer(h1, r)
            cer_value2=get_cer(h2, r)
            print(cer_value1,cer_value2)
            print(cer_value1,cer_value2,file=reslut_file)
            cer_list.append(cer_value1)
            cer_list.append(cer_value2)

    print(np.mean(cer_list)) 
    print(np.mean(cer_list),file=reslut_file)
    reslut_file.close()

In brief, the core OCR call is h1 = pytesseract.image_to_string(Image.open(h1str), lang='eng'), with which I only get a CER of 0.6. This is far from the 0.2–0.3 CER reported for previous models.

Could you share your OCR version and code for the OCR metric? Many thanks for your generous response!

fh2019ustc commented 2 years ago

Thanks for your interest and sorry for the late reply. Unfortunately, the exact OCR environment for DocTr has been lost. However, you can follow the settings of our new work, DocScanner. Specifically, the pytesseract version is 0.3.8 and the Tesseract version is 5.0.1.20220118. We follow the OCR evaluation settings of DewarpNet and DocTr, which use 50 and 60 document images of the DocUNet benchmark dataset, respectively. The results are shown in Table 2.
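
As a quick sanity check of the environment (a minimal sketch, assuming a standard pytesseract install; this snippet is not part of the original evaluation code), the installed versions can be printed like this:

import pytesseract
from importlib.metadata import version

# Tesseract binary version; should report 5.0.1.20220118 to match the setting above.
print(pytesseract.get_tesseract_version())
# pytesseract package version; should report 0.3.8.
print(version('pytesseract'))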

Besides, I think it is unnecessary to annotate the GT string manually. If a distorted image is perfectly rectified, its recognized string should be consistent with the string recognized from the GT image. Hence, we just use the recognized string of the GT image as the reference string to calculate ED and CER. We provide our OCR evaluation code for you as follows:

from PIL import Image
import pytesseract
import numpy as np

def Levenshtein_Distance(str1, str2):
    # Standard dynamic-programming edit distance.
    matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            d = 0 if str1[i - 1] == str2[j - 1] else 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + d)

    return matrix[len(str1)][len(str2)]

def cal_cer_ed(path_ours, tail='_rec'):
    path_gt = './GT/'
    N = 66
    cer1, cer2 = [], []
    ed1, ed2 = [], []
    check = [0 for _ in range(N + 1)]  # per-image CER, kept for inspection
    # Image indices used in the DewarpNet OCR evaluation setting.
    lis = [1, 9, 10, 19, 20, 21, 22, 23, 24, 27, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 44, 45, 46, 47, 49]
    for i in range(1, N):
        if i not in lis:
            continue
        gt = Image.open(path_gt + str(i) + '.png')
        img1 = Image.open(path_ours + str(i) + '_1' + tail)
        img2 = Image.open(path_ours + str(i) + '_2' + tail)
        # The OCR result of the GT image serves as the reference string.
        content_gt = pytesseract.image_to_string(gt)
        content1 = pytesseract.image_to_string(img1)
        content2 = pytesseract.image_to_string(img2)
        l1 = Levenshtein_Distance(content_gt, content1)
        l2 = Levenshtein_Distance(content_gt, content2)
        ed1.append(l1)
        ed2.append(l2)
        cer1.append(l1 / len(content_gt))
        cer2.append(l2 / len(content_gt))
        check[i] = cer1[-1]
    print('CER: ', (np.mean(cer1) + np.mean(cer2)) / 2.)
    print('ED:  ', (np.mean(ed1) + np.mean(ed2)) / 2.)

def evalu(path_ours, tail):
    cal_cer_ed(path_ours, tail)
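
For reference, a hypothetical invocation might look like the following; the result directory is a placeholder, and the ' copy_rec.png' tail matches the filename convention used later in this thread:

# Hypothetical paths: rectified results named like './rectified/1_1 copy_rec.png'
evalu('./rectified/', ' copy_rec.png')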

Hope this helps~!

hanquansanren commented 2 years ago

Thanks a lot for your detailed explanation. Based on your code, Tesseract version, and pytesseract version, I have achieved the same CER performance as in the paper.

DocScanner is another great work that achieves the best MS-SSIM; I will spend some time following it next.

fh2019ustc commented 2 years ago

@hanquansanren Thanks for your feedback.

an1018 commented 1 year ago

@fh2019ustc I've installed the corresponding version, but achieved a different ED value (607), while the CER value (0.20) is the same as in Table 2. [image]

Eval dataset: DocUNet; GT: scan images; pred: crop images.

fh2019ustc commented 1 year ago

@an1018 Hi, please use the OCR eval code in our repo, in which we have updated the image list used in DewarpNet. Then you can obtain the following performance: [image]

fh2019ustc commented 1 year ago

@an1018 For the OCR performance of other methods under the two settings (DocTr and DewarpNet), you can refer to DocScanner.

fh2019ustc commented 1 year ago

@an1018 Hope to get your reply.

an1018 commented 1 year ago

@fh2019ustc Yes, I use OCR_eval.py for evaluation, but there are still some problems:
Q1: Why is the performance different from the performance in the DocTr paper? [image] [image]

Q2: And is the performance of DocTr in the following table based on the geometric rectification results of GeoTr, not on the illumination correction of IllTr? [image]

Q3: I still can't get the same performance using the rectified images from Baidu Cloud:

python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'Rectified_DocUNet_DocTr/' --tail ' copy_rec.png'

Note: 'docunet/scan/' contains the scan images of DocUNet. [image]

Q4: How can I get the same result without using the rectified images from Baidu Cloud?

python inference.py --distorrted_path 'docunet/crop/' --gsave_path './geo_rec' --isave_path './ill_rec/' --ill_rec True
python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'ill_rec/' --tail ' copy_ill.png'

fh2019ustc commented 1 year ago

@an1018 Note that in the DocUNet benchmark, the '64_1.png' and '64_2.png' distorted images are rotated by 180 degrees, so they do not match the GT documents. This is overlooked by most existing works; please check it before evaluation. We found this dataset error in April this year while preparing the major revision of our PAMI submission DocScanner, but DocTr was accepted in June 2021, so we updated the performance in our repo. In our PAMI submission DocScanner and our ECCV 2022 paper DocGeoNet, we updated the performance of all previous methods accordingly.
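
As a minimal sketch of that check (the 'docunet/crop/' path is a placeholder for your own distorted-image folder), the two affected images can be rotated back with PIL before evaluation:

from PIL import Image

# '64_1.png' and '64_2.png' are rotated by 180 degrees in the benchmark; undo it in place.
for name in ('64_1.png', '64_2.png'):
    Image.open('docunet/crop/' + name).rotate(180).save('docunet/crop/' + name)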

fh2019ustc commented 1 year ago

@an1018 For your Q2, this performance is based on GeoTr. [image]

fh2019ustc commented 1 year ago

@an1018 For Q3 and Q4: to reproduce the above performance, please use the geometrically rectified images rather than the illumination-corrected images.

an1018 commented 1 year ago

@fh2019ustc Thanks for your quick response. I'll try again and give you feedback.

an1018 commented 1 year ago

@fh2019ustc Hi, I've installed Tesseract (v5.0.1) from Git and downloaded the eng model. The performance is similar to the following, but there are still some differences. What else could be causing this?

CER: 0.1759 ED: 470.33

[image]

Here are some of my configurations:
1) Images: GT images are the scan images of DocUNet; pred images are from the Baidu Cloud link in your repo. [image]
2) Tesseract version: [image]
3) eng model: [image]

fh2019ustc commented 1 year ago

[image] [image] This is the version information for your reference. Besides, what is your performance based on Setting 2?

an1018 commented 1 year ago

1) How can I install 5.0.1.20220118, not 5.0.1? (My environment is Linux Ubuntu.)
2) The performance based on Setting 2: ED: 733.58, CER: 0.1859

fh2019ustc commented 1 year ago

Hi, this is the link for Windows; our env is Windows. This is the link for Ubuntu, but we have not tried it. Hope to get your reply.

an1018 commented 1 year ago

Oh, I can get the same performance in a Windows environment. But for Ubuntu, I can't find Tesseract v5.0.1.20220118.

fh2019ustc commented 1 year ago

@an1018 Thanks for your reply. For OCR evaluation, I think you can compare performance within the same environment, whether it is Windows or Ubuntu.

an1018 commented 1 year ago

Yes, thanks for your continuous technical support.