hanquansanren opened this issue 2 years ago
Thanks for your nice concern, and sorry for the late reply. I am sorry that the OCR environment for DocTr is missing. However, you can follow the setting of our new work DocScanner. Specifically, the version of pytesseract is 0.3.8, and the version of Tesseract is 5.0.1.20220118. We follow the OCR evaluation settings of DewarpNet and DocTr, which use 50 and 60 document images of the DocUNet Benchmark dataset, respectively. The results are shown in Table 2.
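For a quick sanity check of the environment, here is a minimal sketch (my own illustration, assuming pytesseract is installed and the tesseract binary is on the PATH; it only prints the installed versions):

```python
# Minimal environment check (assumption: pytesseract installed, tesseract on PATH).
from importlib.metadata import version
import pytesseract

print("pytesseract:", version("pytesseract"))               # expected: 0.3.8
print("tesseract:", pytesseract.get_tesseract_version())    # expected: 5.0.1.20220118
```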
Besides, I think it is unnecessary to annotate the GT string manually. If a distorted image is perfectly rectified, its recognized string should be consistent with the string recognized from the GT image. Hence, we simply use the recognized string of the GT image as the reference string to calculate ED and CER. We provide our OCR evaluation code for you as follows:
```python
import numpy as np
import pytesseract
from PIL import Image


def Levenshtein_Distance(str1, str2):
    # Dynamic-programming edit distance between two recognized strings.
    matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                d = 0
            else:
                d = 1
            matrix[i][j] = min(matrix[i - 1][j] + 1,
                               matrix[i][j - 1] + 1,
                               matrix[i - 1][j - 1] + d)
    return matrix[len(str1)][len(str2)]


def cal_cer_ed(path_ours, tail='_rec'):
    # Compare the OCR string of each rectified image against the OCR string
    # of the corresponding GT (scanned) image, then report mean CER and ED.
    path_gt = './GT/'
    N = 66
    cer1, cer2 = [], []
    ed1, ed2 = [], []
    check = [0 for _ in range(N + 1)]
    # Image indices used in the DewarpNet OCR evaluation setting.
    lis = [1, 9, 10, 19, 20, 21, 22, 23, 24, 27, 30, 31, 32, 34, 35, 36, 37,
           38, 39, 40, 44, 45, 46, 47, 49]
    for i in range(1, N):
        if i not in lis:
            continue
        gt = Image.open(path_gt + str(i) + '.png')
        img1 = Image.open(path_ours + str(i) + '_1' + tail)
        img2 = Image.open(path_ours + str(i) + '_2' + tail)
        content_gt = pytesseract.image_to_string(gt)
        content1 = pytesseract.image_to_string(img1)
        content2 = pytesseract.image_to_string(img2)
        l1 = Levenshtein_Distance(content_gt, content1)
        l2 = Levenshtein_Distance(content_gt, content2)
        ed1.append(l1)
        ed2.append(l2)
        cer1.append(l1 / len(content_gt))
        cer2.append(l2 / len(content_gt))
        check[i] = cer1[-1]
    print('CER: ', (np.mean(cer1) + np.mean(cer2)) / 2.)
    print('ED: ', (np.mean(ed1) + np.mean(ed2)) / 2.)


def evalu(path_ours, tail):
    cal_cer_ed(path_ours, tail)
```
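As a usage sketch (the directory name and filename tail below are assumptions borrowed from the commands later in this thread, not part of the snippet above):

```python
# Hypothetical call: evaluates rectified images named like '<i>_1 copy_rec.png'
# and '<i>_2 copy_rec.png' under the given folder; adapt to your own naming.
evalu('./Rectified_DocUNet_DocTr/', ' copy_rec.png')
```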
Hope this helps~!
Thanks a lot for your detailed explanation. Based on your code, Tesseract version, and pytesseract version, I have achieved the same CER performance as reported in the paper.
DocScanner is another great work that achieves the best MS-SSIM; I will spend some time following it next.
@hanquansanren Thanks for your feedback.
@fh2019ustc I've installed the corresponding version, but achieved a different ED value (607), while the CER value (0.20) is the same as in Table 2.
Eval dataset: DocUNet; GT: scan images; pred: crop images.
@an1018 Hi, please use the OCR eval code in our repo, in which we have updated the image list used in DewarpNet. Then you can obtain the performance reported above.
@an1018 For more OCR performance of other methods under the two settings (DocTr and DewarpNet), you can refer to DocScanner.
@an1018 Hope to get your reply.
@fh2019ustc Yes, I use OCR_eval.py for evaluation, but there are still some problems:
Q1: Why is the performance different from the performance reported in the DocTr paper?
Q2: Is the performance of DocTr in the following table based on the geometrically rectified results of GeoTr, rather than on the illumination correction of IllTr?
Q3: I still can't get the same performance by using the rectified images from Baidu Cloud:
```
python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'Rectified_DocUNet_DocTr/' --tail ' copy_rec.png'
```
Note: 'docunet/scan/' contains the scan images of DocUNet.
Q4: How can I get the same result without using the rectified images from Baidu Cloud?
```
python inference.py --distorrted_path 'docunet/crop/' --gsave_path './geo_rec' --isave_path './ill_rec/' --ill_rec True
python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'ill_rec/' --tail ' copy_ill.png'
```
@an1018 Note that in the DocUNet Benchmark, the '64_1.png' and '64_2.png' distorted images are rotated by 180 degrees, so they do not match the GT documents. This has been ignored by most existing works, so please check for it before evaluation. We found this dataset error in April this year while preparing the major revision of our PAMI submission DocScanner, whereas our DocTr was accepted in June 2021, so we updated the performance in our repo afterwards. Because such an error has been overlooked by most works in this field, we update the performance of all previous methods in our PAMI submission DocScanner and our ECCV 2022 paper DocGeoNet.
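As an illustration of that check, a minimal sketch (overwriting the files in place is my own assumption for brevity; it is not part of the official evaluation code):

```python
# Rotate the two mismatched DocUNet samples by 180 degrees so that they
# match their GT documents; paths are assumed and should be adapted.
from PIL import Image

for name in ['64_1.png', '64_2.png']:
    Image.open(name).rotate(180).save(name)
```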
@an1018 For your Q2, this performance is based on GeoTr.
@an1018 For Q3 and Q4, to reproduce the above performance, please use the geometrically rectified images rather than the illumination-corrected images.
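For example, a hedged sketch reusing the commands above (it assumes the geometric outputs saved to './geo_rec/' follow the same ' copy_rec.png' naming as the Baidu Cloud images; please adapt the tail if your local inference.py names them differently):

```
python inference.py --distorrted_path 'docunet/crop/' --gsave_path './geo_rec/' --isave_path './ill_rec/' --ill_rec True
python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'geo_rec/' --tail ' copy_rec.png'
```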
@fh2019ustc Thanks for your quick response; I'll try again and give you feedback.
@fh2019ustc Hi, I've installed Tesseract (v5.0.1) from Git and downloaded the eng model. The performance is close to the following, but there are still some differences. What else could be causing them?
CER: 0.1759 ED: 470.33
Here are some of my configurations:
1) Images: GT images: the scan images of DocUNet; pred images: the rectified images from Baidu Cloud in your repo
2) Tesseract version:
3) eng model:
This is the version information for your reference. Besides, what is your performance based on Setting 2?
1) How can I install 5.0.1.20220118, rather than 5.0.1? (My environment is Linux Ubuntu.)
2) The performance based on Setting 2: ED: 733.58, CER: 0.1859
Oh, I can get the same performance in a Windows environment. But for Ubuntu, I can't find Tesseract v5.0.1.20220118.
@an1018 Thanks for your reply. For OCR evaluation, I think you can compare performance within the same environment, whether it is Windows or Ubuntu.
Yes, thanks for your continuous technical support!
Q1: Hello, in Section 5.1 of your paper, I notice you used Pytesseract v3.02.02, as shown in the picture above ↑. But on the pytesseract homepage, I can only find versions 0.3.x or 0.2.x; could you please tell me the exact version you used? By the way, in the DewarpNet paper, they specify pytesseract version 0.2.9. Are there big differences caused by the version of the OCR engine?
Q2: The calculation of the CER metric needs the ground truth of each character in the images. I also notice your repository provides 60 image indices for the OCR metric test, while DewarpNet provides 25 image indices as well as ground truth in JSON form. Can you tell me how you annotated the ground truth? And if possible, can you share your ground-truth file?
In addition, I also noticed that several of the 25 ground truths in DewarpNet contain label errors, so I guess they also used some OCR engine. If you also used an OCR engine to label the ground truth, could you share some more details about how you annotated it?
Q3: In fact, I also tried to test the OCR performance on your model output. However, neither pytesseract version 0.3.x nor 0.2.x achieves the same result as in the paper. Here is my OCR test code:
In brief, the core code for OCR is
h1=pytesseract.image_to_string(Image.open(h1str),lang='eng')
With this, I only get a CER of 0.6, which is far from the 0.2~0.3 CER of previous models. Could you share your OCR version and code for the OCR metric? Many thanks for your generous response!