KBNLresearch / ochre

Toolbox for OCR post-correction
Apache License 2.0

Error in align_output_to_input #5

Open omrishsu opened 6 years ago

omrishsu commented 6 years ago

In utils.py there is a try/except block that tries to align two strings. If an exception is raised, the code continues and uses the variable r, which is only defined inside the try block.

def align_output_to_input(input_str, output_str, empty_char=u'@'):
    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')
    try:
        r = edlib.align(t_input_str, t_output_str, task='path')
    except:
        print(input_str)
        print(output_str)
    r1, r2 = align_characters(input_str, output_str, r.get('cigar'),
                              empty_char=empty_char, sanity_check=False)
    while len(r2) < len(input_str):
        r2.append(empty_char)
    return u''.join(r2)

I don't know whether an exception is acceptable there, but if it is, you can't use r in the next statement. And if it isn't, the try is redundant.
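The control-flow problem described above can be reproduced without edlib. The following is a minimal sketch, not ochre's actual code: `failing_align`, `buggy` and `fixed` are hypothetical names, and `failing_align` is a stand-in that always raises, playing the role of a failing `edlib.align` call. It shows that swallowing the exception leaves `r` unbound, and one possible alternative that re-raises with context instead.

```python
def failing_align(a, b):
    # Hypothetical stand-in for an alignment call that always fails.
    raise AttributeError("'bytes' object has no attribute 'encode'")


def buggy(input_str, output_str, align=failing_align):
    # Same shape as the reported code: the except branch only prints.
    try:
        r = align(input_str, output_str)
    except Exception:
        print(input_str)
        print(output_str)
    # If align() raised, r was never assigned, so this line raises
    # UnboundLocalError and hides the original error.
    return r.get('cigar')


def fixed(input_str, output_str, align=failing_align):
    # One possible alternative: fail loudly and keep the original cause.
    try:
        r = align(input_str, output_str)
    except Exception as exc:
        msg = "could not align {!r} with {!r}".format(input_str, output_str)
        raise ValueError(msg) from exc
    return r.get('cigar')
```

Whether re-raising or returning a fallback value is the better choice depends on how callers use the result; the sketch only illustrates that the except branch has to either stop or define `r`.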

My input and output strings (the prints) are: .rj-f - j r . m, w. 1 - .rj-f - j r . m, w. 1 –

And the exception is: 'bytes' object has no attribute 'encode'

Please advise.

Thanks!

jvdzwaan commented 6 years ago

The problem probably has to do with the fact that edlib expects a string instead of bytes. What version of Python are you using? (edlib works best under Python 3.) Also, ochre assumes that all input files are UTF-8 encoded (this should be added to the documentation).

I thought you were interested in calculating performance, and not so much these experimental features of ochre.

Anyway, thank you for reporting the error. You are right that it should be changed :)

omrishsu commented 6 years ago

I'm using Python 3.6. I fixed it by commenting out the following lines:

    t_output_str = output_str.encode('ASCII', 'replace')
    t_input_str = input_str.encode('ASCII', 'replace')

Well, I found this project interesting and I'm testing all of the functionality :)

jvdzwaan commented 6 years ago

Probably this encoding fix was only necessary for Python 2.7 (which I still use). Thank you!
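For reference, the difference between the two Python versions comes down to what `encode` returns: a byte string that Python 2 treats as an ordinary `str`, but a separate `bytes` type under Python 3. Below is a minimal sketch of a helper that keeps the ASCII replacement step while always returning text. The name `to_ascii_text` is hypothetical and not part of ochre, and whether a given edlib version accepts the result is an assumption that would need checking.

```python
def to_ascii_text(s):
    """Return s as text with every non-ASCII character replaced by '?'."""
    # Assumption: byte input is UTF-8, as ochre expects for its input files.
    if isinstance(s, bytes):
        s = s.decode('utf-8', 'replace')
    # encode() yields bytes; decoding again gives a text string.
    # Each non-ASCII character becomes exactly one '?', so the
    # length of the string is preserved.
    return s.encode('ascii', 'replace').decode('ascii')


print(to_ascii_text(u'1 \u2013'))  # prints "1 ?"
```

Because the replacement is one character for one character, positions in the sanitized string still line up with positions in the original, which matters when alignment results are mapped back onto the original strings.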