dinosauria123 / gcv2hocr

gcv2hocr converts Google Cloud Vision OCR output to hOCR to make a searchable PDF.

Ported to Python #3

Closed. kba closed this 7 years ago

kba commented 7 years ago

It's okay speed-wise and more flexible to adapt. It has some heuristics for guessing lines and the page area, and English PDFs render well.

Not yet ready to merge, but open for discussion.
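For readers following the discussion, here is a minimal sketch of the kind of step such a port needs: turning Google Cloud Vision `boundingPoly` vertices into hOCR `bbox` values and guessing a page area from the word boxes. It is not taken from the PR. The helper names (`vertices_to_bbox`, `union_bbox`, `hocr_word`) and the sample annotations are hypothetical, and using the union of all word boxes as the page area is only one plausible heuristic.

```python
from html import escape


def vertices_to_bbox(vertices):
    """Convert a GCV boundingPoly vertex list to (x0, y0, x1, y1)."""
    # Assumption: a coordinate may be missing when its value is 0,
    # so missing keys default to 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def union_bbox(boxes):
    """Guess the page area as the smallest box enclosing every word box."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return min(x0s), min(y0s), max(x1s), max(y1s)


def hocr_word(text, bbox):
    """Render one word as an hOCR ocrx_word span."""
    x0, y0, x1, y1 = bbox
    return (
        f"<span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}'>"
        f"{escape(text)}</span>"
    )


# Made-up sample data shaped like GCV text annotations.
annotations = [
    {"description": "OLD", "boundingPoly": {"vertices": [
        {"x": 10, "y": 20}, {"x": 60, "y": 20},
        {"x": 60, "y": 40}, {"x": 10, "y": 40}]}},
    {"description": "CAMERA", "boundingPoly": {"vertices": [
        {"x": 70, "y": 22}, {"x": 150, "y": 22},
        {"x": 150, "y": 42}, {"x": 70, "y": 42}]}},
]

word_boxes = [vertices_to_bbox(a["boundingPoly"]["vertices"]) for a in annotations]
page_box = union_bbox(word_boxes)
words_html = [hocr_word(a["description"], b) for a, b in zip(annotations, word_boxes)]
```

A real converter would also need to group words into lines, which this sketch leaves out.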

dinosauria123 commented 7 years ago

It is good to see the concept of the program being developed further.

I am not good at coding and have no experience in Python.

The license of gcv2hocr is CC, so you can use it for any purpose. I hope gcv2hocr will help in making new Python scripts.

dinosauria123 commented 7 years ago

I want to explain why I made gcv2hocr.

I am an old-camera lover and want to make a searchable database of old (WW2-era) Japanese camera advertisements.

Text extraction is difficult for old advertisements, but GCV gave good results (not perfect).

I hope the new script will support my purpose.

The attached PDF is an example of an old Japanese camera ad; it is partially searchable.

jptest-2.pdf

kba commented 7 years ago

> I am not good at coding

I disagree: in a few days you've written a JSON parser in C that produces pretty flexible hOCR, which is not something I could do. :bow:

> I am an old-camera lover and want to make a searchable database of old (WW2-era) Japanese camera advertisements.

That's a cool project! The example ad you uploaded is relatively clearly structured. Open-source engines like Tesseract should be able to handle that kind of layout, but I have no experience with CJK writing and OCR. Anyway, glad to see you're getting results.

dinosauria123 commented 7 years ago

I tried your gcv2hocr.py. It looks good: the Python version produces better output in a web browser than my C version.

Interestingly, when comparing the output of hocr-pdf, Japanese vertical text conversion is better with the C code than with the Python code. With the Python code, searched text is highlighted at the bottom of the line.

Thank you again for your contributions.

jp_vert.hocr (C output).txt
jp_vert.hocr (python output).txt
jp_vert(from C output).pdf
jp_vert (from python output).pdf

kba commented 7 years ago

It's a bug in hocr-pdf: it expects ocr_line to be a span, but it's a div in my code. Can you try the hocr-pdf from this PR: https://github.com/UB-Mannheim/hocr-tools/pull/19
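To illustrate the span-versus-div mismatch described above, here is a small sketch that selects `ocr_line` elements by class rather than by tag. It uses the standard library's `xml.etree.ElementTree` in place of whatever parser hocr-pdf actually uses. The hOCR snippet and the function names are made up for the example, so this shows the idea and is not the fix in the linked PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment: one line emitted as <span>, one as <div>.
HOCR_SNIPPET = """<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 10 10 300 40'>
    <span class='ocrx_word' title='bbox 10 10 100 40'>span</span>
  </span>
  <div class='ocr_line' title='bbox 10 50 300 80'>
    <span class='ocrx_word' title='bbox 10 50 100 80'>div</span>
  </div>
</div>"""


def find_lines_by_tag(root):
    # Tag-specific lookup: silently misses lines emitted as <div>.
    return root.findall(".//span[@class='ocr_line']")


def find_lines_by_class(root):
    # Tag-agnostic lookup: any element whose class is exactly 'ocr_line'.
    # (A class attribute holding several classes would need extra handling.)
    return root.findall(".//*[@class='ocr_line']")


root = ET.fromstring(HOCR_SNIPPET)
span_only = find_lines_by_tag(root)
any_tag = find_lines_by_class(root)
```

Matching on the class keeps a consumer independent of which element a producer chooses for lines.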

dinosauria123 commented 7 years ago

I tried UB-Mannheim/hocr-tools#19. This version fails to produce a PDF from the hOCR file generated by the C version of gcv2hocr (hOCR output from the Python version works fine).

The original hocr-pdf works well with the hOCR file generated by the C version of gcv2hocr.

Traceback (most recent call last):
  File "hocr-pdf", line 138, in <module>
    export_pdf(sys.argv[1], 300)
  File "hocr-pdf", line 64, in export_pdf
    add_text_layer(pdf, image, height, dpi)
  File "hocr-pdf", line 83, in add_text_layer
    rawtext = word.xpath('./text()')[0]
IndexError: list index out of range

jp_vert (C output).hocr.txt
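The traceback shows `[0]` being applied to an empty XPath result. One plausible cause, stated here as an assumption because the attached hOCR has not been inspected, is a word element with no direct text node: for example, its text is wrapped in a child element, or the element is empty. The sketch below shows a defensive way to read word text. It uses the standard library's `xml.etree.ElementTree` instead of the XPath call from the traceback, and `word_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def word_text(word):
    """Return all text inside the element, or '' if it has none."""
    # itertext() walks nested children too, so wrapped text is still found,
    # and joining an empty sequence yields '' instead of raising IndexError.
    return "".join(word.itertext()).strip()


# Hypothetical line with a plain word, a wrapped word, and an empty word.
line = ET.fromstring(
    "<span class='ocr_line'>"
    "<span class='ocrx_word'>plain</span>"
    "<span class='ocrx_word'><strong>nested</strong></span>"
    "<span class='ocrx_word'></span>"
    "</span>"
)
texts = [word_text(w) for w in line]
```

A caller can then skip words whose text is empty rather than failing on the whole page.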

kba commented 7 years ago

Hah, sorry, things are changing quickly at the moment. The most up-to-date version of hocr-pdf is currently https://github.com/kba/hocr-tools/blob/pdf-xml-parsing/hocr-pdf and this will probably land in tmbdev/hocr-tools tomorrow.