gcv2hocr does not support scanned pdf

dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.


gcv2hocr does not support scanned pdf #32

Open ctrngk opened 4 years ago

ctrngk commented 4 years ago

Steps to reproduce:

1. Save sample/jpn/jptest2.jpg as jptest2.pdf.
2. Upload it to Google Cloud Storage.
3. Generate output.json with gcloud ml vision detect-text-pdf gs://my_bucket/input_file gs://my_bucket/out_put_prefix, following the text_detection_pdf documentation.
4. Download output.json.
5. Run gcv2hocr output.json output.hocr.

The last step does not work.

I would prefer to upload the PDF as a whole. That might give better results than splitting it into hundreds of JPGs.

ctrngk commented 4 years ago

Here is the output.json generated by Google Cloud.

dinosauria123 commented 4 years ago

Thank you for the interesting information about text_detection_pdf.

The JSON format for an OCRed PDF file differs from the format for an OCRed image file, so gcv2hocr fails to convert that JSON to hOCR.

Sorry, I have no plans to support PDF files, but I know of a shell script that does what you want.

https://github.com/mah-jp/pdf4search

You could write a gcv-pdf-json2hocr converter yourself. It could be a good project not only for you, but also for other people who want to make image-only PDFs searchable.

ctrngk commented 4 years ago

Looking closely at the two JSON formats, the main difference is "fullTextAnnotation" vs "textAnnotations" under the "responses" entry:

"responses": { "textAnnotations": [
vs
"responses": [ { "fullTextAnnotation": { "pages": [

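A quick way to check which of the two keys a given output.json contains is sketched below. This is only an illustration: annotation_keys is a hypothetical helper, not part of gcv2hocr, and the two sample documents are hand-written stand-ins for the snippets above.

```python
import json

KNOWN_KEYS = ("fullTextAnnotation", "textAnnotations")

def annotation_keys(doc):
    """Return the annotation keys found under "responses", sorted by name."""
    responses = doc.get("responses", [])
    # The snippets above show "responses" both as an object and as a list,
    # so accept either shape (an assumption made for this sketch).
    if isinstance(responses, dict):
        responses = [responses]
    found = set()
    for response in responses:
        for key in KNOWN_KEYS:
            if key in response:
                found.add(key)
    return sorted(found)

# Hand-written stand-ins, not real API output.
image_style = json.loads('{"responses": {"textAnnotations": [{"description": "abc"}]}}')
pdf_style = json.loads('{"responses": [{"fullTextAnnotation": {"pages": []}}]}')

print(annotation_keys(image_style))
print(annotation_keys(pdf_style))
```
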
From Docs

A fullTextAnnotation is a structured hierarchical response for the UTF-8 text extracted from the image, organized as Pages→Blocks→Paragraphs→Words→Symbols

and

The previous textAnnotations OCR output will continue to be supported, and is available in the JSON Response as textAnnotations.
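The Pages→Blocks→Paragraphs→Words→Symbols hierarchy quoted above can be walked with a few nested loops. A minimal sketch, assuming the field names pages, blocks, paragraphs, words, symbols, text and boundingBox; iter_words is a hypothetical helper and the sample dict is hand-written, not real API output:

```python
def iter_words(full_text_annotation):
    """Yield (text, bounding_box) for every word in a fullTextAnnotation dict."""
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(
                        symbol.get("text", "") for symbol in word.get("symbols", [])
                    )
                    yield text, word.get("boundingBox", {})

# Hand-written sample with one page, one block, one paragraph, one word.
sample = {
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [{
                    "boundingBox": {"vertices": [{"x": 1, "y": 2}]},
                    "symbols": [{"text": "O"}, {"text": "C"}, {"text": "R"}],
                }]
            }]
        }]
    }]
}

for text, box in iter_words(sample):
    print(text, box)
```
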

From this I infer that textAnnotations is the old format and fullTextAnnotation is the new one. What I am looking for is a fullTextAnnotation.json -> *.hocr conversion, or an alternative way to visualize the result. I will look into it if I have time.
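To give an idea of the output side of such a conversion, here is a rough sketch that writes word boxes as a minimal hOCR page. It is not how gcv2hocr lays out its output: words_to_hocr is a hypothetical helper, the input is assumed to be already reduced to (text, (x0, y0, x1, y1)) tuples in pixels, and real hOCR files usually also nest ocr_carea, ocr_par and ocr_line elements.

```python
from html import escape

def words_to_hocr(words, page_width, page_height):
    """Build a minimal hOCR page from (text, (x0, y0, x1, y1)) tuples."""
    spans = []
    for index, (text, (x0, y0, x1, y1)) in enumerate(words, start=1):
        spans.append(
            f"<span class='ocrx_word' id='word_{index}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
        )
    body = "\n".join(spans)
    return (
        "<html><body>\n"
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>\n"
        f"{body}\n"
        "</div>\n"
        "</body></html>"
    )

print(words_to_hocr([("A&B", (1, 2, 3, 4))], 100, 50))
```
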

dinosauria123 commented 4 years ago

Thank you for your comment.

"responses": { "textAnnotations": [
vs
"responses": [ { "fullTextAnnotation": { "pages": [

You could change main.c around line 68 to delete the part before "fullTextAnnotation". Note also that the letter coordinates are normalized in the JSON file for an OCRed PDF.

I once tried to use the "fullTextAnnotation" result for Japanese OCR.
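Because hOCR bounding boxes are given in pixels, normalized coordinates have to be scaled by the page size first. A minimal sketch, assuming the box carries either a "normalizedVertices" list (values between 0 and 1) or a "vertices" list (absolute values); bbox_in_pixels is a hypothetical helper and the caller has to supply the page width and height:

```python
def bbox_in_pixels(bounding_box, page_width, page_height):
    """Return (x0, y0, x1, y1) in pixels, or None if the box has no vertices."""
    if "normalizedVertices" in bounding_box:
        # Normalized values are fractions of the page size.
        points = [
            (v.get("x", 0.0) * page_width, v.get("y", 0.0) * page_height)
            for v in bounding_box["normalizedVertices"]
        ]
    else:
        points = [
            (v.get("x", 0), v.get("y", 0))
            for v in bounding_box.get("vertices", [])
        ]
    if not points:
        return None
    # A missing "x" or "y" is treated as 0 above; this is an assumption
    # about how zero-valued coordinates are serialized.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))

normalized_box = {
    "normalizedVertices": [
        {"x": 0.25, "y": 0.5},
        {"x": 0.75, "y": 0.5},
        {"x": 0.75, "y": 0.75},
        {"x": 0.25, "y": 0.75},
    ]
}
print(bbox_in_pixels(normalized_box, 1000, 400))
```
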

https://github.com/dinosauria123/gcv2hocr/commit/db6830fbff4247e35b39869c2a33b32bd4913d22#diff-2045016cb90d1e65d71c2407a2570927

For Japanese, "fullTextAnnotation" detects each letter rather than each word, which is not useful for Japanese OCR. So I will not merge this change into gcv2hocr.

You are welcome to fork my code and add "fullTextAnnotation" support in your own version of gcv2hocr.