Open ctrngk opened 4 years ago
Here is the output.json generated by Google Cloud.
Thank you for interesting information about text_detection_pdf.
The JSON format for an OCRed image file differs from the one for an OCRed PDF file, so gcv2hocr fails to convert this JSON to hOCR format.
Sorry, I have no plans to support PDF files, but I know of a shell script that does what you want.
https://github.com/mah-jp/pdf4search
You could write a gcv-pdf-json2hocr converter yourself. It could be a good project, not only for you but also for other people who want to make image-only PDFs searchable.
Looking closely at the two JSON formats, the main difference is "fullTextAnnotation" vs "textAnnotations" under the "responses" entry.
"responses": { "textAnnotations": [
vs
"responses": [ { "fullTextAnnotation": { "pages": [
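To check which of the two shapes a given file has before feeding it to a converter, something like the following could work. It is only a minimal sketch: the helper name `annotation_kinds` is hypothetical, and it accepts "responses" either as an object or as a list because both forms appear in the fragments quoted above.

```python
def annotation_kinds(doc):
    """Return the annotation keys found under "responses", sorted by name."""
    responses = doc.get("responses", [])
    # The quoted image output shows an object, the PDF output shows a list;
    # wrap a lone object so both can be handled the same way.
    if isinstance(responses, dict):
        responses = [responses]
    kinds = set()
    for resp in responses:
        for key in ("fullTextAnnotation", "textAnnotations"):
            if key in resp:
                kinds.add(key)
    return sorted(kinds)
```

It would be called on the parsed file, for example `annotation_kinds(json.load(open("output.json")))` after `import json`.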
From Docs
A fullTextAnnotation is a structured hierarchical response for the UTF-8 text extracted from the image, organized as Pages→Blocks→Paragraphs→Words→Symbols
and
The previous textAnnotations OCR output will continue to be supported, and is available in the JSON Response as textAnnotations.
From this I infer that textAnnotations is the old format while fullTextAnnotation is the new one. What I am looking for is a fullTextAnnotation.json -> *.hocr translation or an alternative visualization method. I will look into it if I have time.
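A rough sketch of what such a translation might start from: walk the Pages→Blocks→Paragraphs→Words→Symbols hierarchy, join the symbols of each word, and emit one hOCR `ocrx_word` span per word. This is not how gcv2hocr does it. The function names are hypothetical, and the sketch assumes each word carries a `boundingBox` with pixel `vertices`; words without vertices are skipped.

```python
from html import escape

def iter_words(full_text_annotation):
    """Yield (text, vertices) for each word in a fullTextAnnotation dict."""
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for para in block.get("paragraphs", []):
                for word in para.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    verts = word.get("boundingBox", {}).get("vertices", [])
                    if verts:
                        yield text, verts

def word_to_hocr(text, verts):
    """Format one word as an hOCR span with an axis-aligned bbox."""
    # A missing "x" or "y" key is treated as 0 (an assumption of this sketch).
    xs = [v.get("x", 0) for v in verts]
    ys = [v.get("y", 0) for v in verts]
    return "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>" % (
        min(xs), min(ys), max(xs), max(ys), escape(text))
```

A full hOCR file would also need the surrounding `ocr_page`, `ocr_par` and `ocr_line` elements, which this sketch leaves out.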
Thank you for your comment.
"responses": { "textAnnotations": [ vs "responses": [ { "fullTextAnnotation": { "pages": [
You could change main.c around line 68 so that it deletes the part before "fullTextAnnotation". In addition, the coordinates of the letters are normalized in the JSON file for an OCRed PDF.
I once tried to use the "fullTextAnnotation" result for Japanese OCR.
With "fullTextAnnotation" for Japanese, it detects each letter rather than each word, which is not useful for Japanese OCR. So I will not commit your change to my gcv2hocr.
You may fork my code to make your own version of gcv2hocr that supports "fullTextAnnotation".
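Since hOCR bounding boxes are absolute coordinates, the normalized values would have to be scaled back first. A minimal sketch, assuming the normalized coordinates are fractions of the page size in the range 0 to 1 and that the target page width and height are known (for example from the rendered page image); the function name `to_absolute` is hypothetical:

```python
def to_absolute(normalized_vertices, page_width, page_height):
    """Scale 0..1 vertex coordinates to absolute integer coordinates."""
    return [
        {
            # A missing "x" or "y" key is treated as 0.0 (an assumption of this sketch).
            "x": round(v.get("x", 0.0) * page_width),
            "y": round(v.get("y", 0.0) * page_height),
        }
        for v in normalized_vertices
    ]
```

The width and height passed in should match whatever image the hOCR file will be paired with, otherwise the boxes will not line up with the text.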
gcloud ml vision detect-text-pdf gs://my_bucket/input_file gs://my_bucket/out_put_prefix
The command above follows text_detection_pdf. I prefer to upload the PDF as a whole. It may give better results than splitting it into hundreds of JPGs.