dinosauria123 / gcv2hocr

gcv2hocr converts Google Cloud Vision OCR output to hOCR so that a searchable PDF can be made.

trying to upload the image and while generating the hocr format getting this issue #15

Open guptaaman2011 opened 6 years ago

guptaaman2011 commented 6 years ago

python gcv2hocr.py Capture.jpg.json > capture.hocr
Traceback (most recent call last):
  File "gcv2hocr.py", line 146, in <module>
    page = fromResponse(resp, **args.__dict__)
  File "gcv2hocr.py", line 99, in fromResponse
    word.htmlid="word%d%d" % (len(page.content) - 1, len(curline.content))
AttributeError: 'NoneType' object has no attribute 'content'

dinosauria123 commented 6 years ago

Thank you for using gcv2hocr.

Please upload your Capture.jpg.json.
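Before uploading, it may help to look at what the JSON file actually contains. The sketch below is only a rough check, not part of gcv2hocr: `summarize_gcv_response` is a hypothetical helper, and it assumes the file follows the usual Cloud Vision REST layout (a top-level `responses` list whose items may hold `textAnnotations`, `fullTextAnnotation`, or `error`). Whether an empty or failed response is really what triggers the AttributeError is an assumption, not something confirmed in this thread.

```python
import json


def summarize_gcv_response(data):
    """Summarize a Cloud Vision JSON response (hypothetical helper, not part of gcv2hocr)."""
    # Assumption: the REST API wraps results in a "responses" list.
    # A bare response dict without that wrapper is accepted as well.
    responses = data.get("responses", [data])
    first = responses[0] if responses else {}
    return {
        # Error message reported by the API, or None if there is none.
        "error": first.get("error", {}).get("message"),
        # Number of entries in "textAnnotations" (0 means no text was returned).
        "text_annotations": len(first.get("textAnnotations", [])),
        # Whether the structured "fullTextAnnotation" block is present.
        "has_full_text": "fullTextAnnotation" in first,
    }


# Usage sketch (file name taken from the report above):
#   with open("Capture.jpg.json", encoding="utf-8") as f:
#       print(summarize_gcv_response(json.load(f)))
```

If the summary shows an error message or zero text annotations, the OCR request itself probably needs another look before the file is converted.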

How to use makepdf.sh

  1. Go to the directory that contains makepdf.sh
  2. Execute " sh ./makepdf.sh "

You have to edit makepdf.sh before running it. The first line of makepdf.sh, "while [ $a -le 32 ]", means that you have page001.jpg to page032.jpg. If you want to convert a different number of JPEGs, change that number. For example, if you have only one JPEG, edit the first line of makepdf.sh to "while [ $a -le 1 ]".
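To find the right number for that first line, the page files can be counted with a few lines of Python. This is a minimal sketch, not part of gcv2hocr: `last_page_number` and `loop_header` are hypothetical helpers, and they assume the files follow the page001.jpg naming described above.

```python
import re
from pathlib import Path

# Assumption: pages are named page001.jpg, page002.jpg, ... as described above.
PAGE_PATTERN = re.compile(r"^page(\d{3})\.jpg$")


def last_page_number(filenames):
    """Return the highest page number among the given file names, or 0 if none match."""
    numbers = []
    for name in filenames:
        match = PAGE_PATTERN.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


def loop_header(filenames):
    """Build the loop line for makepdf.sh from the pages that are present."""
    return "while [ $a -le %d ]" % last_page_number(filenames)


# Usage sketch: list the current directory and print the line to paste into makepdf.sh.
#   print(loop_header(p.name for p in Path(".").iterdir()))
```

Files that do not match the pattern are ignored, so stray JSON or hOCR files in the same directory do not change the result.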

guptaaman2011 commented 6 years ago

Thanks for the quick update. I am new to OCR technology and am just checking its scope. I find it very interesting.

guptaaman2011 commented 6 years ago

Hi dinosauria123, I want to convert the hOCR output to other formats (xls, xml, pdf, docx). Is there any tool or script for that?

dinosauria123 commented 6 years ago

Is this what you want?

https://www.zotero.org/support/dev/translators

dinosauria123 commented 6 years ago

Or this one?

https://hub.docker.com/r/ubma/ocr-fileformat/

guptaaman2011 commented 6 years ago

I don't get it. It doesn't list the hOCR format.

dinosauria123 commented 6 years ago

Do you want to convert images to hOCR?

You can use Tesseract OCR.

https://github.com/tesseract-ocr/tesseract

dinosauria123 commented 6 years ago

Check here.

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#hocr-output

guptaaman2011 commented 6 years ago

No, I already have the hOCR output, and I see that I can convert it to PDF. The challenge now is that I want to convert this hOCR to other formats such as xml, txt, docx, and xls.

dinosauria123 commented 6 years ago

I think you have to use multiple tools. For example, hOCR to PDF is possible with hocr-tools. https://github.com/tmbdev/hocr-tools#hocr-pdf

Once you have a PDF, there are many tools that convert it to other formats.
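For plain text, a dedicated tool may not even be needed, because hOCR is HTML. The following is a minimal sketch using only the Python standard library. It is not part of gcv2hocr or hocr-tools: `HocrTextExtractor` and `hocr_to_text` are hypothetical names, and the code assumes the file marks lines with class `ocr_line` and words with class `ocrx_word`, as the hOCR convention describes, with no further spans nested inside a word.

```python
from html.parser import HTMLParser


class HocrTextExtractor(HTMLParser):
    """Collect word text per line from hOCR markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one list of words per ocr_line element
        self._in_word = False  # True while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            if not self.lines:
                # Word found before any line element: start an implicit line.
                self.lines.append([])
            self.lines[-1].append("")
            self._in_word = True

    def handle_endtag(self, tag):
        # Assumption: words are <span> elements without nested spans.
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self.lines[-1][-1] += data


def hocr_to_text(hocr):
    """Return the recognized text, one output line per ocr_line element."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    rows = (" ".join(w.strip() for w in line if w.strip()) for line in parser.lines)
    return "\n".join(rows)


# Usage sketch (file name taken from the report above):
#   with open("capture.hocr", encoding="utf-8") as f:
#       print(hocr_to_text(f.read()))
```

The same per-line word lists could be written out as CSV rows as a starting point for a spreadsheet, although that would not reconstruct table columns.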

guptaaman2011 commented 6 years ago

Yes, I was trying that, but when I tried to convert the recognized PDF to Excel format with an online tool, it said it can't detect the file and did not convert it to xls, so I am stuck here.

dinosauria123 commented 6 years ago

Do you know ALTO? https://en.wikipedia.org/wiki/ALTO_(XML)

If you want to work with OCR formats, ALTO is better than hOCR.

https://github.com/altoxml/documentation/wiki/Software

guptaaman2011 commented 6 years ago

Dear User,

Your file "scanned.pdf" contains scanned or image textual data. Converting this PDF requires OCR to successfully complete the conversion and retrieve the text. This feature is exclusively available to our Cometdocs Premium Users. Learn more about how to become a premium user here: http://www.cometdocs.com/user/subscriptions

Best Regards, Cometdocs Team.

I got this message, FYI.

On Fri, Mar 9, 2018 at 5:53 AM, dinosauria123 notifications@github.com wrote:

An easier way: Google Drive can convert PDF files to Excel files.

https://techtites.com/convert-pdf-google-drive/

dinosauria123 commented 6 years ago

I have never used this, but I think it is what you want: https://github.com/tabulapdf/tabula-extractor

http://tabula.technology/

I think this topic is not related to gcv2hocr. May I close this issue?