Resulting PDF is invalid (on M1 Macbook)

UB-Mannheim / zotero-ocr

Zotero Plugin for OCR

GNU Affero General Public License v3.0

Resulting PDF is invalid (on M1 Macbook) #27

Closed AndrewRRM closed 2 years ago

AndrewRRM commented 3 years ago

Not sure if this is the right place for a help request so feel free to move or delete this post. I've already posted over at the Zotero forum.

I cannot get this to work.

I've installed tesseract and poppler with Homebrew, installed the Zotero plugin, and set the paths in the plugin to:

/opt/homebrew/Cellar/tesseract/4.1.1
/opt/homebrew/Cellar/poppler/21.03.0_1/bin

I can confirm that this is where those files live.

I've also copied pdftoppm to /Applications/Zotero.app/Contents/MacOS/pdftoppm, following a recommendation on the Zotero forum, although I have tried running the OCR both with and without this step.

When I run the plugin, an OCR file appears, but when I try to open it I get the following error:

Format Error: Not a PDF or corrupted.

PDF.js v2.8.146 (build: 7dd64325d) Message: Invalid PDF structure.

Help? ...

zuphilip commented 3 years ago

Can you save the extracted text in a text note?

It sounds like the steps are running through and then only the resulting PDF is invalid. Does this happen with every PDF you try? If yes, can you share the resulting PDF and the one you started with here as an example?
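A quick way to tell whether the output file is a PDF at all, before sharing it, is to look at its first and last bytes. The following is only a minimal sketch: `looks_like_pdf` is a hypothetical helper, not part of the plugin, and it checks just the `%PDF-` header and the `%%EOF` trailer marker, so a file that passes can still be damaged in other ways.

```python
from pathlib import Path


def looks_like_pdf(path, tail_bytes=1024):
    """Rough sanity check: PDF header at the start, %%EOF marker near the end."""
    data = Path(path).read_bytes()
    # A well-formed PDF begins with a header such as "%PDF-1.4".
    has_header = data.startswith(b"%PDF-")
    # The trailer of a complete PDF ends with an "%%EOF" marker.
    has_eof = b"%%EOF" in data[-tail_bytes:]
    return has_header and has_eof
```

If this returns False for the OCR output, the file was probably never written completely, which would point at the conversion step rather than at the PDF viewer.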

AndrewRRM commented 3 years ago

Sorry for the slow reply. Actually, I don't get the error anymore. In fact, nothing happens at all when I try to run OCR now, no matter which PDF I try.

otheivan commented 3 years ago

Hi,

I have the same or a similar problem with Zotero on Windows 10.

When I run OCR, a command prompt for pdftoppm pops up and just sits there without doing anything. I even left it for about 30 minutes, but nothing happened (it must be doing something, since the CPU seems to react, but I don't know what). When I close pdftoppm, a command prompt for tesseract pops up and seems to do its thing; however, the PDF that it saves is corrupt. I tried a few PDF readers just to be sure.

Also worth noting: if I turn on "Save output as a note", it does seem to work, but not every time.

Running Zotero 5.0.96.2 with the latest Zotero OCR, tesseract, and poppler for Windows 21.03.0.

Let me know if you need anything else.
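To narrow down whether pdftoppm itself hangs, it can help to run it outside Zotero with a time limit. The sketch below is an assumption-laden illustration, not what the plugin does internally: `run_with_timeout` and `pdftoppm_command` are hypothetical helpers, and the `-png` and `-r` options (output format and resolution in DPI) are chosen here for the test, not taken from the plugin.

```python
import subprocess


def pdftoppm_command(pdftoppm, pdf, prefix, dpi=300):
    """Build a pdftoppm call that renders each page to <prefix>-N.png."""
    # Flags are an assumption for this test, not the plugin's own settings.
    return [pdftoppm, "-png", "-r", str(dpi), pdf, prefix]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (exit_code, stderr), or (None, reason) on timeout."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return None, f"timed out after {timeout_s} s"
    return proc.returncode, proc.stderr
```

For example, `run_with_timeout(pdftoppm_command(r"C:\path\to\pdftoppm.exe", "input.pdf", "page"), 120)` (both paths are placeholders) either returns an exit code with any error text, or reports a timeout if the conversion stalls on its own.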

stweil commented 3 years ago

The "path" settings require the full path of each executable, not just the directory that contains it. These settings work for me on macOS with M1:

[Screenshot, 2021-06-06 at 16:36:59]

I'd use different paths, though, because the settings above would have to be updated after each upgrade of tesseract or poppler. In addition, I'd replace the default script with script/Latin, which covers all Western European scripts:

[Screenshot, 2021-06-06 at 16:46:45]
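The difference between a directory and a full executable path can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: `resolve_executable` is a hypothetical helper, not the plugin's code, and on Windows the `name` argument would need the `.exe` suffix.

```python
import os
import shutil


def resolve_executable(setting, name):
    """Return the full path of an executable for a path setting, or None.

    `setting` may be a full path, a directory, or empty (then PATH is searched).
    """
    if not setting:
        candidate = shutil.which(name)  # search PATH for the program name
    elif os.path.isdir(setting):
        # A directory alone is not enough: append the executable name.
        candidate = os.path.join(setting, name)
    else:
        candidate = setting
    if candidate and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Called with a directory setting and a name such as `"pdftoppm"`, it returns the full path that belongs in the settings field, or None if nothing executable is found there. On an Apple Silicon Homebrew install, `/opt/homebrew/bin` normally holds symlinks to the installed programs that stay the same across upgrades, which is one way to avoid version-specific Cellar paths.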

otheivan commented 3 years ago

Those are basically the settings I use, except for script/Latin (and I'm on Windows 10).

The thing is, everything starts up, but it doesn't work. The resulting PDF always comes out corrupt and unreadable.

zuphilip commented 2 years ago

I am closing this, since the initial question by @AndrewRRM seems to be answered by @stweil.

@otheivan If you still have any issues, please open another issue about it. I think it is easier to separate the different operating systems in these issues.